[jira] [Created] (NIFI-5785) LogAttribute processor overrides user input on attributes to log

2018-11-03 Thread Randy Thomasson (JIRA)
Randy Thomasson created NIFI-5785:
--------------------------------------

 Summary: LogAttribute processor overrides user input on attributes 
to log
 Key: NIFI-5785
 URL: https://issues.apache.org/jira/browse/NIFI-5785
 Project: Apache NiFi
  Issue Type: Bug
  Components: Core Framework
Affects Versions: 1.8.0
Reporter: Randy Thomasson


If a user wants to log only specific attributes by listing them in the 
"Attributes to Log" field and then deletes the default ".*" from the 
"Attributes to Log by Regular Expression" field, the LogAttribute processor 
ignores the input and logs additional attributes anyway, specifically:
  entryDate
  lineageStartDate
  fileSize

Also, the next time the user returns to configure the processor, the default 
".*" is back in the "Attributes to Log by Regular Expression" field.
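
For context, this symptom is consistent with a property that declares a default 
value: in NiFi, clearing such a field unsets the property, and an unset property 
falls back to its default. A minimal sketch of such a declaration follows; the 
property name and class are illustrative, not LogAttribute's actual source:

{code}
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

public class RegexDefaultSketch {
    // Illustrative only: because ".*" is declared as the default, deleting
    // the value in the UI reverts the property to ".*" on the next load.
    public static final PropertyDescriptor ATTRIBUTES_TO_LOG_REGEX =
            new PropertyDescriptor.Builder()
                    .name("attributes-to-log-regex")
                    .displayName("Attributes to Log by Regular Expression")
                    .required(false)
                    .defaultValue(".*")
                    .addValidator(StandardValidators.REGULAR_EXPRESSION_VALIDATOR)
                    .build();
}
{code}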

 





[jira] [Updated] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread James Wing (JIRA)


 [ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Wing updated NIFI-4715:
--------------------------------------
   Resolution: Fixed
Fix Version/s: 1.9.0
   Status: Resolved  (was: Patch Available)

> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------
>
> Key: NIFI-4715
> URL: https://issues.apache.org/jira/browse/NIFI-4715
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Environment: All
>Reporter: Milan Das
>Assignee: Koji Kawamura
>Priority: Major
> Fix For: 1.9.0
>
> Attachments: List-S3-dup-issue.xml, ListS3_Duplication.xml, 
> screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet, which is not thread safe. When 
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same 
> file from the S3 bucket twice. It initially looked like the HashSet data was 
> getting corrupted:
> currentKeys = new HashSet<>(); // would need a thread-safe set, e.g.
> currentKeys = ConcurrentHashMap.newKeySet();
> *{color:red}+Update+{color}*:
> This is not a HashSet issue. The root cause is a race that occurs when a file 
> is uploaded to S3 while a ListS3 run is in progress:
> in onTrigger, maxTimestamp is initialized to 0L, which causes the keys to be 
> cleared by the code below. When the lastModifiedTime of an S3 object equals 
> currentTimestamp, the listed key should be skipped; but because the key has 
> already been cleared, the same file is loaded again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp 
> instead of 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}
> Update: 01/03/2018
> There is one more flavor of the same defect.
> Suppose file1 is modified at 1514987611000 on S3 and currentTimestamp = 
> 1514987311000 in state.
> 1. file1 will be picked up and the current state will be updated to 
> currentTimestamp=1514987311000 (but the OS system time is 1514987611000).
> 2. On the next cycle, for file2 with lastModified 1514987611000, the keys 
> will be cleared because lastModified > maxTimestamp (= currentTimestamp = 
> 1514987311000). currentTimestamp will be saved as 1514987611000.
> 3. On the next cycle, with currentTimestamp=1514987611000, "file1 modified at 
> 1514987611000" will be picked up again because file1 is no longer in the keys.
> I think the solution is that currentTimestamp needs to persist the current 
> system timestamp.
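
For illustration, a minimal sketch of the reporter's proposed fix in context. 
The loop scaffolding and variable names are assumptions shaped after the 
snippets quoted above, not the committed patch:

{code}
// Seed maxTimestamp from the persisted currentTimestamp instead of 0L, so a
// re-listing cannot clear currentKeys spuriously (scaffolding is illustrative).
long maxTimestamp = currentTimestamp;
for (final S3ObjectSummary summary : listing.getObjectSummaries()) {
    final long lastModified = summary.getLastModified().getTime();
    if (lastModified < currentTimestamp
            || (lastModified == currentTimestamp && currentKeys.contains(summary.getKey()))) {
        continue; // listed in a previous cycle; skip instead of re-emitting
    }
    if (lastModified > maxTimestamp) {
        maxTimestamp = lastModified;
        currentKeys.clear(); // now only clears for genuinely newer objects
        getLogger().debug("clearing keys");
    }
    currentKeys.add(summary.getKey());
    // ... create a FlowFile for this object ...
}
{code}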





[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674228#comment-16674228
 ] 

ASF GitHub Bot commented on NIFI-4715:
--------------------------------------

Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/3116
  
Thanks, @ijokarumawak and @adamlamar!  The combined change looks good to 
me.

I ran it through multiple test loops, putting and listing about 10,000 
objects in S3. No objects were missed. Duplicates were very low (< 100 per 
10,000), coinciding with S3 500 errors ("We encountered an internal error. 
Please try again."). I believe this handles the failure mode well: a few 
duplicates are acceptable for at-least-once processing.





[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674227#comment-16674227
 ] 

ASF GitHub Bot commented on NIFI-4715:
--------------------------------------

Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/2361
  
Thanks, @adamlamar!






[jira] [Commented] (NIFI-5753) Add SSL support to HortonworksSchemaRegistry service

2018-11-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674224#comment-16674224
 ] 

ASF GitHub Bot commented on NIFI-5753:
--------------------------------------

GitHub user grzegorz8 opened a pull request:

https://github.com/apache/nifi/pull/3126

NIFI-5753 Add SSL support to HortonworksSchemaRegistry service

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced 
 in the commit message?

- [ ] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.

- [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

- [ ] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via 
`mvn -Pcontrib-check clean install` at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
- [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?

### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in 
which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/grzegorz8/nifi nifi-5753

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/3126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3126


commit a6a6bc86af5f829b3debb598c013bf29630257d7
Author: Grzegorz Kołakowski 
Date:   2018-11-03T20:30:53Z

NIFI-5753 Add SSL support to HortonworksSchemaRegistry service




> Add SSL support to HortonworksSchemaRegistry service
> --------------------------------------
>
> Key: NIFI-5753
> URL: https://issues.apache.org/jira/browse/NIFI-5753
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Grzegorz Kołakowski
>Priority: Major
>
> Currently the HortonworksSchemaRegistry service does not support communication 
> over HTTPS.
> We should be able to add an SSL context to the service and pass it to the 
> underlying schema registry client.
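
A minimal sketch of the idea, assuming the service exposes NiFi's standard 
SSLContextService and forwards its key/trust store settings to the registry 
client's configuration map. The property name, class, and configuration keys 
below are illustrative; the real keys come from the schema registry client's 
own documentation:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.ssl.SSLContextService;

public class SslForwardingSketch {
    public static final PropertyDescriptor SSL_CONTEXT_SERVICE =
            new PropertyDescriptor.Builder()
                    .name("ssl-context-service")
                    .displayName("SSL Context Service")
                    .identifiesControllerService(SSLContextService.class)
                    .required(false)
                    .build();

    // Copy the key/trust store settings into the client's config map.
    static Map<String, Object> sslClientConfig(final SSLContextService ssl) {
        final Map<String, Object> conf = new HashMap<>();
        conf.put("schema.registry.client.ssl.keyStorePath", ssl.getKeyStoreFile());
        conf.put("schema.registry.client.ssl.keyStorePassword", ssl.getKeyStorePassword());
        conf.put("schema.registry.client.ssl.trustStorePath", ssl.getTrustStoreFile());
        conf.put("schema.registry.client.ssl.trustStorePassword", ssl.getTrustStorePassword());
        return conf;
    }
}
{code}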





[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674222#comment-16674222
 ] 

ASF GitHub Bot commented on NIFI-4715:
--------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/3116




[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674221#comment-16674221
 ] 

ASF GitHub Bot commented on NIFI-4715:
--------------------------------------

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/2361






[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674220#comment-16674220
 ] 

ASF subversion and git services commented on NIFI-4715:
--------------------------------------

Commit 37a0e1b3048b5db067b6485bb437887cb0869888 in nifi's branch 
refs/heads/master from [~ijokarumawak]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=37a0e1b ]

NIFI-4715: Update currentKeys after listing loop

ListS3 used to update currentKeys within the listing loop, which caused
duplicates. Because S3 returns the object list in lexicographic order, if we
clear currentKeys during the loop, we cannot tell whether an object has
already been listed when a newer object has a lexicographically earlier name.

Signed-off-by: James Wing 

This closes #3116, closes #2361.
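
A sketch of the restructure this message describes, with assumed scaffolding 
(variable names and the listing loop are illustrative, not the committed diff): 
track the newest timestamp and the keys seen at it during the loop, and update 
state only after the whole listing has been walked.

{code}
// Assumed scaffolding: listing, currentTimestamp, currentKeys as in the issue.
long maxTimestamp = currentTimestamp;
final Set<String> keysAtMaxTimestamp = new HashSet<>();
for (final S3ObjectSummary summary : listing.getObjectSummaries()) {
    final long lastModified = summary.getLastModified().getTime();
    if (lastModified > maxTimestamp) {
        maxTimestamp = lastModified;
        keysAtMaxTimestamp.clear();
    }
    if (lastModified == maxTimestamp) {
        keysAtMaxTimestamp.add(summary.getKey());
    }
}
// Only now touch the shared state: a newer object with a lexicographically
// earlier key can no longer wipe already-listed keys mid-listing.
currentTimestamp = maxTimestamp;
currentKeys = keysAtMaxTimestamp;
{code}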






[jira] [Commented] (NIFI-4715) ListS3 produces duplicates in frequently updated buckets

2018-11-03 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674219#comment-16674219
 ] 

ASF subversion and git services commented on NIFI-4715:
--------------------------------------

Commit 0a014dcdb13e30084e6378c14f8c8e5568493c33 in nifi's branch 
refs/heads/master from [~adamonduty]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=0a014dc ]

NIFI-4715: ListS3 produces duplicates in frequently updated buckets

Keep totalListCount, reduce unnecessary persistState

This closes #2361.

Signed-off-by: Koji Kawamura 
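
A sketch of the commit subject's idea under assumed names (persistState and 
listedThisCycle are stand-ins, not necessarily the processor's actual members): 
keep the running totalListCount but skip the state write when a cycle listed 
nothing new.

{code}
// Hypothetical shape of "reduce unnecessary persistState":
totalListCount += listedThisCycle;
if (listedThisCycle > 0) {
    persistState(context); // stand-in for the processor's state write
}
// else: skip the write; persisting unchanged state every cycle is wasted I/O
{code}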

