[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2024-01-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804595#comment-17804595
 ] 

ASF subversion and git services commented on NIFI-9464:
---

Commit d9a5a013712d74c3ab30d3ad75d0950a557fb233 in nifi's branch 
refs/heads/support/nifi-1.x from tpalfy
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=d9a5a01371 ]

NIFI-9464 Fixed race condition between "Timer-Driven" threads when running 
SiteToSiteProvenanceReportingTask.onTrigger and "Compress Provenance Log" 
threads running EventFileCompressos.run that can cause the 
SiteToSiteProvenanceReportingTask.onTrigger to pair an already compressed 
.prov.gz file with a .toc file that corresponds to the uncompressed .prov file. 
(#8157)

Co-authored-by: Tamas Palfy 

> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
> Fix For: 1.25.0, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW it's interesting why thera ere different chmods
> My config for provenance (BTW if you see posibbility for tune it, please tell 
> me):
> {code:java}
> nifi.provenance.repository.directory.default=/../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, 
> ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=10
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2024-01-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804442#comment-17804442
 ] 

ASF subversion and git services commented on NIFI-9464:
---

Commit 1c26b39fcdae58e9e49a52e694a7f39fd8ad17b1 in nifi's branch 
refs/heads/main from tpalfy
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=1c26b39fcd ]

NIFI-9464 Fixed race condition between "Timer-Driven" threads when running 
SiteToSiteProvenanceReportingTask.onTrigger and "Compress Provenance Log" 
threads running EventFileCompressos.run that can cause the 
SiteToSiteProvenanceReportingTask.onTrigger to pair an already compressed 
.prov.gz file with a .toc file that corresponds to the uncompressed .prov file. 
(#8157)

Co-authored-by: Tamas Palfy 

> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
> Fix For: 1.25.0, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW it's interesting why thera ere different chmods
> My config for provenance (BTW if you see posibbility for tune it, please tell 
> me):
> {code:java}
> nifi.provenance.repository.directory.default=/../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, 
> ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=10
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2024-01-08 Thread Mark Payne (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804441#comment-17804441
 ] 

Mark Payne commented on NIFI-9464:
--

[~tpalfy] I got you. Makes sense. I did a quick look over this again to make 
sure that I fully understand what's happening here. It looks like this was 
actually designed to work as you've proposed in the PR. But when the Encrypted 
Prov Repo was introduced, the base class's init() method was changed to start 
creating its own `EventFileManager`. As a result, the base class has a 
different instance than the concrete class is using. So this change fixes that 
to ensure that both the base class and the concrete class are sharing the same 
instance.  Makes perfect sense. Great catch! Thanks for running that down and 
fixing. I'm a +1 will merge.

> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
> Fix For: 1.25.0, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW it's interesting why thera ere different chmods
> My config for provenance (BTW if you see posibbility for tune it, please tell 
> me):
> {code:java}
> nifi.provenance.repository.directory.default=/../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, 
> ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=10
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2024-01-08 Thread Tamas Palfy (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804425#comment-17804425
 ] 

Tamas Palfy commented on NIFI-9464:
---

Hi [~markap14]
I managed to kind of reproduce the issue by carefully debugging in a very 
particular way. Simply adding a delay before renaming the .toc.tmp doesn't work 
because it slows down the compress thread so much that it cannot keep up with 
the timer-driven thread to cause any race condition.
I could see that the locking didn't work properly due to the different 
EventManager objects.

Based on that do you think we can merge this change?


> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
> Fix For: 1.25.0, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW it's interesting why thera ere different chmods
> My config for provenance (BTW if you see posibbility for tune it, please tell 
> me):
> {code:java}
> nifi.provenance.repository.directory.default=/../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, 
> ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=10
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2023-12-14 Thread Mark Payne (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796766#comment-17796766
 ] 

Mark Payne commented on NIFI-9464:
--

Thanks [~tpalfy] - your analysis seems reasonable. As you said, it's difficult 
to confirm. One thing that I would recommend in order to confirm (although 
you'd not want to leave this in the codebase) would be to temporarily add a 
`Thread.sleep` into the code in between the time that the new .toc.tmp file 
completes its writing and the time that it's renamed to .toc. If you were to 
add a sleep of say 30 seconds or 1 minute, it would be easy to confirm that the 
threading issue is present as described and also that this change addresses the 
concern.

> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW it's interesting why thera ere different chmods
> My config for provenance (BTW if you see posibbility for tune it, please tell 
> me):
> {code:java}
> nifi.provenance.repository.directory.default=/../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, 
> ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=10
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

2023-12-13 Thread Tamas Palfy (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796442#comment-17796442
 ] 

Tamas Palfy commented on NIFI-9464:
---

[~markap14] I think I found the root cause of this issue but couldn't confirm 
it 100% due to the complexity of the provenance repository bundle and because 
it's very hard to reproduce the issue. Maybe you can help with it and review my 
pull request?

The provenance journal files are regularly compressed on dedicated threads. 
When it is done the content of the uncompressed .prov file is compressed and 
written into a .prov.gz file, after which the .prov file gets deleted. Also, 
the corresponding .toc file is updated but in a different way. The new .toc 
file is named as .toc.tmp and after the .prov file is deleted the .toc.tmp is 
renamed to .toc.

The SiteToSiteProvenanceReportingTask of course runs on a Timer-Driven thread. 
It is important that when it references a compressed .prov.gz it also properly 
references the new .toc file.

However a race condition can occur and it's possible for the Time-Driven thread 
to run when the .prov file is deleted but the .toc.tmp file is not yet renamed 
to .toc. The original .toc file still exists though so it is paired with the 
.prov.gz. That can lead to the reported error.

This wasn't always the case. Both places use an EventFileManager to acquire 
locks properly.
However [PR|https://github.com/apache/nifi/pull/1686] for NIFI-3594 introduced 
a change in WriteAheadProvenanceRepository after which the 2 places started 
using a different EventFileManager object.

This is the important part:
{code:java}
@Override
public synchronized void initialize(final EventReporter eventReporter, 
final Authorizer authorizer, final ProvenanceAuthorizableFactory 
resourceFactory, final IdentifierLookup idLookup) throws IOException {

final EventFileManager fileManager = new EventFileManager();
final RecordReaderFactory recordReaderFactory = (file, logs, maxChars) 
-> {
fileManager.obtainReadLock(file);
try {
return RecordReaders.newRecordReader(file, logs, maxChars);
} finally {
fileManager.releaseReadLock(file);
}
};

   init(recordWriterFactory, recordReaderFactory, eventReporter, 
authorizer, resourceFactory);
}

synchronized void init(RecordWriterFactory recordWriterFactory, 
RecordReaderFactory recordReaderFactory,
   final EventReporter eventReporter, final Authorizer 
authorizer,
   final ProvenanceAuthorizableFactory resourceFactory) 
throws IOException {
final EventFileManager fileManager = new EventFileManager();
...
}
{code}

My assessment is that this change (that the EventFileManager instances should 
be different) is unintentional but the aforementioned change is too complex for 
me to be able to tell 100%.

I'm going to open a pull request according to my assessment.

> Provenance Events files corrupted
> -
>
> Key: NIFI-9464
> URL: https://issues.apache.org/jira/browse/NIFI-9464
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 1.11.0, 1.15.0
> Environment: java 11, centos 7, nifi standalone
>Reporter: Wiktor Kubicki
>Assignee: Tamas Palfy
>Priority: Minor
>
> In my logs i found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] 
> Failed to retrieve Provenance Events from repository due to: Attempted to 
> skip to byte offset 9149491 for 1125432890.prov.gz but file does not have 
> that many bytes (TOC 
> Reader=StandardTocReader[file=//provenance_repository/toc/1125432890.toc, 
> compressed=false]): java.io.EOFException: Attempted to skip to byte offset 
> 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC 
> Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, 
> compressed=false])
> {code}
> It is criticaly important for me to have 100% sure of my logs. It happened 
> about 100 times in last 1 year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16