Josef,

Thanks for the info! There are a few things to consider here. Firstly, you said 
that you are using NiFi 1.8.0.
Are you using the new Load Balancing capability? I.e., do you have any 
Connections configured to balance
load across your cluster? And if so, are you load-balancing any 0-byte files? 
If so, then you may be getting
bitten by [1]. That can result in data staying in the Content Repo and not 
getting cleaned up until restart.

The second thing that is important to consider is the interaction between the 
FlowFile Repositories and Content
Repository. At a high level, the Content Repository stores the FlowFiles' 
content/payload. The FlowFile Repository
stores the FlowFiles' attributes, which queue it is in, and some other 
metadata. Once a FlowFile completes its processing
and is no longer part of the flow, we cannot simply delete the content claim 
from the Content Repository. If we did so,
we could have a condition where the node is restarted and the FlowFile 
Repository has not yet been fully flushed to disk
(NiFi may have already written to the file, but the Operating System may be 
caching that without having flushed/"fsync'ed"
to disk). In such a case, we want the transaction to be "rolled back" and 
reprocessed. So, if we deleted the Content Claim
from the Content Repository immediately when it is no longer needed, and then 
restarted, we could have a case where the
FlowFile repo wasn't flushed to disk and as a result points to a Content Claim 
that has been deleted, and this would result
in data loss.

So, to avoid the above scenario, what we do is instead keep track of how many 
"claims" there are for a Content Claim
and then, when the FlowFile repo performs a checkpoint (every 2 minutes by 
default), we go through and delete any Content
Claims that have a claim count of 0. So this means that any Content Claim that 
has been accessed in the past 2 minutes
(or however long the checkpoint time is) will be considered "active" and will 
not be cleaned up.

I hope this helps to explain some of the behavior, but if not, let's please 
investigate further!

Thanks
-Mark



[1] https://issues.apache.org/jira/browse/NIFI-5771


On Jan 4, 2019, at 7:41 AM, <josef.zahn...@swisscom.com> wrote:

Hi Joe

We use NiFi 1.8.0. Yes, we have a separate partition for each repo; you can see 
the partitions below.

[nifi@nifi-12 ~]$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/disk1-root             100G  2.0G   99G   2% /
devtmpfs                           126G     0  126G   0% /dev
tmpfs                              126G     0  126G   0% /dev/shm
tmpfs                              126G  3.1G  123G   3% /run
tmpfs                              126G     0  126G   0% /sys/fs/cgroup
/dev/sda1                         1014M  188M  827M  19% /boot
/dev/mapper/disk1-home              30G   34M   30G   1% /home
/dev/mapper/disk1-var              100G  1.1G   99G   2% /var
/dev/mapper/disk1-opt               50G  5.9G   45G  12% /opt
/dev/mapper/disk1-database_repo   1014M   35M  980M   4% /database_repo
/dev/mapper/disk1-provenance_repo  4.0G   33M  4.0G   1% /provenance_repo
/dev/mapper/disk1-flowfile_repo    530G   34M  530G   1% /flowfile_repo
/dev/mapper/disk2-content_repo     850G   64G  786G   8% /content_repo
tmpfs                               26G     0   26G   0% /run/user/2000


Cheers Josef


From: Joe Witt <joe.w...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 13:29
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up 
content_repo insanely fast by corrupt GZIP files

Josef

Not looping for that proc for sure makes sense.  NiFi dying in the middle of a 
process/transaction is no problem... it will restart the transaction.

But we do need to find out what is filling the repo.  You have flowfile, 
content, and prov in different disk volumes or partitions, right?  What version 
of NiFi?

Let's definitely figure this out.  You should see clean behavior of the repos 
and you should never have to restart.

thanks

On Fri, Jan 4, 2019, 7:16 AM Mike Thomsen <mikerthom...@gmail.com> wrote:
I agree with Pierre's take on the failure relationship. Corrupted compressed 
files are also going to be nearly impossible to recover in most cases, so your 
best bet is to simply log the file name and other relevant attributes and 
establish a process to notify the source system that they sent you corrupt data.
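
As a rough illustration of that pattern, a step on the failure path only ever needs the FlowFile's attributes, never its unreadable content. The processor below is a hypothetical sketch (its name, relationship, and log wording are invented, not an existing NiFi component):

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.flowfile.attributes.CoreAttributes;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    // Hypothetical sketch: log which corrupt file arrived (by its attributes only)
    // and pass it on so a downstream step can notify the source system.
    public class RecordCorruptFileSketch extends AbstractProcessor {

        static final Relationship REL_NOTIFIED = new Relationship.Builder()
                .name("notified")
                .description("FlowFiles whose corruption has been logged")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_NOTIFIED);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }

            // Log the attributes; the content itself is unrecoverable, so we never read it.
            getLogger().warn("Corrupt archive received: filename={}, path={}", new Object[]{
                    flowFile.getAttribute(CoreAttributes.FILENAME.key()),
                    flowFile.getAttribute(CoreAttributes.PATH.key())});

            session.transfer(flowFile, REL_NOTIFIED);
        }
    }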

On Fri, Jan 4, 2019 at 6:48 AM <josef.zahn...@swisscom.com> wrote:
Hi Arpad

I’m doing it (hopefully) gracefully:

  *   /opt/nifi/bin/nifi.sh stop
  *   /opt/nifi/bin/nifi.sh restart


But what I see in our cluster is that it takes more than a few seconds until 
the service is stopped – I have never checked the log during shutdown, so I 
guess I should do that to verify whether it was really graceful or not. Are you 
saying that which queue the file ends up in depends on how NiFi is shut down?


“What still doesn’t make sense to me is why NiFi doesn’t release the 
content_repo disk space after a failure when archiving is disabled?”

It gives you the possibility to roll back failed operations.

--> Sure, but does this mean that we can’t get rid of content routed to a 
“failure” queue until we restart NiFi? And to go further: if we weren’t using 
the VolatileProvenanceRepository, would it not be possible to get rid of it at 
all until we delete the files manually via the command line?

Cheers Josef


From: Arpad Boda <ab...@hortonworks.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 12:14
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up 
content_repo insanely fast by corrupt GZIP files

“but maybe I’m wrong and it goes back to the success queue before the 
“CompressContent” processor?”

How do you shut it down? Some graceful way or just kill it?

“What still doesn’t make sense to me is why NiFi doesn’t release the 
content_repo disk space after a failure when archiving is disabled?”

It gives you the possibility to roll back failed operations.

From: <josef.zahn...@swisscom.com>
Reply-To: <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 12:10
To: <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up 
content_repo insanely fast by corrupt GZIP files

Hi Pierre

Thanks for your feedback. You are right, it doesn’t make much sense to connect 
“failure” as a self-loop on this specific processor. We have to restart NiFi 
regularly, so my thought was that if we have to decompress a huge file and NiFi 
does a hard restart while processing it, the file goes into the failure queue – 
but maybe I’m wrong and it goes back to the success queue before the 
“CompressContent” processor? That’s the only reason I’ve connected “failure” as 
a loop…

Thanks for the tip regarding the failure relationship; we will do what you 
suggested.

What still doesn’t make sense to me is why NiFi doesn’t release the 
content_repo disk space after a failure when archiving is disabled?

Cheers Josef

From: Pierre Villard <pierre.villard...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 11:50
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up 
content_repo insanely fast by corrupt GZIP files

Hi Josef,

I don't think it's a good idea to use the failure relationship as a self-loop 
on the processor. If the decompression failed, it's *very* likely that it will 
fail again and again. Usually, when developing a processor, the best practice 
is to have a 'retry' relationship to handle errors that could be resolved a few 
seconds later. You have such a relationship on a few processors.

The failure relationship gives you the possibility to handle errors the way you 
want. For instance, in your case, you could move the file to a 'quarantine' 
folder and send an email to notify that such an error occurred, so that it can 
be processed manually if needed.

What we could do is penalize the flow file when it goes to the failure 
relationship, but that's not necessarily a good idea when the relationship is 
not a self-loop.
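
For what it's worth, penalizing is a single session call inside the processor; the snippet below is only a sketch of what that failure path could look like (the class and relationship here are illustrative, not CompressContent's actual code):

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    // Sketch of penalizing a FlowFile before routing it to 'failure'. With a
    // self-looped failure relationship, the FlowFile only becomes eligible for
    // another attempt after the processor's configured Penalty Duration expires.
    class PenalizeOnFailureSketch {

        static final Relationship REL_FAILURE = new Relationship.Builder()
                .name("failure")
                .description("FlowFiles that could not be processed")
                .build();

        void routeToFailure(ProcessSession session, FlowFile flowFile) {
            flowFile = session.penalize(flowFile);   // honors the processor's Penalty Duration
            session.transfer(flowFile, REL_FAILURE);
        }
    }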

Hope it makes sense.
Pierre


On Fri, Jan 4, 2019 at 10:21 AM, <josef.zahn...@swisscom.com> wrote:
Hi guys

It has already happened twice that our production 8-node cluster, with 800GB of 
content_repository storage per node, ran out of disk space – in the end 
everything stopped working within a few minutes, which under normal 
circumstances shouldn’t be possible as we process at peak no more than 
40GB/5min. So I’ve investigated and found out that the culprit was the NiFi 
“Compress Content” processor, which we use to decompress GZIP files, together 
with a few small (a few MBs) corrupt GZIPs. After a restart of NiFi the whole 
content_repository was emptied again (I’d deleted the corrupt GZIPs from the 
queue before the restart).

Today I made a small test with the “Compress Content” processor on a standalone 
NiFi VM. I used a corrupt 10MB GZIP file and let NiFi decompress it while 
observing the size of the content_repository, and I was shocked how fast the 
disk space was eaten up: it took 8s to generate 3GB of content_repository space 
from this 10MB file…! To free up the space I had to restart NiFi.

For us this is a major issue, as in the case of a corrupt GZIP, NiFi stops 
really fast due to the lack of disk space. The workaround for now is to connect 
“failure” to another processor where it is terminated (or to auto-terminate it 
directly on the “Compress Content” processor). However, that’s not how the 
“failure” connection is meant to be used under normal circumstances.

I don’t think this is normal behavior, right? Why doesn’t the NiFi processor 
free up the content_repo when decompression of a file fails?

Btw. we use “org.apache.nifi.provenance.VolatileProvenanceRepository” and we 
have “nifi.content.repository.archive.enabled” disabled.

Below are some additional outputs/pictures to illustrate everything.

Cheers Josef





nifi.properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false


NiFi Canvas
<image001.png>


Disk Consumption content_repo
[user@nifi nifi]$ for i in {1..10}; do du -sh * | grep content; date ; sleep 2; 
done
82M     content_repository
Fri Jan  4 09:35:00 CET 2019
82M     content_repository
Fri Jan  4 09:35:02 CET 2019    -> Start “Compress Content”
532M    content_repository
Fri Jan  4 09:35:04 CET 2019
1.2G    content_repository
Fri Jan  4 09:35:06 CET 2019
1.8G    content_repository
Fri Jan  4 09:35:08 CET 2019
2.5G    content_repository
Fri Jan  4 09:35:10 CET 2019
3.2G    content_repository
Fri Jan  4 09:35:12 CET 2019
3.5G    content_repository
Fri Jan  4 09:35:14 CET 2019    -> Stop “Compress Content”
3.5G    content_repository
Fri Jan  4 09:35:16 CET 2019
3.5G    content_repository
Fri Jan  4 09:35:18 CET 2019


NiFi Error Message
CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02] Unable to decompress 
StandardFlowFileRecord[uuid=578b1e61-914a-4b4c-9b82-82de4cb51265,claim=StandardContentClaim
 [resourceClaim=StandardResourceClaim[id=1546590880942-1, 
container=default, section=1], offset=0, 
length=11471455],offset=0,name=name_dhcp_1_log.1545816464.gz,size=11471455] 
using gzip compression format due to IOException thrown from 
CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: 
Gzip-compressed data is corrupt; routing to failure: 
org.apache.nifi.processor.exception.ProcessException: IOException thrown from 
CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: 
Gzip-compressed data is corrupt


