Josef, OK, thanks for confirming. My suspicion is that the Load-Balancing bug is what is biting you, and that when you tried to replicate with the CompressContent in a simple case, you may have just been experiencing the "cleanup lag" related to the way that the repositories interact with one another.
Custom Processors should not be an issue. You should not be able to cause any FlowFile to stay in the Repository.

Thanks
-Mark

On Jan 4, 2019, at 11:48 AM, <josef.zahn...@swisscom.com> wrote:

Mark,

Yes, we are using the Load Balancing capability, and we do so right after the ListSFTP processor – so yes, we load-balance 0-byte files. It seems we are probably facing your bug here. Thanks a lot for explaining in detail what happens with the flowfile/content repo in NiFi.

Additionally, we have several custom processors – could it be that one of them is causing this as well? Can someone share a (Java) code snippet which ensures that a custom processor doesn’t keep FlowFiles in the content repo?

Cheers, Josef

From: Mark Payne <marka...@hotmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 14:48
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanely fast by corrupt GZIP files

Josef,

Thanks for the info! There are a few things to consider here.

Firstly, you said that you are using NiFi 1.8.0. Are you using the new Load Balancing capability? I.e., do you have any Connections configured to balance load across your cluster? And if so, are you load-balancing any 0-byte files? If so, then you may be getting bitten by [1]. That can result in data staying in the Content Repo and not getting cleaned up until restart.

The second important thing to consider is the interaction between the FlowFile Repository and the Content Repository. At a high level, the Content Repository stores the FlowFiles' content/payload, while the FlowFile Repository stores the FlowFiles' attributes, which queue each FlowFile is in, and some other metadata.
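On the custom-processor question above, a minimal sketch of the usual onTrigger shape (not from this thread, and not compilable without the nifi-api dependency; REL_SUCCESS and REL_FAILURE stand in for the processor's own declared relationships). The key rule is that every FlowFile obtained from the session must be transferred or removed before onTrigger returns, and all content access must go through the session so the content-claim reference counts stay correct:

```java
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    try {
        // Read/write content only through the session; the framework closes
        // the streams and reference-counts the underlying content claims.
        flowFile = session.write(flowFile, (in, out) -> {
            in.transferTo(out); // placeholder for the real transformation
        });
        session.transfer(flowFile, REL_SUCCESS);
    } catch (final ProcessException e) {
        // Always transfer (or session.remove(...)) the FlowFile – never drop
        // the reference, and never keep streams from the session open past
        // the end of onTrigger.
        session.transfer(session.penalize(flowFile), REL_FAILURE);
    }
}
```

As long as a processor follows this pattern, the framework releases its claims on the content at session commit, and the claims become eligible for cleanup at the next FlowFile repository checkpoint.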
Once a FlowFile completes its processing and is no longer part of the flow, we cannot simply delete the content claim from the Content Repository. If we did, we could have a condition where the node is restarted before the FlowFile Repository has been fully flushed to disk (NiFi may have already written to the file, but the Operating System may be caching that write without having flushed/"fsync'ed" it to disk). In such a case, we want the transaction to be "rolled back" and reprocessed. So, if we deleted the Content Claim from the Content Repository immediately when it is no longer needed and then restarted, we could have a case where the FlowFile repo wasn't flushed to disk and, as a result, points to a Content Claim that has been deleted – and that would mean data loss.

To avoid that scenario, what we do instead is keep track of how many "claims" there are for each Content Claim, and then, when the FlowFile repo performs a checkpoint (every 2 minutes by default), we go through and delete any Content Claims that have a claim count of 0. This means that any Content Claim that has been accessed in the past 2 minutes (or however long the checkpoint interval is) will be considered "active" and will not be cleaned up.

I hope this helps to explain some of the behavior – but if not, let's please investigate further!

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-5771

On Jan 4, 2019, at 7:41 AM, <josef.zahn...@swisscom.com> wrote:

Hi Joe

We use NiFi 1.8.0. Yes, we have a different partition for each repo – you can see the partitions below.
[nifi@nifi-12 ~]$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/disk1-root              100G  2.0G   99G   2% /
devtmpfs                            126G     0  126G   0% /dev
tmpfs                               126G     0  126G   0% /dev/shm
tmpfs                               126G  3.1G  123G   3% /run
tmpfs                               126G     0  126G   0% /sys/fs/cgroup
/dev/sda1                          1014M  188M  827M  19% /boot
/dev/mapper/disk1-home               30G   34M   30G   1% /home
/dev/mapper/disk1-var               100G  1.1G   99G   2% /var
/dev/mapper/disk1-opt                50G  5.9G   45G  12% /opt
/dev/mapper/disk1-database_repo    1014M   35M  980M   4% /database_repo
/dev/mapper/disk1-provenance_repo   4.0G   33M  4.0G   1% /provenance_repo
/dev/mapper/disk1-flowfile_repo     530G   34M  530G   1% /flowfile_repo
/dev/mapper/disk2-content_repo      850G   64G  786G   8% /content_repo
tmpfs                                26G     0   26G   0% /run/user/2000

Cheers, Josef

From: Joe Witt <joe.w...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 13:29
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanely fast by corrupt GZIP files

Josef

Not looping for that proc for sure makes sense. NiFi dying in the middle of a process/transaction is no problem – it will restart the transaction. But we do need to find out what is filling the repo. You have flowfile, content, and prov in different disk volumes or partitions, right? What version of NiFi?

Let's definitely figure this out. You should see clean behavior of the repos, and you should never have to restart.

thanks

On Fri, Jan 4, 2019, 7:16 AM Mike Thomsen <mikerthom...@gmail.com> wrote:

I agree with Pierre's take on the failure relationship.
Corrupted compressed files are also going to be nearly impossible to recover in most cases, so your best bet is to simply log the file name and other relevant attributes and establish a process to notify the source system that they sent you corrupt data.

On Fri, Jan 4, 2019 at 6:48 AM <josef.zahn...@swisscom.com> wrote:

Hi Arpad

I’m doing it (hopefully) gracefully:

* /opt/nifi/bin/nifi.sh stop
* /opt/nifi/bin/nifi.sh restart

But what I see in the case of our cluster is that it takes more than a few seconds until the service is stopped. I have never checked the log during shutdown – I guess I should, to verify whether it was really graceful or not. Do you mean that which queue the file ends up in depends on the shutdown?

“What still doesn’t make sense for me is why NiFi doesn’t release the content_repo disk space after a failure without archiving enabled?” It gives you the possibility to roll back failed operations. --> Sure, yes – but does this mean that we can’t get rid of content processed through a “failure” queue until we restart NiFi? And going further: if we weren’t using the VolatileProvenanceRepository, would it be impossible to get rid of it at all until we delete the files manually on the command line?

Cheers, Josef

From: Arpad Boda <ab...@hortonworks.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 12:14
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanely fast by corrupt GZIP files

“but maybe I’m wrong and it goes back to the success queue before the “CompressContent” processor?”

How do you shut it down – some graceful way, or do you just kill it?
“What still doesn’t make sense for me is why NiFi doesn’t release the content_repo disk space after a failure without archiving enabled?”

It gives you the possibility to roll back failed operations.

From: <josef.zahn...@swisscom.com>
Reply-To: <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 12:10
To: <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanely fast by corrupt GZIP files

Hi Pierre

Thanks for your feedback. You are right, it doesn’t make much sense to connect the “failure” relationship as a loop on this specific processor. We have to restart NiFi regularly, so my thought was that if NiFi does a hard restart while decompressing a huge file, the file goes into the failure queue – but maybe I’m wrong and it goes back into the success queue before the “CompressContent” processor? That’s the only reason I connected “failure” as a loop. Thanks for the tip regarding the failure relationship; we will do what you suggested.

What still doesn’t make sense to me is why NiFi doesn’t release the content_repo disk space after a failure when archiving is disabled.

Cheers, Josef

From: Pierre Villard <pierre.villard...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 11:50
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanely fast by corrupt GZIP files

Hi Josef,

I don't think it's a good idea to use the failure relationship as a self-loop on the processor. If the decompression failed, it's *very* likely that it will fail again and again.
Usually, when developing a processor, the best practice is to have a 'retry' relationship to handle errors that could be resolved a few seconds later; you have such a relationship on a few processors. The failure relationship gives you the possibility to handle errors the way you want. For instance, in your case, you could move the file to a 'quarantine' folder and send an email notification that such an error occurred, so that it can be processed manually if needed. What we could do is penalize the flow file when it goes into the failure relationship, but that's not necessarily a good idea when the relationship is not a self-loop.

Hope it makes sense.

Pierre

On Fri, 4 Jan 2019 at 10:21, <josef.zahn...@swisscom.com> wrote:

Hi guys

We have now twice had our production 8-node cluster, with 800GB of storage per node, run out of disk space for the content_repository – in the end everything stopped working within a few minutes, which under normal circumstances shouldn't be possible, as we process no more than 40GB/5min at peak. I investigated and found that the culprit was the NiFi “Compress Content” processor, which we use to decompress GZIP files, together with a few small (a few MBs) corrupt GZIPs. After a restart of NiFi the whole content_repository was emptied again (I deleted the corrupt GZIPs from the queue before the restart).

Today I ran a small test with the “Compress Content” processor on a standalone NiFi VM. I used a corrupt 10MB GZIP file and let NiFi decompress it while watching the size of the content_repository, and I was shocked how fast the disk space was eaten up: it took 8s to generate 3GB of content_repository space from this 10MB file! To free up the space I had to restart NiFi.
For us this is a major issue, as in the case of a corrupt GZIP, NiFi stops really fast due to the lack of disk space. The workaround for now is to connect the “failure” relationship to another, stopped processor (or to auto-terminate it directly on the “Compress Content” processor). However, that’s not the idea of the “failure” connection under normal circumstances.

I don’t think this is normal behavior, right? Why does the NiFi processor not free up the content_repo when decompression of a file fails? Btw. we use “org.apache.nifi.provenance.VolatileProvenanceRepository” and we have “nifi.content.repository.archive.enabled” disabled. Below are some additional outputs/pictures to explain everything.

Cheers, Josef

nifi.properties:
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false

NiFi Canvas
<image001.png>

Disk consumption of content_repo:
[user@nifi nifi]$ for i in {1..10}; do du -sh * | grep content; date ; sleep 2; done
82M  content_repository   Fri Jan 4 09:35:00 CET 2019
82M  content_repository   Fri Jan 4 09:35:02 CET 2019   -> Start “Compress Content”
532M content_repository   Fri Jan 4 09:35:04 CET 2019
1.2G content_repository   Fri Jan 4 09:35:06 CET 2019
1.8G content_repository   Fri Jan 4 09:35:08 CET 2019
2.5G content_repository   Fri Jan 4 09:35:10 CET 2019
3.2G content_repository   Fri Jan 4 09:35:12 CET 2019
3.5G content_repository   Fri Jan 4 09:35:14 CET 2019
-> Stop “Compress Content”
3.5G content_repository   Fri Jan 4 09:35:16 CET 2019
3.5G content_repository   Fri Jan 4 09:35:18 CET 2019

NiFi Error Message:
CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02] Unable to decompress StandardFlowFileRecord[uuid=578b1e61-914a-4b4c-9b82-82de4cb51265,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1546590880942-1, container=default, section=1], offset=0, length=11471455],offset=0,name=name_dhcp_1_log.1545816464.gz,size=11471455] using gzip compression format due to IOException thrown from CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: Gzip-compressed data is corrupt; routing to failure: org.apache.nifi.processor.exception.ProcessException: IOException thrown from CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: Gzip-compressed data is corrupt
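One detail behind the "Gzip-compressed data is corrupt" error above: the corruption is only detected partway through decompression, so every attempt still emits whatever bytes were decoded before the failure. Combined with a failure self-loop, each retry writes output to the content repository before erroring out. A small, self-contained sketch of that behavior using plain java.util.zip (this is not NiFi code – it only illustrates that a corrupt gzip stream can produce output before it fails):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CorruptGzipDemo {

    // Gzip `payload`, then flip a run of bytes in the middle to simulate corruption.
    static byte[] corruptGzip(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        byte[] bytes = bos.toByteArray();
        for (int i = bytes.length / 2; i < bytes.length / 2 + 16; i++) {
            bytes[i] ^= (byte) 0xFF; // corrupt the middle of the deflate stream
        }
        return bytes;
    }

    // Decompress until an error occurs; returns the number of bytes written
    // before the failure was detected, or -1 if the stream decompressed cleanly.
    static long decompressUntilError(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            in.transferTo(out);
            return -1;
        } catch (IOException e) {
            // Bytes decoded before the corruption was noticed are already in `out`
            return out.size();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] corrupt = corruptGzip(new byte[4 * 1024 * 1024]); // 4 MiB of zeros
        long written = decompressUntilError(corrupt);
        System.out.println("bytes written before failure was detected: " + written);
    }
}
```

Whether partial output is retained depends on the processor and repository behavior discussed above (claim counts, checkpoint interval, and the 0-byte load-balancing bug), but the demo shows why "decompress then fail" is not free of side effects on disk.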