Ah ok. And 'data/5' is its own partition (same physical disk as data/4?). And data/5 is where you see those large files? Can you show what you see there in terms of files/sizes?
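For example, the output of something along these lines (adjusting the path to match your nifi.properties) would help:

du -h -d 1 /data/5/nifi_flowfile_repository
ls -lhS /data/5/nifi_flowfile_repository | head

That should show whether the bulk of the space is sitting in the write-ahead journals, the checkpoint files, or swap.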
For the checkpoint period the default is 20 seconds. Am curious to know what benefit moving to 300 seconds was giving (might be perfectly fine for some cases - just curious).

Thanks

On Wed, Jul 12, 2023 at 8:18 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Thank you Joe -
> The content repo doesn't seem to be the issue - it's the flowfile repo.
> Here is the section from one of the nodes:
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=50 KB
> nifi.content.repository.directory.default=/data/4/nifi_content_repository
> nifi.content.repository.archive.max.retention.period=2 days
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=false
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
> nifi.flowfile.repository.checkpoint.interval=300 secs
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.retain.orphaned.flowfiles=true
>
> -Joe
>
> On 7/12/2023 11:07 AM, Joe Witt wrote:
>
>> Joe
>>
>> I don't recall the specific version in which we got it truly sorted, but there was an issue with our default settings for an important content repo property and how we handled a mixture of large/small flowfiles written within the same underlying slab/claim in the content repository.
>>
>> Please check what you have in conf/nifi.properties for:
>> nifi.content.claim.max.appendable.size=
>>
>> What value do you have there? I recommend reducing it to 50KB and restarting.
>>
>> Can you show your full 'nifi.content' section from the nifi.properties?
>>
>> Thanks
>>
>> On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Raising this thread from the dead...
>> Having issues with IO to the flowfile repository. NiFi will show 500k flow files and a size of ~1.7G - but the size on disk on each of the 4 nodes is massive - over 100G, and disk IO to the flowfile spindle is just pegged doing writes.
>>
>> I do have ExtractText processors that take the flowfile content (.*) and put it into an attribute, but the sizes of these are maybe 10k at most. How can I find out which module (there are some 2,200) is causing the issue? I think I'm doing something fundamentally wrong with NiFi. :) Perhaps I should change the size of all the queues to something less than 10k/1G?
>>
>> Under cluster/FLOWFILE STORAGE, one of the nodes shows 3.74 TBytes of usage, but it's actually ~150G on disk. The other nodes are correct.
>>
>> Ideas on what to debug?
>> Thank you!
>>
>> -Joe (NiFi 1.18)
>>
>> On 3/22/2023 12:49 PM, Mark Payne wrote:
>>
>> OK. So changing the checkpoint interval to 300 seconds might help reduce IO a bit. But it will cause the repo to become much larger, and it will take much longer to start up whenever you restart NiFi.
>>
>> The variance in size between nodes is likely due to how recently each has checkpointed. If one stays large like 31 GB while the others stay small, that would be interesting to know.
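>>
>> For reference, going back to the default is just a matter of restoring that one line in nifi.properties, i.e. something like:
>>
>> nifi.flowfile.repository.checkpoint.interval=20 secs
>>
>> (20 seconds being the stock default).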
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thanks for this Mark. I'm not seeing any large attributes at the moment but will go through this and verify - but I did have one queue that was set to 100k instead of 10k.
>> I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5) and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up from 20).
>>
>> While it's running, the size of the flowfile repo varies (wildly?) on each of the nodes, from 1.5G to over 30G. Disk IO is still very high, but it's running now and I can use the UI. Interestingly, at this point the UI shows 677k files and 1.5G of flow, but disk usage on the flowfile repo is 31G, 3.7G, and 2.6G on the 3 nodes. I'd love to throw some SSDs at this problem. I can add more nifi nodes.
>>
>> -Joe
>>
>> On 3/22/2023 11:08 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> The errors noted indicate that NiFi cannot communicate with the registry. Either the registry is offline, NiFi's Registry Client is not configured properly, there's a firewall in the way, etc.
>>
>> A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
>> - You have a huge number of FlowFiles (doesn't seem to be the case)
>> - FlowFiles have a huge number of attributes, or
>> - FlowFiles have 1 or more huge attribute values.
>>
>> Typically, FlowFile attributes should be kept minimal and should never contain chunks of the FlowFile content. Often when we see this type of behavior it's due to using something like ExtractText or EvaluateJsonPath to put large blocks of content into attributes.
>>
>> And in this case, setting the Backpressure Threshold above 10,000 is even more concerning, as it means even greater disk I/O.
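>>
>> If you do need ExtractText, it may be worth double-checking its Maximum Capture Group Length property, which - if I recall correctly - defaults to something like:
>>
>> Maximum Capture Group Length: 1024
>>
>> That caps how much content a pattern like (.*) can copy into a single attribute. But the safer pattern is to extract only the small fields you actually need.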
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 11:01 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thank you Mark. These are SATA drives - but there's no way for the flowfile repo to be on multiple spindles. It's not huge - maybe 35G per node.
>> I do see a lot of messages like this in the log:
>>
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>>
>> A clue?
>>
>> -Joe
>>
>> On 3/22/2023 10:49 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> 1.8 million FlowFiles is not a concern. But when you say "Should I reduce the queue sizes?" it makes me wonder if they're all in a single queue? Generally, you should leave the backpressure threshold at the default 10,000 FlowFile max. Increasing it can lead to huge amounts of swapping, which will drastically reduce performance and increase disk utilization very significantly.
>>
>> Also, from the diagnostics it looks like you've got a lot of CPU cores, but you're not using much of them. And based on the amount of disk space available and the fact that you're seeing 100% utilization, I'm wondering if you're using spinning disks rather than SSDs? I would highly recommend always running NiFi with SSD/NVMe drives. Absent that, if you have multiple disk drives, you could also configure the content repository to span multiple disks, in order to spread that load.
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 10:41 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thank you. Was able to get in.
>> Currently there are 1.8 million flow files and 3.2G. Is this too much for a 3-node cluster with multiple spindles each (SATA drives)?
>> Should I reduce the queue sizes?
>>
>> -Joe
>>
>> On 3/22/2023 10:23 AM, Phillip Lord wrote:
>>
>> Joe,
>>
>> If you need the UI to come back up, try setting the autoresume setting in nifi.properties to false and restarting the node(s).
>> This will bring every component/controller service up stopped/disabled, and may provide some breathing room for the UI to become available again.
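>>
>> That setting should be this line (it defaults to true):
>>
>> nifi.flowcontroller.autoResumeState=false
>>
>> Just remember to set it back once things calm down, or components will come up stopped after every restart.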
>>
>> Phil
>>
>> On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> atop shows the disk as being all red with IO - 100% utilization. There are a lot of flowfiles currently trying to run through, but I can't monitor it because... the UI won't load.
>>
>> -Joe
>>
>> On 3/22/2023 10:16 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> I'd recommend taking a look at garbage collection. It is far more likely the culprit than disk I/O.
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 10:12 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> I'm getting "java.net.SocketTimeoutException: timeout" from the user interface of NiFi when load is heavy. This is 1.18.0 running on a 3-node cluster. Disk IO is high, and when that happens I can't get into the UI to stop any of the processors.
>> Any ideas?
>>
>> I have put the flowfile repository and content repository on different disks on the 3 nodes, but disk usage is still so high that I can't get in.
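>> The layout is roughly this, per the repo sections of nifi.properties on each node:
>>
>> nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
>> nifi.content.repository.directory.default=/data/4/nifi_content_repository
>>
>> Thank you!
>>
>> -Joe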