Thank you Joe -
The content repo doesn't seem to be the issue - it's the flowfile repo.
Here is the section from one of the nodes:
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=50 KB
nifi.content.repository.directory.default=/data/4/nifi_content_repository
nifi.content.repository.archive.max.retention.period=2 days
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
nifi.flowfile.repository.checkpoint.interval=300 secs
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.retain.orphaned.flowfiles=true
-Joe
On 7/12/2023 11:07 AM, Joe Witt wrote:
Joe
I don't recall the specific version in which we got it truly sorted, but
there was an issue with our default settings for an important content
repo property and how we handled a mixture of large/small flowfiles
written within the same underlying slab/claim in the content repository.
Please check what you have for conf/nifi.properties
nifi.content.claim.max.appendable.size=
What value do you have there? I recommend reducing it to 50KB and
restarting.
Can you show your full 'nifi.content' section from the nifi.properties?
Thanks
On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Raising this thread from the dead...
Having issues with IO to the flowfile repository. NiFi will show
500k flow files and a size of ~1.7G - but the size on disk on each
of the 4 nodes is massive - over 100G, and disk IO to the flowfile
spindle is just pegged doing writes.
I do have ExtractText processors that take the flowfile content
(.*) and put it into an attribute, but the size of these is maybe
10k at most. How can I find out what module (there
are some 2200) is causing the issue? I think I'm doing something
fundamentally wrong with NiFi. :)
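As a rough sanity check on the numbers in this message, here is a minimal Python sketch that multiplies the FlowFile count by an assumed per-FlowFile attribute size. The function names are hypothetical, and the inputs (500k FlowFiles, about 10 KB of extracted content per FlowFile) are taken from the figures quoted above, assuming the worst case where every FlowFile carries the full 10 KB. It is an estimate of attribute payload only, not a model of how NiFi lays out the repository on disk.

```python
def estimated_attribute_bytes(flowfile_count, avg_attribute_bytes):
    """Total attribute payload if every FlowFile carries avg_attribute_bytes."""
    return flowfile_count * avg_attribute_bytes


def to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


# Assumed worst case from the message above: 500k FlowFiles, 10 KB attribute each.
payload = estimated_attribute_bytes(500_000, 10 * 1024)
print(f"attribute payload: {to_gib(payload):.2f} GiB")
```

Under those assumptions the attribute payload alone comes to several GiB per snapshot, so even "small" 10 KB attributes are worth checking when the FlowFile repository is much larger than the size shown in the UI.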
Perhaps I should change the size of all the queues to something
less than 10k/1G?
Under cluster/FLOWFILE STORAGE, one of the nodes shows 3.74TBytes
of usage, but it's actually ~150G on disk. The other nodes are
correct.
Ideas on what to debug?
Thank you!
-Joe (NiFi 1.18)
On 3/22/2023 12:49 PM, Mark Payne wrote:
OK. So changing the checkpoint interval to 300 seconds might help
reduce IO a bit. But it will cause the repo to become much
larger, and it will take much longer to startup whenever you
restart NiFi.
The variance in size between nodes is likely due to how recently
it’s checkpointed. If it stays large like 31 GB while the others
stay small, that would be interesting to know.
Thanks
-Mark
On Mar 22, 2023, at 12:45 PM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Thanks for this Mark. I'm not seeing any large attributes at
the moment but will go through this and verify - but I did have
one queue that was set to 100k instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up
from 5) and the nifi.flowfile.repository.checkpoint.interval to
300 seconds (up from 20).
While it's running the size of the flowfile repo varies
(wildly?) on each of the nodes from 1.5G to over 30G. Disk IO
is still very high, but it's running now and I can use the UI.
Interestingly at this point the UI shows 677k files and 1.5G of
flow. But disk usage on the flowfile repo is 31G, 3.7G, and
2.6G on the 3 nodes. I'd love to throw some SSDs at this
problem. I can add more nifi nodes.
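To compare the on-disk size of the flowfile repo across nodes at the same moment, a small script can be run on each node. This is a minimal sketch, not a NiFi tool: the function names are hypothetical, and it sums apparent file sizes rather than allocated disk blocks, so its total may differ somewhat from what `du` reports.

```python
import os
import sys


def directory_size_bytes(root):
    """Sum the apparent sizes of all files under root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # Files may vanish mid-walk while the repo is being rewritten.
                continue
    return total


def format_gib(n_bytes):
    """Render a byte count as GiB with two decimals."""
    return f"{n_bytes / 1024**3:.2f} GiB"


if __name__ == "__main__":
    # Pass one or more repository directories as arguments.
    for repo in sys.argv[1:]:
        print(repo, format_gib(directory_size_bytes(repo)))
```

Running it with the directory configured in `nifi.flowfile.repository.directory` on each node, shortly after one another, gives comparable snapshots of the sizes discussed here.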
-Joe
On 3/22/2023 11:08 AM, Mark Payne wrote:
Joe,
The errors noted are indicating that NiFi cannot communicate
with registry. Either the registry is offline, NiFi’s Registry
Client is not configured properly, there’s a firewall in the
way, etc.
A FlowFile repo of 35 GB is rather huge. This would imply one
of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.
Typically, FlowFile attributes should be kept minimal and should
never contain chunks of contents from the FlowFile content.
Often when we see this type of behavior it’s due to using
something like ExtractText or EvaluateJsonPath to put large
blocks of content into attributes.
And in this case, setting Backpressure Threshold above 10,000
is even more concerning, as it means even greater disk I/O.
Thanks
-Mark
On Mar 22, 2023, at 11:01 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Thank you Mark. These are SATA drives - but there's no way
for the flowfile repo to be on multiple spindles. It's not
huge - maybe 35G per node.
I do see a lot of messages like this in the log:
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
with Flow Registry because could not retrieve version 2 of
flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in
bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
Binary Data] with Flow Registry because could not retrieve
version 1 of flow with identifier
7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
A clue?
-joe
On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,
1.8 million FlowFiles is not a concern. But when you say
“Should I reduce the queue sizes?” it makes me wonder if
they’re all in a single queue?
Generally, you should leave the backpressure threshold at the
default 10,000 FlowFile max. Increasing this can lead to huge
amounts of swapping, which will drastically reduce
performance and increase disk utilization very significantly.
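With a large flow, one way to act on this advice is to list every connection whose backpressure object threshold was raised above the default of 10,000. The sketch below assumes you already have the configured thresholds in a plain dictionary. The connection names and values are hypothetical, and how you export them from your flow is left open.

```python
DEFAULT_OBJECT_THRESHOLD = 10_000  # default backpressure threshold cited above


def queues_above_default(thresholds, default=DEFAULT_OBJECT_THRESHOLD):
    """Return (name, threshold) pairs above the default, largest first."""
    over = [(name, value) for name, value in thresholds.items() if value > default]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical connection names and configured thresholds.
example = {
    "extract-text -> route": 100_000,
    "fetch -> tika": 10_000,
    "tika -> write-db": 5_000,
}

for name, value in queues_above_default(example):
    print(f"{name}: {value}")
```

Only connections strictly above the default are reported, so a queue left at 10,000 is not flagged.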
Also from the diagnostics, it looks like you’ve got a lot of
CPU cores, but you’re not using much. And based on the amount
of disk space available and the fact that you’re seeing 100%
utilization, I’m wondering if you’re using spinning disks,
rather than SSDs? I would highly recommend always running
NiFi with ssd/nvme drives. Absent that, if you have multiple
disk drives, you could also configure the content repository
to span multiple disks, in order to spread that load.
Thanks
-Mark
On Mar 22, 2023, at 10:41 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Thank you. Was able to get in.
Currently there are 1.8 million flow files and 3.2G. Is
this too much for a 3 node cluster with multiple spindles
each (SATA drives)?
Should I reduce the queue sizes?
-Joe
On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,
If you need the UI to come back up, try setting the
autoresume setting in nifi.properties to false and restart
node(s).
This will bring every component/controller service up
stopped/disabled and may provide some breathing room for
the UI to become available again.
Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
<joseph.obernber...@gmail.com>, wrote:
atop shows the disk as being all red with IO - 100% utilization.
There are a lot of flowfiles currently trying to run through, but
I can't monitor it because....UI won't load.
-Joe
On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,
I’d recommend taking a look at garbage collection. It is
far more likely the culprit than disk I/O.
Thanks
-Mark
On Mar 22, 2023, at 10:12 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
I'm getting "java.net.SocketTimeoutException: timeout"
from the user interface of NiFi when load is heavy. This
is 1.18.0 running on a 3 node cluster. Disk IO is high
and when that happens, I can't get into the UI to stop
any of the processors.
Any ideas?
I have put the flowfile repository and content
repository on different disks on the 3 nodes, but disk
usage is still so high that I can't get in.
Thank you!
-Joe