Thanks for this, Mark. I'm not seeing any large attributes at the moment,
but I will go through this and verify. I did have one queue that was
set to 100k instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5)
and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up
from 20).
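For reference, a sketch of how those two changes would look in nifi.properties on each node. The values are the ones quoted above; the exact time-unit spelling and the "previous" lines are assumptions based on the numbers given here, not copied from the actual file.

```properties
# nifi.properties (edit on every node, then restart NiFi)
# Previous value, as described above: 5 seconds
nifi.cluster.node.connection.timeout=30 sec
# Previous value, as described above: 20 seconds
nifi.flowfile.repository.checkpoint.interval=300 secs
```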
While it's running the size of the flowfile repo varies (wildly?) on
each of the nodes from 1.5G to over 30G. Disk IO is still very high,
but it's running now and I can use the UI. Interestingly at this point
the UI shows 677k files and 1.5G of flow. But disk usage on the
flowfile repo is 31G, 3.7G, and 2.6G on the 3 nodes. I'd love to throw
some SSDs at this problem. I can add more nifi nodes.
-Joe
On 3/22/2023 11:08 AM, Mark Payne wrote:
Joe,
The errors noted are indicating that NiFi cannot communicate with
registry. Either the registry is offline, NiFi’s Registry Client is
not configured properly, there’s a firewall in the way, etc.
A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.
Typically, FlowFile attributes should be kept minimal and should never
contain chunks of the FlowFile's content. Often when we
see this type of behavior it’s due to using something like ExtractText
or EvaluateJsonPath to put large blocks of content into attributes.
And in this case, setting Backpressure Threshold above 10,000 is even
more concerning, as it means even greater disk I/O.
Thanks
-Mark
On Mar 22, 2023, at 11:01 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Thank you Mark. These are SATA drives - but there's no way for the
flowfile repo to be on multiple spindles. It's not huge - maybe 35G
per node.
I do see a lot of messages like this in the log:
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
with Flow Registry because could not retrieve version 2 of flow with
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
Binary Data] with Flow Registry because could not retrieve version 1
of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in
bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
A clue?
-joe
On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,
1.8 million FlowFiles is not a concern. But when you say “Should I
reduce the queue sizes?” it makes me wonder if they’re all in a
single queue?
Generally, you should leave the backpressure threshold at the
default 10,000 FlowFile max. Increasing this can lead to huge
amounts of swapping, which will drastically reduce performance and
increase disk utilization very significantly.
Also from the diagnostics, it looks like you’ve got a lot of CPU
cores, but you’re not using much. And based on the amount of disk
space available and the fact that you’re seeing 100% utilization,
I’m wondering if you’re using spinning disks, rather than SSDs? I
would highly recommend always running NiFi with SSD/NVMe drives.
Absent that, if you have multiple disk drives, you could also
configure the content repository to span multiple disks, in order to
spread that load.
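A minimal sketch of that content repository layout in nifi.properties, assuming two additional drives mounted at the hypothetical paths /mnt/disk1 and /mnt/disk2. The suffixes content1 and content2 are arbitrary labels chosen for illustration; please check the Admin Guide for your version before relying on this.

```properties
# nifi.properties (hypothetical mount points; adjust to your hosts)
# Each nifi.content.repository.directory.<suffix> entry adds one location,
# ideally on its own physical disk.
nifi.content.repository.directory.content1=/mnt/disk1/content_repository
nifi.content.repository.directory.content2=/mnt/disk2/content_repository
```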
Thanks
-Mark
On Mar 22, 2023, at 10:41 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Thank you. Was able to get in.
Currently there are 1.8 million flow files and 3.2G. Is this too
much for a 3 node cluster with multiple spindles each (SATA drives)?
Should I reduce the queue sizes?
-Joe
On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,
If you need the UI to come back up, try setting the autoresume
setting in nifi.properties to false and restart node(s).
This will bring every component/controller service up
stopped/disabled and may provide some breathing room for the UI to
become available again.
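The setting in question is most likely nifi.flowcontroller.autoResumeState. A sketch of the change, assuming an otherwise stock nifi.properties:

```properties
# nifi.properties (edit on each node you restart)
# The shipped default is true; false brings components up stopped.
nifi.flowcontroller.autoResumeState=false
```

Once the flow is back under control, setting it back to true restores the usual behavior on the next restart.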
Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
<joseph.obernber...@gmail.com>, wrote:
atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because the UI won't load.
-Joe
On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,
I’d recommend taking a look at garbage collection. It is far
more likely the culprit than disk I/O.
Thanks
-Mark
On Mar 22, 2023, at 10:12 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
I'm getting "java.net.SocketTimeoutException: timeout" from the
user interface of NiFi when load is heavy. This is 1.18.0
running on a 3 node cluster. Disk IO is high and when that
happens, I can't get into the UI to stop any of the processors.
Any ideas?
I have put the flowfile repository and content repository on
different disks on the 3 nodes, but disk usage is still so high
that I can't get in.
Thank you!
-Joe