Re: UI SocketTimeoutException - heavy IO

Mark Payne Wed, 12 Jul 2023 09:44:25 -0700

Joe,

The way that the processor works is that it adds an attribute for every 
“Capturing Group” in the regular expression.
This includes a “Capturing Group” 0, which contains the entire value that the 
regex was run against.
You can actually disable capturing this as an attribute by setting the “Include 
Capture Group 0” property to false.
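
As a rough illustration of that naming (a Python sketch, not NiFi's actual implementation; `extract_attributes` and its `include_group_zero` flag are hypothetical stand-ins for the processor and its “Include Capture Group 0” property), a property named value with the regex (.*) would produce the three attributes Joe saw:

```python
import re

def extract_attributes(prop_name, pattern, content, include_group_zero=True):
    """Mimic the attribute naming described in this thread (illustrative only)."""
    match = re.search(pattern, content, re.DOTALL)
    attrs = {}
    if match is None:
        return attrs
    # Group 0 is the whole match; groups 1..n are the parenthesized groups.
    start = 0 if include_group_zero else 1
    for i in range(start, match.re.groups + 1):
        group = match.group(i)
        if group is None:
            continue  # optional group that did not participate in the match
        attrs[f"{prop_name}.{i}"] = group
        if i == 1:
            # Assumption: the un-suffixed attribute mirrors the first group.
            attrs[prop_name] = group
    return attrs

print(sorted(extract_attributes("value", r"(.*)", "hello")))
print(sorted(extract_attributes("value", r"(.*)", "hello", include_group_zero=False)))
```

With group 0 included, the same content is stored three times, which matters when the captured text is large.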

Thanks
-Mark

On Jul 12, 2023, at 12:36 PM, Joe Obernberger <joseph.obernber...@gmail.com>
wrote:

Thank you Mark - it looks like attributes are to blame. I'm adding lots of
UpdateAttribute to delete them as soon as they are not needed and disk IO has
dropped.
Right now, it's all going to 'spinning rust' - soon to all new SSDs, but either
way, this needed addressing.

One oddity is that when I do ExtractText to a property (call it value) of (.*),
I'll see value, and value.0, value.1 in the attributes list. Not sure why it
makes multiple copies.

-Joe

On 7/12/2023 11:27 AM, Mark Payne wrote:
Joe,

How many FlowFiles are you processing here? Let’s say, per second? How many
processors are in those flows?

Is the FlowFile Repo a spinning disk, SSD, or NAS?

You said you’re using ExtractText to pull 10 KB into an attribute. I presume
you’re then doing something with it. So maybe you’re extracting a few parts of
it using jsonPath in expression language or whatever the case may be. So that
one 10KB attribute is not the only attribute you have. So theoretically, let’s
consider:

- Total of all attributes for a FlowFile is 20 KB
- You process an average of 1,000 FlowFiles per second
- Each FlowFile goes through 15 processors, each of which modifies at least
one attribute.

That means you’re writing to the flowfile repository about 20 KB * 1000 * 15
per second - or about 300 MB/sec.
This is why we constantly warn against creating large attributes. Attributes
are meant to be on the order of say 100-200 characters - not 10 KB.
If you’re processing a few thousand FlowFiles per hour then 10 KB is fine, but
if you’re processing a bunch of FlowFiles it adds up very quickly.
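
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic under the same simplifying assumptions as the text: every attribute update rewrites the FlowFile's full attribute set to the repository, 1 KB is taken as 1,000 bytes, and one character is roughly one byte. The function name is made up for illustration.

```python
def repo_write_rate_mb_per_sec(attr_bytes_per_flowfile, flowfiles_per_sec, updates_per_flowfile):
    """Approximate FlowFile repository write rate in MB/sec (decimal units)."""
    bytes_per_sec = attr_bytes_per_flowfile * flowfiles_per_sec * updates_per_flowfile
    return bytes_per_sec / 1_000_000

# 20 KB of attributes, 1,000 FlowFiles/sec, 15 attribute-modifying processors
print(repo_write_rate_mb_per_sec(20_000, 1000, 15))  # 300.0

# Same flow with attributes totalling about 200 characters
print(repo_write_rate_mb_per_sec(200, 1000, 15))  # 3.0
```

Keeping attributes near the recommended size cuts the write rate by two orders of magnitude for the same flow.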

Thanks
-Mark

On Jul 12, 2023, at 11:16 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thank you Joe -
The content repo doesn't seem to be the issue - it's the flowfile repo.
Here is the section from one of the nodes:

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=50 KB
nifi.content.repository.directory.default=/data/4/nifi_content_repository
nifi.content.repository.archive.max.retention.period=2 days
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
nifi.flowfile.repository.checkpoint.interval=300 secs
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.retain.orphaned.flowfiles=true

-Joe

On 7/12/2023 11:07 AM, Joe Witt wrote:
Joe

I don't recall the specific version in which we got it truly sorted but there
was an issue with our default settings for an important content repo property
and how we handled a mixture of large/small flowfiles written within the same
underlying slab/claim in the content repository.

Please check what you have for conf/nifi.properties
nifi.content.claim.max.appendable.size=

What value do you have there? I recommend reducing it to 50KB and restarting.

Can you show your full 'nifi.content' section from the nifi.properties?
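
One way to pull out just that section is a small script like the one below. It is a sketch, not a NiFi tool: `properties_with_prefix` is a hypothetical helper, the `conf/nifi.properties` path is assumed to be relative to the NiFi install directory, and the parsing only handles simple `key=value` lines.

```python
from pathlib import Path

def properties_with_prefix(text, prefix):
    """Return (key, value) pairs from properties text whose key starts with prefix."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().startswith(prefix):
            pairs.append((key.strip(), value.strip()))
    return pairs

if __name__ == "__main__":
    conf = Path("conf/nifi.properties")  # assumed location; adjust for your install
    if conf.exists():
        for key, value in properties_with_prefix(conf.read_text(), "nifi.content."):
            print(f"{key}={value}")
```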

Thanks

On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Raising this thread from the dead...
Having issues with IO to the flowfile repository. NiFi will show 500k flow
files and a size of ~1.7G - but the size on disk on each of the 4 nodes is
massive - over 100G, and disk IO to the flowfile spindle is just pegged doing
writes.

I do have ExtractText processors that take the flowfile content (.*) and put it
into an attribute, but these are maybe 10k in size at most.
How can I find out what module (there are some 2200) is causing the issue? I
think I'm doing something fundamentally wrong with NiFi. :)
Perhaps I should change the size of all the queues to something less than
10k/1G?

Under cluster/FLOWFILE STORAGE, one of the nodes shows 3.74TBytes of usage, but
it's actually ~150G on disk. The other nodes are correct.

Ideas on what to debug?
Thank you!

-Joe (NiFi 1.18)

On 3/22/2023 12:49 PM, Mark Payne wrote:
OK. So changing the checkpoint interval to 300 seconds might help reduce IO a
bit. But it will cause the repo to become much larger, and it will take much
longer to startup whenever you restart NiFi.

The variance in size between nodes is likely due to how recently it’s
checkpointed. If it stays large like 31 GB while the others stay small, that
would be interesting to know.

Thanks
-Mark

On Mar 22, 2023, at 12:45 PM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thanks for this Mark. I'm not seeing any large attributes at the moment but
will go through this and verify - but I did have one queue that was set to 100k
instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5) and
the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up from 20).

While it's running the size of the flowfile repo varies (wildly?) on each of
the nodes from 1.5G to over 30G. Disk IO is still very high, but it's running
now and I can use the UI. Interestingly at this point the UI shows 677k files
and 1.5G of flow. But disk usage on the flowfile repo is 31G, 3.7G, and 2.6G
on the 3 nodes. I'd love to throw some SSDs at this problem. I can add more
nifi nodes.

-Joe

On 3/22/2023 11:08 AM, Mark Payne wrote:
Joe,

The errors noted are indicating that NiFi cannot communicate with registry.
Either the registry is offline, NiFi’s Registry Client is not configured
properly, there’s a firewall in the way, etc.

A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.

Typically, FlowFile attributes should be kept minimal and should never contain
chunks of contents from the FlowFile content. Often when we see this type of
behavior it’s due to using something like ExtractText or EvaluateJsonPath to
put large blocks of content into attributes.

And in this case, setting Backpressure Threshold above 10,000 is even more
concerning, as it means even greater disk I/O.

Thanks
-Mark

On Mar 22, 2023, at 11:01 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thank you Mark. These are SATA drives - but there's no way for the flowfile
repo to be on multiple spindles. It's not huge - maybe 35G per node.
I do see a lot of messages like this in the log:

2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
Handle Extract Metadata] with Flow Registry because could not retrieve version
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
with Flow Registry because could not retrieve version 2 of flow with
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
Handle Extract Metadata] with Flow Registry because could not retrieve version
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
Binary Data] with Flow Registry because could not retrieve version 1 of flow
with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
refused)

A clue?

-joe

On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I reduce the
queue sizes?” it makes me wonder if they’re all in a single queue?
Generally, you should leave the backpressure threshold at the default 10,000
FlowFile max. Increasing this can lead to huge amounts of swapping, which will
drastically reduce performance and increase disk utilization very significantly.

Also from the diagnostics, it looks like you’ve got a lot of CPU cores, but
you’re not using much. And based on the amount of disk space available and the
fact that you’re seeing 100% utilization, I’m wondering if you’re using
spinning disks, rather than SSDs? I would highly recommend always running NiFi
with ssd/nvme drives. Absent that, if you have multiple disk drives, you could
also configure the content repository to span multiple disks, in order to
spread that load.

Thanks
-Mark

On Mar 22, 2023, at 10:41 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thank you. Was able to get in.
Currently there are 1.8 million flow files and 3.2G. Is this too much for a 3
node cluster with multiple spindles each (SATA drives)?
Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,

If you need the UI to come back up, try setting the autoresume setting in
nifi.properties to false and restart node(s).
This will bring every component/controllerService up stopped/disabled and
may provide some breathing room for the UI to become available again.

Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because... the UI won't load.

-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,

I’d recommend taking a look at garbage collection. It is far more likely the
culprit than disk I/O.

Thanks
-Mark

On Mar 22, 2023, at 10:12 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

I'm getting "java.net.SocketTimeoutException: timeout" from the user interface
of NiFi when load is heavy. This is 1.18.0 running on a 3 node cluster. Disk IO
is high and when that happens, I can't get into the UI to stop any of the
processors.
Any ideas?

I have put the flowfile repository and content repository on different disks on
the 3 nodes, but disk usage is still so high that I can't get in.
Thank you!

-Joe

