Hi Joe - yes, /data/4 and /data/5 are separate spindles, and yes, /data/5
is where the flowfile repo is, which is large.
ls -lh
-rw-r--r-- 1 root root 6.5G Jul 12 12:36 checkpoint
-rw-r--r-- 1 root root 5.2G Jul 12 12:46 checkpoint.partial
drwxr-xr-x 4 root root 132 Jul 12 12:46 journals
Joe,
The way that the processor works is that it adds an attribute for every
“Capturing Group” in the regular expression.
This includes a “Capturing Group” 0, which contains the entire value that the
regex was run against.
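As a rough illustration of the point above, here is a minimal Python sketch of how one attribute per capturing group, plus group 0, can double the stored text when the regex matches the whole value. The function name `extract_attributes`, the `body` prefix, and the `name.N` attribute naming are assumptions made for this example, not the processor's actual code.

```python
import re


def extract_attributes(name, pattern, text, include_group_zero=True):
    """Build one attribute per capturing group of the first match.

    Hypothetical helper: the "name.N" naming is only illustrative.
    """
    match = re.search(pattern, text)
    if match is None:
        return {}
    attrs = {}
    first = 0 if include_group_zero else 1
    # match.re.groups is the number of capturing groups in the pattern.
    for index in range(first, match.re.groups + 1):
        value = match.group(index)
        if value is not None:
            attrs[f"{name}.{index}"] = value
    return attrs


# A 10 KB value captured by a regex that matches the entire input.
text = "x" * (10 * 1024)
with_zero = extract_attributes("body", r"(?s)(^.*$)", text)
without_zero = extract_attributes("body", r"(?s)(^.*$)", text,
                                  include_group_zero=False)

chars_with_zero = sum(len(v) for v in with_zero.values())
chars_without_zero = sum(len(v) for v in without_zero.values())
print(sorted(with_zero), chars_with_zero)
print(sorted(without_zero), chars_without_zero)
```

In this sketch, group 0 and group 1 hold the same 10 KB string, so keeping group 0 stores the text twice.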
You can actually disable capturing this as an attribute by setting the
Thank you Mark - it looks like attributes are to blame. I'm adding lots
of UpdateAttribute processors to delete them as soon as they are not needed,
and disk IO has dropped.
Right now, it's all going to 'spinning rust' - soon to all new SSDs, but
either way, this needed addressing.
One oddity, is when I
Some thoughts… putting 10 KB of text into an attribute probably isn’t ideal.
Is there another way perhaps to accomplish what you’re doing?
Also your nifi.flowfile.repository.checkpoint.interval is pretty high. I’d consider
lowering this considerably…
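To put the attribute size in perspective, here is a back-of-the-envelope Python sketch. It assumes the roughly 500k queued FlowFiles reported elsewhere in this thread and exactly 10 KB of attribute text per FlowFile. Both inputs are rough, and the repository's real on-disk encoding will differ, so treat the result as an order-of-magnitude estimate only.

```python
def attribute_payload_bytes(flowfile_count, attr_bytes_per_flowfile):
    """Raw attribute bytes held across all queued FlowFiles (estimate only)."""
    return flowfile_count * attr_bytes_per_flowfile


queued = 500_000            # assumed: FlowFile count reported in the thread
per_flowfile = 10 * 1024    # assumed: 10 KB of text in a single attribute

total = attribute_payload_bytes(queued, per_flowfile)
print(f"{total / 1024**3:.2f} GiB")  # prints "4.77 GiB"
```

That is in the same range as the checkpoint file sizes listed in this thread, which is consistent with large attributes dominating the flowfile repo, though it does not prove it.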
On Jul 12, 2023 at 11:18 AM -0400, Joe Obernberger
Ah ok. And 'data/5' is its own partition (same physical disk as data/4?).
And data/5 is where you see those large files? Can you show what you see
there in terms of files/sizes?
For the checkpoint period the default is 20 seconds. Am curious to
know what benefit moving to 300 seconds was
Joe,
How many FlowFiles are you processing here? Let’s say, per second? How many
processors are in those flows?
Is the FlowFile Repo a spinning disk, SSD, or NAS?
You said you’re using ExtractText to pull 10 KB into an attribute. I presume
you’re then doing something with it. So maybe you’re
Thank you Joe -
The content repo doesn't seem to be the issue - it's the flowfile repo.
Here is the section from one of the nodes:
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=50 KB
Joe
I don't recall the specific version in which we got it truly sorted, but
there was an issue with our default settings for an important content repo
property and with how we handled a mixture of large and small flowfiles written
within the same underlying slab/claim in the content repository.
Please check
Raising this thread from the dead...
Having issues with IO to the flowfile repository. NiFi will show 500k
flow files and a size of ~1.7G - but the size on disk on each of the 4
nodes is massive - over 100G, and disk IO to the flowfile spindle is
just pegged doing writes.
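For comparing the size NiFi reports with what is actually on disk, a small Python sketch like the following can total the file sizes under a repository directory. The helper names `dir_size_bytes` and `human` are made up for this example, and it sums apparent file sizes (`st_size`), which can differ from allocated disk blocks.

```python
import os


def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # Files can be rotated or deleted while we are walking.
                pass
    return total


def human(size):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Calling `human(dir_size_bytes(path))` on each node's flowfile repository directory (the path depends on your configuration) gives a number to set against the queue size shown in the UI.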
I do have