Ah ok. And 'data/5' is its own partition (same physical disk as data/4?). And data/5 is where you see those large files? Can you show what you see there in terms of files/sizes?
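For example, the output of something along these lines (adjusting the path to match your nifi.properties) would help:

du -h -d 1 /data/5/nifi_flowfile_repository
ls -lhS /data/5/nifi_flowfile_repository | head

That should show whether the bulk of the space is sitting in the write-ahead journals, the checkpoint files, or swap.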
For the checkpoint period the default is 20 seconds. Am curious to know what benefit moving to 300 seconds was giving (might be perfectly fine for some cases - just curious).

Thanks

On Wed, Jul 12, 2023 at 8:18 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Thank you Joe -
> The content repo doesn't seem to be the issue - it's the flowfile repo.
> Here is the section from one of the nodes:
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=50 KB
> nifi.content.repository.directory.default=/data/4/nifi_content_repository
> nifi.content.repository.archive.max.retention.period=2 days
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=false
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
> nifi.flowfile.repository.checkpoint.interval=300 secs
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.retain.orphaned.flowfiles=true
>
> -Joe
>
> On 7/12/2023 11:07 AM, Joe Witt wrote:
>
>> Joe
>>
>> I don't recall the specific version in which we got it truly sorted, but there was an issue with our default settings for an important content repo property and how we handled a mixture of large/small flowfiles written within the same underlying slab/claim in the content repository.
>>
>> Please check what you have in conf/nifi.properties for:
>> nifi.content.claim.max.appendable.size=
>>
>> What value do you have there? I recommend reducing it to 50KB and restarting.
>>
>> Can you show your full 'nifi.content' section from the nifi.properties?
>>
>> Thanks
>>
>> On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Raising this thread from the dead...
>> Having issues with IO to the flowfile repository. NiFi will show 500k flow files and a size of ~1.7G - but the size on disk on each of the 4 nodes is massive - over 100G, and disk IO to the flowfile spindle is just pegged doing writes.
>>
>> I do have ExtractText processors that take the flowfile content (.*) and put it into an attribute, but the sizes of these are maybe 10k at most. How can I find out which module (there are some 2,200) is causing the issue? I think I'm doing something fundamentally wrong with NiFi. :) Perhaps I should change the size of all the queues to something less than 10k/1G?
>>
>> Under cluster/FLOWFILE STORAGE, one of the nodes shows 3.74 TBytes of usage, but it's actually ~150G on disk. The other nodes are correct.
>>
>> Ideas on what to debug?
>> Thank you!
>>
>> -Joe (NiFi 1.18)
>>
>> On 3/22/2023 12:49 PM, Mark Payne wrote:
>>
>> OK. So changing the checkpoint interval to 300 seconds might help reduce IO a bit. But it will cause the repo to become much larger, and it will take much longer to start up whenever you restart NiFi.
>>
>> The variance in size between nodes is likely due to how recently each has checkpointed. If one stays large like 31 GB while the others stay small, that would be interesting to know.
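>>
>> For reference, going back to the default is just a matter of restoring that one line in nifi.properties, i.e. something like:
>>
>> nifi.flowfile.repository.checkpoint.interval=20 secs
>>
>> (20 seconds being the stock default).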
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thanks for this Mark. I'm not seeing any large attributes at the moment but will go through this and verify - but I did have one queue that was set to 100k instead of 10k.
>> I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5) and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up from 20).
>>
>> While it's running, the size of the flowfile repo varies (wildly?) on each of the nodes, from 1.5G to over 30G. Disk IO is still very high, but it's running now and I can use the UI. Interestingly, at this point the UI shows 677k files and 1.5G of flow, but disk usage on the flowfile repo is 31G, 3.7G, and 2.6G on the 3 nodes. I'd love to throw some SSDs at this problem. I can add more nifi nodes.
>>
>> -Joe
>>
>> On 3/22/2023 11:08 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> The errors noted indicate that NiFi cannot communicate with the registry. Either the registry is offline, NiFi's Registry Client is not configured properly, there's a firewall in the way, etc.
>>
>> A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
>> - You have a huge number of FlowFiles (doesn't seem to be the case)
>> - FlowFiles have a huge number of attributes, or
>> - FlowFiles have 1 or more huge attribute values.
>>
>> Typically, FlowFile attributes should be kept minimal and should never contain chunks of the FlowFile content. Often when we see this type of behavior it's due to using something like ExtractText or EvaluateJsonPath to put large blocks of content into attributes.
>>
>> And in this case, setting the Backpressure Threshold above 10,000 is even more concerning, as it means even greater disk I/O.
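>>
>> If you do need ExtractText, it may be worth double-checking its Maximum Capture Group Length property, which - if I recall correctly - defaults to something like:
>>
>> Maximum Capture Group Length: 1024
>>
>> That caps how much content a pattern like (.*) can copy into a single attribute. But the safer pattern is to extract only the small fields you actually need.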
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 11:01 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thank you Mark. These are SATA drives - but there's no way for the flowfile repo to be on multiple spindles. It's not huge - maybe 35G per node.
>> I do see a lot of messages like this in the log:
>>
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>>
>> A clue?
>>
>> -Joe
>>
>> On 3/22/2023 10:49 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> 1.8 million FlowFiles is not a concern. But when you say "Should I reduce the queue sizes?" it makes me wonder if they're all in a single queue? Generally, you should leave the backpressure threshold at the default 10,000 FlowFile max. Increasing it can lead to huge amounts of swapping, which will drastically reduce performance and increase disk utilization very significantly.
>>
>> Also, from the diagnostics it looks like you've got a lot of CPU cores, but you're not using much of them. And based on the amount of disk space available and the fact that you're seeing 100% utilization, I'm wondering if you're using spinning disks rather than SSDs? I would highly recommend always running NiFi with SSD/NVMe drives. Absent that, if you have multiple disk drives, you could also configure the content repository to span multiple disks, in order to spread that load.
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 10:41 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Thank you. Was able to get in.
>> Currently there are 1.8 million flow files and 3.2G. Is this too much for a 3-node cluster with multiple spindles each (SATA drives)?
>> Should I reduce the queue sizes?
>>
>> -Joe
>>
>> On 3/22/2023 10:23 AM, Phillip Lord wrote:
>>
>> Joe,
>>
>> If you need the UI to come back up, try setting the autoresume setting in nifi.properties to false and restarting the node(s).
>> This will bring every component/controller service up stopped/disabled, and may provide some breathing room for the UI to become available again.
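>>
>> That setting should be this line (it defaults to true):
>>
>> nifi.flowcontroller.autoResumeState=false
>>
>> Just remember to set it back once things calm down, or components will come up stopped after every restart.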
>>
>> Phil
>>
>> On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> atop shows the disk as being all red with IO - 100% utilization. There are a lot of flowfiles currently trying to run through, but I can't monitor it because... the UI won't load.
>>
>> -Joe
>>
>> On 3/22/2023 10:16 AM, Mark Payne wrote:
>>
>> Joe,
>>
>> I'd recommend taking a look at garbage collection. It is far more likely the culprit than disk I/O.
>>
>> Thanks
>> -Mark
>>
>> On Mar 22, 2023, at 10:12 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> I'm getting "java.net.SocketTimeoutException: timeout" from the user interface of NiFi when load is heavy. This is 1.18.0 running on a 3-node cluster. Disk IO is high, and when that happens I can't get into the UI to stop any of the processors.
>> Any ideas?
>>
>> I have put the flowfile repository and content repository on different disks on the 3 nodes, but disk usage is still so high that I can't get in.
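>> The layout is roughly this, per the repo sections of nifi.properties on each node:
>>
>> nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
>> nifi.content.repository.directory.default=/data/4/nifi_content_repository
>>
>> Thank you!
>>
>> -Joe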