Re: UI SocketTimeoutException - heavy IO

Joe Obernberger Wed, 12 Jul 2023 08:18:30 -0700

Thank you Joe -
The content repo doesn't seem to be the issue - it's the flowfile repo.
Here is the section from one of the nodes:


nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=50 KB
nifi.content.repository.directory.default=/data/4/nifi_content_repository
nifi.content.repository.archive.max.retention.period=2 days
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
nifi.flowfile.repository.checkpoint.interval=300 secs
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.retain.orphaned.flowfiles=true

-Joe

On 7/12/2023 11:07 AM, Joe Witt wrote:

Joe

I don't recall the specific version in which we got it truly sorted, but there was an issue with our default settings for an important content repo property and how we handled a mixture of large/small flowfiles written within the same underlying slab/claim in the content repository.


Please check what you have for conf/nifi.properties
  nifi.content.claim.max.appendable.size=

What value do you have there? I recommend reducing it to 50 KB and restarting.


Can you show your full 'nifi.content' section from the nifi.properties?

Thanks

On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:


    Raising this thread from the dead...
    Having issues with IO to the flowfile repository.  NiFi will show
    500k flow files and a size of ~1.7G - but the size on disk on each
    of the 4 nodes is massive - over 100G, and disk IO to the flowfile
    spindle is just pegged doing writes.

    I do have ExtractText processors that take the flowfile content
    (.*) and put it into an attribute, but these are maybe 10k in size
    at most.  How can I find out what module (there
    are some 2200) is causing the issue?  I think I'm doing something
    fundamentally wrong with NiFi.  :)
    Perhaps I should change the size of all the queues to something
    less than 10k/1G?

    Under cluster/FLOWFILE STORAGE, one of the nodes shows 3.74TBytes
    of usage, but it's actually ~150G on disk.  The other nodes are
    correct.

    Ideas on what to debug?
    Thank you!

    -Joe (NiFi 1.18)

    On 3/22/2023 12:49 PM, Mark Payne wrote:

    OK. So changing the checkpoint interval to 300 seconds might help
    reduce IO a bit. But it will cause the repo to become much
    larger, and it will take much longer to startup whenever you
    restart NiFi.

    The variance in size between nodes is likely due to how recently
    it’s checkpointed. If it stays large like 31 GB while the others
    stay small, that would be interesting to know.

    Thanks
    -Mark

    On Mar 22, 2023, at 12:45 PM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thanks for this Mark.  I'm not seeing any large attributes at
    the moment but will go through this and verify - but I did have
    one queue that was set to 100k instead of 10k.
    I set the nifi.cluster.node.connection.timeout to 30 seconds (up
    from 5) and the nifi.flowfile.repository.checkpoint.interval to
    300 seconds (up from 20).
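
    For reference, a sketch of how those two changes might look in
    conf/nifi.properties. The values are the ones quoted above; the
    unit spelling ("secs") is an assumption based on the checkpoint
    line shown elsewhere in this thread.

    ```properties
    # Raised from 5 seconds to give busy nodes more time to connect
    nifi.cluster.node.connection.timeout=30 secs
    # Raised from 20 seconds so the flowfile repo checkpoints less often
    nifi.flowfile.repository.checkpoint.interval=300 secs
    ```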

    While it's running the size of the flowfile repo varies
    (wildly?) on each of the nodes from 1.5G to over 30G.  Disk IO
    is still very high, but it's running now and I can use the UI. 
    Interestingly at this point the UI shows 677k files and 1.5G of
    flow.  But disk usage on the flowfile repo is 31G, 3.7G, and
    2.6G on the 3 nodes.  I'd love to throw some SSDs at this
    problem.  I can add more nifi nodes.

    -Joe

    On 3/22/2023 11:08 AM, Mark Payne wrote:

    Joe,

    The errors noted are indicating that NiFi cannot communicate
    with registry. Either the registry is offline, NiFi’s Registry
    Client is not configured properly, there’s a firewall in the
    way, etc.

    A FlowFile repo of 35 GB is rather huge. This would imply one
    of 3 things:
    - You have a huge number of FlowFiles (doesn’t seem to be the case)
    - FlowFiles have a huge number of attributes
    or
    - FlowFiles have 1 or more huge attribute values.

    Typically, FlowFile attributes should be kept minimal and should
    never contain chunks of contents from the FlowFile content.
    Often when we see this type of behavior it’s due to using
    something like ExtractText or EvaluateJsonPath to put large
    blocks of content into attributes.

    And in this case, setting Backpressure Threshold above 10,000
    is even more concerning, as it means even greater disk I/O.

    Thanks
    -Mark

    On Mar 22, 2023, at 11:01 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thank you Mark.  These are SATA drives - but there's no way
    for the flowfile repo to be on multiple spindles.  It's not
    huge - maybe 35G per node.
    I do see a lot of messages like this in the log:

    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)

    A clue?

    -joe

    On 3/22/2023 10:49 AM, Mark Payne wrote:

    Joe,

    1.8 million FlowFiles is not a concern. But when you say
    “Should I reduce the queue sizes?” it makes me wonder if
    they’re all in a single queue?
    Generally, you should leave the backpressure threshold at the
    default 10,000 FlowFile max. Increasing this can lead to huge
    amounts of swapping, which will drastically reduce
    performance and increase disk utilization very significantly.
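
    A minimal sketch of the related defaults in conf/nifi.properties,
    assuming a NiFi 1.x install. To my understanding these values only
    seed newly created connections; existing connections keep whatever
    thresholds were set on them in the UI, so they would need to be
    checked one by one.

    ```properties
    # Default object-count backpressure threshold for new connections
    nifi.queue.backpressure.count=10000
    # Default data-size backpressure threshold for new connections
    nifi.queue.backpressure.size=1 GB
    # FlowFile count in a single queue beyond which NiFi swaps to disk
    nifi.queue.swap.threshold=20000
    ```

    With the count threshold at 10,000, a queue should normally apply
    backpressure before it grows to the swap threshold.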

    Also from the diagnostics, it looks like you’ve got a lot of
    CPU cores, but you’re not using much. And based on the amount
    of disk space available and the fact that you’re seeing 100%
    utilization, I’m wondering if you’re using spinning disks,
    rather than SSDs? I would highly recommend always running
    NiFi with SSD/NVMe drives. Absent that, if you have multiple
    disk drives, you could also configure the content repository
    to span multiple disks, in order to spread that load.
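
    A sketch of what spanning the content repository across disks
    might look like in conf/nifi.properties. The key suffixes
    (content1, content2) and the mount points are hypothetical;
    as far as I know any unique suffix after
    nifi.content.repository.directory. is accepted.

    ```properties
    # One entry per physical disk; suffixes and paths are placeholders
    nifi.content.repository.directory.content1=/data/1/nifi_content_repository
    nifi.content.repository.directory.content2=/data/2/nifi_content_repository
    ```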

    Thanks
    -Mark

    On Mar 22, 2023, at 10:41 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thank you.  Was able to get in.
    Currently there are 1.8 million flow files and 3.2G.  Is
    this too much for a 3 node cluster with multiple spindles
    each (SATA drives)?
    Should I reduce the queue sizes?

    -Joe

    On 3/22/2023 10:23 AM, Phillip Lord wrote:

    Joe,

    If you need the UI to come back up, try setting the
    autoresume setting in nifi.properties to false and restart
    node(s).
    This will bring every component/controllerService up
    stopped/disabled and may provide some breathing room for
    the UI to become available again.
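
    The setting referred to here is presumably the flow controller
    auto-resume property. A minimal sketch of the change in
    conf/nifi.properties on each node:

    ```properties
    # Start with all components stopped instead of resuming their last state
    nifi.flowcontroller.autoResumeState=false
    ```

    Once the flow is under control again, the value would likely be set
    back to true so that components resume after later restarts.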

    Phil
    On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    atop shows the disk as being all red with IO - 100%
    utilization. There are a lot of flowfiles currently trying to run
    through, but I can't monitor it because....UI won't load.

    -Joe

    On 3/22/2023 10:16 AM, Mark Payne wrote:

    Joe,

    I’d recommend taking a look at garbage collection. It is
    far more likely the culprit than disk I/O.
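
    One way to check this is to turn on GC logging in
    conf/bootstrap.conf. A sketch, assuming Java 9 or later (the
    unified -Xlog syntax; Java 8 uses different flags). The argument
    index 20 and the log path are hypothetical; the index only needs
    to be one not already used in the file.

    ```properties
    # Hypothetical index and path; writes GC events with timestamps to a log file
    java.arg.20=-Xlog:gc*:file=./logs/gc.log:time,uptime
    ```

    Long pauses in that log around the time of the UI timeouts would
    point toward garbage collection rather than disk.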

    Thanks
    -Mark

    On Mar 22, 2023, at 10:12 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    I'm getting "java.net.SocketTimeoutException: timeout"
    from the user interface of NiFi when load is heavy. This
    is 1.18.0 running on a 3 node cluster. Disk IO is high
    and when that happens, I can't get into the UI to stop
    any of the processors.
    Any ideas?

    I have put the flowfile repository and content
    repository on different disks on the 3 nodes, but disk
    usage is still so high that I can't get in.
    Thank you!

    -Joe


