Re: UI SocketTimeoutException - heavy IO

Joe Obernberger Wed, 12 Jul 2023 09:49:28 -0700

Hi Joe - yes - /data/4 and /data/5 are separate spindles, and yes, /data/5 is where the flowfile repo is, which is large.


ls -lh
-rw-r--r-- 1 root root 6.5G Jul 12 12:36 checkpoint
-rw-r--r-- 1 root root 5.2G Jul 12 12:46 checkpoint.partial
drwxr-xr-x 4 root root  132 Jul 12 12:46 journals
drwxr-xr-x 2 root root   10 Jul 11 18:18 swap


du -s -h ./*
6.5G    ./checkpoint
8.0G    ./checkpoint.partial
54G     ./journals
0       ./swap

 cd journals/
ls -l
total 53268496
-rw-r--r-- 1 root root 41727957614 Jul 12 12:46 7012019840.journal
-rw-r--r-- 1 root root  8570212495 Jul 12 12:47 7012272858.journal
drwxr-xr-x 2 root root        4096 Jul 12 12:42 overflow-7012019840
drwxr-xr-x 2 root root        4096 Jul 12 12:47 overflow-7012272858

du -s -h ./*
12G     ./7012272858.journal
4.8G    ./overflow-7012019840
4.3G    ./overflow-7012272858
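
Because these files change between one command and the next, the ls and du numbers above are hard to line up. A minimal sketch (Python, not from the thread) that reads the apparent size and the allocated size from a single stat call per entry. It assumes st_blocks counts 512-byte units, as on Linux, and it reports directories by their own inode only, not recursively; the function names are illustrative.

```python
import os
import sys


def size_report(directory):
    """Return (name, apparent_bytes, allocated_bytes) per entry, largest first."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except FileNotFoundError:
                # Files such as checkpoint.partial can vanish mid-scan.
                continue
            # st_size is what ls -l shows; st_blocks * 512 is what du counts
            # (assumes 512-byte units, which holds on Linux).
            rows.append((entry.name, st.st_size, st.st_blocks * 512))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def human(n):
    """Format a byte count using binary units."""
    for unit in ("B", "K", "M", "G"):
        if n < 1024:
            return f"{n:.1f}{unit}"
        n /= 1024
    return f"{n:.1f}T"


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, apparent, allocated in size_report(target):
        print(f"{name:40s} apparent={human(apparent):>8s} allocated={human(allocated):>8s}")
```

Run it against the flowfile repository directory, e.g. with /data/5/nifi_flowfile_repository as the only argument; a large gap between the two columns for one file would point at preallocation or a file still being written.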

-Joe

On 7/12/2023 11:27 AM, Joe Witt wrote:

Ah ok. And 'data/5' is its own partition (same physical disk as data/4?). And data/5 is where you see those large files? Can you show what you see there in terms of files/sizes?

For the checkpoint period the default is 20 seconds. Am curious to know what benefit moving to 300 seconds was giving (might be perfectly fine for some cases - just curious).


Thanks

On Wed, Jul 12, 2023 at 8:18 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:


    Thank you Joe -
    The content repo doesn't seem to be the issue - it's the flowfile
    repo.
    Here is the section from one of the nodes:

    
    nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
    nifi.content.claim.max.appendable.size=50 KB
    nifi.content.repository.directory.default=/data/4/nifi_content_repository
    nifi.content.repository.archive.max.retention.period=2 days
    nifi.content.repository.archive.max.usage.percentage=50%
    nifi.content.repository.archive.enabled=false
    nifi.content.repository.always.sync=false
    nifi.content.viewer.url=../nifi-content-viewer/

    
    nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
    nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
    nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
    nifi.flowfile.repository.checkpoint.interval=300 secs
    nifi.flowfile.repository.always.sync=false
    nifi.flowfile.repository.retain.orphaned.flowfiles=true

    -Joe

    On 7/12/2023 11:07 AM, Joe Witt wrote:

    Joe

    I don't recall the specific version in which we got it truly
    sorted, but there was an issue with our default settings for an
    important content repo property and how we handled a mixture of
    large/small flowfiles written within the same underlying
    slab/claim in the content repository.

    Please check what you have for conf/nifi.properties
      nifi.content.claim.max.appendable.size=

    What value do you have there?  I recommend reducing it to 50KB
    and restarting.

    Can you show your full 'nifi.content' section from the
    nifi.properties?

    Thanks

    On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

        Raising this thread from the dead...
        Having issues with IO to the flowfile repository. NiFi will
        show 500k flow files and a size of ~1.7G - but the size on
        disk on each of the 4 nodes is massive - over 100G, and disk
        IO to the flowfile spindle is just pegged doing writes.

        I do have ExtractText processors that take the flowfile
        content (.*) and put it into an attribute, but these are
        maybe 10k in size at most.  How can I find out
        what module (there are some 2200) is causing the issue?  I
        think I'm doing something fundamentally wrong with NiFi.  :)
        Perhaps I should change the size of all the queues to
        something less than 10k/1G?

        Under cluster/FLOWFILE STORAGE, one of the nodes shows
        3.74TBytes of usage, but it's actually ~150G on disk.  The
        other nodes are correct.

        Ideas on what to debug?
        Thank you!

        -Joe (NiFi 1.18)

        On 3/22/2023 12:49 PM, Mark Payne wrote:

        OK. So changing the checkpoint interval to 300 seconds might
        help reduce IO a bit. But it will cause the repo to become
        much larger, and it will take much longer to startup
        whenever you restart NiFi.

        The variance in size between nodes is likely due to how
        recently each one checkpointed. If one stays large, like 31 GB,
        while the others stay small, that would be interesting to know.

        Thanks
        -Mark

        On Mar 22, 2023, at 12:45 PM, Joe Obernberger
        <joseph.obernber...@gmail.com> wrote:

        Thanks for this Mark.  I'm not seeing any large attributes
        at the moment but will go through this and verify - but I
        did have one queue that was set to 100k instead of 10k.
        I set the nifi.cluster.node.connection.timeout to 30
        seconds (up from 5) and the
        nifi.flowfile.repository.checkpoint.interval to 300 seconds
        (up from 20).

        While it's running the size of the flowfile repo varies
        (wildly?) on each of the nodes from 1.5G to over 30G. Disk
        IO is still very high, but it's running now and I can use
        the UI. Interestingly at this point the UI shows 677k files
        and 1.5G of flow. But disk usage on the flowfile repo is
        31G, 3.7G, and 2.6G on the 3 nodes. I'd love to throw some
        SSDs at this problem.  I can add more nifi nodes.

        -Joe

        On 3/22/2023 11:08 AM, Mark Payne wrote:

        Joe,

        The errors noted are indicating that NiFi cannot
        communicate with registry. Either the registry is offline,
        NiFi’s Registry Client is not configured properly, there’s
        a firewall in the way, etc.

        A FlowFile repo of 35 GB is rather huge. This would imply
        one of 3 things:
        - You have a huge number of FlowFiles (doesn’t seem to be
        the case)
        - FlowFiles have a huge number of attributes
        or
        - FlowFiles have 1 or more huge attribute values.

        Typically, FlowFile attributes should be kept minimal and
        should never contain chunks of the FlowFile content. Often
        when we see this type of behavior it’s due to using
        something like ExtractText or EvaluateJsonPath to put large
        blocks of content into attributes.

        And in this case, setting Backpressure Threshold above
        10,000 is even more concerning, as it means even greater
        disk I/O.
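
As a rough back-of-envelope sketch of why attribute size matters (Python, not from the thread): multiplying the FlowFile count by the attribute payload per FlowFile gives a lower bound on what the repo has to hold, before any journal rewrites. The 677k count is the figure reported in this thread; the 200-byte and 10 KiB per-FlowFile figures are assumptions for illustration, and the function name is made up.

```python
def attribute_payload_bytes(flowfile_count, attr_bytes_per_flowfile):
    """Lower bound on attribute data held for all queued FlowFiles."""
    return flowfile_count * attr_bytes_per_flowfile


GIB = 1024 ** 3

# 677k FlowFiles is the count reported in this thread; the per-FlowFile
# attribute sizes below are assumed figures, not measurements.
small = attribute_payload_bytes(677_000, 200)        # a few short attributes
large = attribute_payload_bytes(677_000, 10 * 1024)  # content copied into an attribute

print(f"small attributes: {small / GIB:.2f} GiB")
print(f"large attributes: {large / GIB:.2f} GiB")
```

The point of the comparison is only the ratio: the same number of FlowFiles costs roughly fifty times more repository space once a 10 KiB chunk of content rides along as an attribute.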

        Thanks
        -Mark

        On Mar 22, 2023, at 11:01 AM, Joe Obernberger
        <joseph.obernber...@gmail.com> wrote:

        Thank you Mark.  These are SATA drives - but there's no
        way for the flowfile repo to be on multiple spindles. 
        It's not huge - maybe 35G per node.
        I do see a lot of messages like this in the log:

        2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
        2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
        2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
        2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)

        A clue?

        -joe

        On 3/22/2023 10:49 AM, Mark Payne wrote:

        Joe,

        1.8 million FlowFiles is not a concern. But when you say
        “Should I reduce the queue sizes?” it makes me wonder if
        they’re all in a single queue?
        Generally, you should leave the backpressure threshold
        at the default 10,000 FlowFile max. Increasing this can
        lead to huge amounts of swapping, which will drastically
        reduce performance and increase disk utilization very
        significantly.

        Also from the diagnostics, it looks like you’ve got a
        lot of CPU cores, but you’re not using much. And based
        on the amount of disk space available and the fact that
        you’re seeing 100% utilization, I’m wondering if you’re
        using spinning disks, rather than SSDs? I would highly
        recommend always running NiFi with ssd/nvme drives.
        Absent that, if you have multiple disk drives, you could
        also configure the content repository to span multiple
        disks, in order to spread that load.

        Thanks
        -Mark

        On Mar 22, 2023, at 10:41 AM, Joe Obernberger
        <joseph.obernber...@gmail.com> wrote:

        Thank you. Was able to get in.
        Currently there are 1.8 million flow files and 3.2G. 
        Is this too much for a 3 node cluster with multiple
        spindles each (SATA drives)?
        Should I reduce the queue sizes?

        -Joe

        On 3/22/2023 10:23 AM, Phillip Lord wrote:

        Joe,

        If you need the UI to come back up, try setting the
        autoresume setting in nifi.properties to false and
        restart node(s).
        This will bring every component/controllerService up
        stopped/disabled and may provide some breathing
        room for the UI to become available again.
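
The autoresume setting is presumably nifi.flowcontroller.autoResumeState; the property name is not spelled out in this message, so treat it as an assumption and check your own nifi.properties. A minimal sketch of flipping a key in properties-style text, in Python; the helper name and the sample text are illustrative only.

```python
def set_property(text, key, value):
    """Return properties text with key set to value, appending it if absent."""
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments; match only an exact key before the first '='.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


sample = "# flow controller\nnifi.flowcontroller.autoResumeState=true\n"
print(set_property(sample, "nifi.flowcontroller.autoResumeState", "false"))
```

To use it on a real node, read conf/nifi.properties into a string, pass it through the helper, write the result back (keeping a backup copy), and restart the node.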

        Phil
        On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
        <joseph.obernber...@gmail.com>, wrote:

        atop shows the disk as being all red with IO - 100%
        utilization. There are a lot of flowfiles currently trying
        to run through, but I can't monitor it because....UI won't
        load.

        -Joe

        On 3/22/2023 10:16 AM, Mark Payne wrote:

        Joe,

        I’d recommend taking a look at garbage collection.
        It is far more likely the culprit than disk I/O.

        Thanks
        -Mark

        On Mar 22, 2023, at 10:12 AM, Joe Obernberger
        <joseph.obernber...@gmail.com> wrote:

        I'm getting "java.net.SocketTimeoutException:
        timeout" from the user interface of NiFi when load
        is heavy. This is 1.18.0 running on a 3 node
        cluster. Disk IO is high and when that happens, I
        can't get into the UI to stop any of the processors.
        Any ideas?

        I have put the flowfile repository and content
        repository on different disks on the 3 nodes, but
        disk usage is still so high that I can't get in.
        Thank you!

        -Joe

