Re: UI SocketTimeoutException - heavy IO

Joe Obernberger Wed, 12 Jul 2023 09:38:29 -0700

Thank you Mark - it looks like attributes are to blame. I'm adding lots of UpdateAttribute processors to delete them as soon as they are not needed, and disk IO has dropped. Right now, it's all going to 'spinning rust' - soon to be all new SSDs, but either way, this needed addressing.

One oddity is that when I do ExtractText to a property (call it value) of (.*), I'll see value, value.0, and value.1 in the attributes list. Not sure why it makes multiple copies.
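
A possible explanation, going by the ExtractText documentation: the first capture group is stored under the property name itself, and every group is also stored under an indexed name, with group 0 (the whole match) included when "Include Capture Group 0" is left at its default of true. The Python sketch below only models that naming scheme; the function name and its parameters are hypothetical and this is not NiFi code.

```python
import re

def extract_text_attributes(name, pattern, content, include_group_zero=True):
    """Rough model of how ExtractText names the attributes it creates."""
    match = re.search(pattern, content)
    if match is None:
        return {}
    attrs = {}
    group_count = match.re.groups
    # The first capture group, if present, goes under the bare property name.
    if group_count >= 1 and match.group(1) is not None:
        attrs[name] = match.group(1)
    # Every group is also stored under an indexed name; group 0 is the whole match.
    first_index = 0 if include_group_zero else 1
    for i in range(first_index, group_count + 1):
        if match.group(i) is not None:
            attrs[f"{name}.{i}"] = match.group(i)
    return attrs

attrs = extract_text_attributes("value", r"(.*)", "hello")
print(attrs)
```

With a pattern of (.*), the whole match and group 1 are the same text, so the same content is stored three times per FlowFile. Under this model, turning off group 0 would still leave value and value.1.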


-Joe

On 7/12/2023 11:27 AM, Mark Payne wrote:

Joe,

How many FlowFiles are you processing here? Let’s say, per second? How many processors are in those flows?


Is the FlowFile Repo a spinning disk, SSD, or NAS?

You said you’re using ExtractText to pull 10 KB into an attribute. I presume you’re then doing something with it. So maybe you’re extracting a few parts of it using jsonPath in expression language or whatever the case may be. So that one 10 KB attribute is not the only attribute you have. So theoretically, let’s consider:


- Total of all attributes for a FlowFile is 20 KB
- You process an average of 1,000 FlowFiles per second
- Each FlowFile goes through 15 processors, each of which modifies at least one attribute.

That means you’re writing to the flowfile repository about 20 KB * 1000 * 15 per second - or about 300 MB/sec.

This is why we constantly warn against creating large attributes. Attributes are meant to be on the order of say 100-200 characters - not 10 KB. If you’re processing a few thousand FlowFiles per hour then 10 KB is fine, but if you’re processing a bunch of FlowFiles it adds up very quickly.
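
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is hypothetical, it assumes each attribute-modifying processor causes the FlowFile's full attribute set to be written once, and it uses decimal units (1 KB = 1,000 bytes).

```python
def repo_write_rate_mb_per_sec(attr_bytes_per_flowfile, flowfiles_per_sec, processors_per_flowfile):
    """Estimate FlowFile repository write volume in MB/sec (decimal units)."""
    # Assumption: each processor that touches an attribute rewrites all attributes once.
    bytes_per_sec = attr_bytes_per_flowfile * flowfiles_per_sec * processors_per_flowfile
    return bytes_per_sec / 1_000_000

# The scenario from the list: 20 KB of attributes, 1,000 FlowFiles/sec, 15 processors.
large = repo_write_rate_mb_per_sec(20_000, 1_000, 15)

# The same flow with roughly 200 bytes of attributes per FlowFile.
small = repo_write_rate_mb_per_sec(200, 1_000, 15)

print(f"large attributes: {large} MB/sec, small attributes: {small} MB/sec")
```

Under the same assumptions, keeping attributes near 200 bytes brings the estimate down from about 300 MB/sec to about 3 MB/sec.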


Thanks
-Mark

On Jul 12, 2023, at 11:16 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:


Thank you Joe -
The content repo doesn't seem to be the issue - it's the flowfile repo.
Here is the section from one of the nodes:

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=50 KB
nifi.content.repository.directory.default=/data/4/nifi_content_repository
nifi.content.repository.archive.max.retention.period=2 days
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=/data/5/nifi_flowfile_repository
nifi.flowfile.repository.checkpoint.interval=300 secs
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.retain.orphaned.flowfiles=true

-Joe

On 7/12/2023 11:07 AM, Joe Witt wrote:

Joe

I don't recall the specific version in which we got it truly sorted, but there was an issue with our default settings for an important content repo property and how we handled a mixture of large/small flowfiles written within the same underlying slab/claim in the content repository.


Please check what you have for conf/nifi.properties
  nifi.content.claim.max.appendable.size=

What value do you have there? I recommend reducing it to 50 KB and restarting.


Can you show your full 'nifi.content' section from the nifi.properties?
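
One quick way to pull just that section out of the file is sketched below. The helper name and the sample text are hypothetical, and the parsing is deliberately simple (plain key=value lines, # comments); in practice you would read the text from your own conf/nifi.properties.

```python
def properties_with_prefix(text, prefix):
    """Return the key=value pairs whose key starts with the given prefix."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith(prefix):
            result[key] = value.strip()
    return result

# Hypothetical excerpt standing in for the contents of conf/nifi.properties.
sample = """
# excerpt
nifi.content.claim.max.appendable.size=50 KB
nifi.content.repository.archive.enabled=false
nifi.flowfile.repository.checkpoint.interval=300 secs
"""

props = properties_with_prefix(sample, "nifi.content.")
for key, value in props.items():
    print(f"{key}={value}")
```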

Thanks

On Wed, Jul 12, 2023 at 7:54 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:


    Raising this thread from the dead...
    Having issues with IO to the flowfile repository.  NiFi will
    show 500k flow files and a size of ~1.7G - but the size on disk
    on each of the 4 nodes is massive - over 100G, and disk IO to
    the flowfile spindle is just pegged doing writes.

    I do have ExtractText processors that take the flowfile content
    (.*) and put it into an attribute, but these are maybe 10k in
    size at most.  How can I find out what module (there are some
    2200) is causing the issue?  I think I'm doing something
    fundamentally wrong with NiFi.  :)
    Perhaps I should change the size of all the queues to something
    less than 10k/1G?

    Under cluster/FLOWFILE STORAGE, one of the nodes shows
    3.74TBytes of usage, but it's actually ~150G on disk.  The other
    nodes are correct.

    Ideas on what to debug?
    Thank you!

    -Joe (NiFi 1.18)

    On 3/22/2023 12:49 PM, Mark Payne wrote:

    OK. So changing the checkpoint interval to 300 seconds might
    help reduce IO a bit. But it will cause the repo to become much
    larger, and it will take much longer to startup whenever you
    restart NiFi.

    The variance in size between nodes is likely due to how
    recently it’s checkpointed. If it stays large like 31 GB while
    the others stay small, that would be interesting to know.

    Thanks
    -Mark

    On Mar 22, 2023, at 12:45 PM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thanks for this Mark.  I'm not seeing any large attributes at
    the moment but will go through this and verify - but I did
    have one queue that was set to 100k instead of 10k.
    I set the nifi.cluster.node.connection.timeout to 30 seconds
    (up from 5) and the
    nifi.flowfile.repository.checkpoint.interval to 300 seconds
    (up from 20).

    While it's running the size of the flowfile repo varies
    (wildly?) on each of the nodes from 1.5G to over 30G.  Disk IO
    is still very high, but it's running now and I can use the
    UI.  Interestingly at this point the UI shows 677k files and
    1.5G of flow.  But disk usage on the flowfile repo is 31G,
    3.7G, and 2.6G on the 3 nodes.  I'd love to throw some SSDs at
    this problem.  I can add more nifi nodes.

    -Joe

    On 3/22/2023 11:08 AM, Mark Payne wrote:

    Joe,

    The errors noted are indicating that NiFi cannot communicate
    with registry. Either the registry is offline, NiFi’s
    Registry Client is not configured properly, there’s a
    firewall in the way, etc.

    A FlowFile repo of 35 GB is rather huge. This would imply one
    of 3 things:
    - You have a huge number of FlowFiles (doesn’t seem to be the
    case)
    - FlowFiles have a huge number of attributes
    or
    - FlowFiles have 1 or more huge attribute values.

    Typically, FlowFile attributes should be kept minimal and
    should never contain chunks of contents from the FlowFile
    content. Often when we see this type of behavior it’s due to
    using something like ExtractText or EvaluateJsonPath to put
    large blocks of content into attributes.

    And in this case, setting Backpressure Threshold above 10,000
    is even more concerning, as it means even greater disk I/O.

    Thanks
    -Mark

    On Mar 22, 2023, at 11:01 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thank you Mark. These are SATA drives - but there's no way
    for the flowfile repo to be on multiple spindles.  It's not
    huge - maybe 35G per node.
    I do see a lot of messages like this in the log:

    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
    2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)

    A clue?

    -joe

    On 3/22/2023 10:49 AM, Mark Payne wrote:

    Joe,

    1.8 million FlowFiles is not a concern. But when you say
    “Should I reduce the queue sizes?” it makes me wonder if
    they’re all in a single queue?
    Generally, you should leave the backpressure threshold at
    the default 10,000 FlowFile max. Increasing this can lead
    to huge amounts of swapping, which will drastically reduce
    performance and increase disk utilization very significantly.

    Also from the diagnostics, it looks like you’ve got a lot
    of CPU cores, but you’re not using much. And based on the
    amount of disk space available and the fact that you’re
    seeing 100% utilization, I’m wondering if you’re using
    spinning disks, rather than SSDs? I would highly recommend
    always running NiFi with ssd/nvme drives. Absent that, if
    you have multiple disk drives, you could also configure the
    content repository to span multiple disks, in order to
    spread that load.

    Thanks
    -Mark

    On Mar 22, 2023, at 10:41 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    Thank you. Was able to get in.
    Currently there are 1.8 million flow files and 3.2G.  Is
    this too much for a 3 node cluster with multiple spindles
    each (SATA drives)?
    Should I reduce the queue sizes?

    -Joe

    On 3/22/2023 10:23 AM, Phillip Lord wrote:

    Joe,

    If you need the UI to come back up, try setting the
    autoresume setting in nifi.properties to false and
    restart node(s).
    This will bring every component/controllerService up
    stopped/disabled and may provide some breathing room for
    the UI to become available again.

    Phil
    On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    atop shows the disk as being all red with IO - 100%
    utilization. There are a lot of flowfiles currently trying to
    run through, but I can't monitor it because... the UI won't load.

    -Joe

    On 3/22/2023 10:16 AM, Mark Payne wrote:

    Joe,

    I’d recommend taking a look at garbage collection. It
    is far more likely the culprit than disk I/O.

    Thanks
    -Mark

    On Mar 22, 2023, at 10:12 AM, Joe Obernberger
    <joseph.obernber...@gmail.com> wrote:

    I'm getting "java.net.SocketTimeoutException: timeout"
    from the user interface of NiFi when load is heavy.
    This is 1.18.0 running on a 3 node cluster. Disk IO is
    high and when that happens, I can't get into the UI to
    stop any of the processors.
    Any ideas?

    I have put the flowfile repository and content
    repository on different disks on the 3 nodes, but disk
    usage is still so high that I can't get in.
    Thank you!

    -Joe

