[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844687#comment-16844687 ]
Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 9:58 AM:
---------------------------------------------------------------------

The CPU cycles that get burned are reported to be "system", not "user" cycles. We've obtained a Call Tree graph from JProfiler, from which I included a screenshot (the entire thing is 18MB). Just under the section from which I created this screenshot, there is another 10.2% native CPU usage for {{java.lang.System.currentTimeMillis}}, which is called from within {{org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.process}}.

From what I can deduce, the processor iterates over the snippet below a lot. We don't see {{Broken connection}} or {{Create a new selector. Selected is 0, delta =}} being logged, which suggests that the workaround for the epoll / spinning-selector bug is not even being considered.

{code:title=org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#run|java}
/* (...snip...) */

// This select has a timeout so that we can manage
// idle session when we get out of the select every
// second. (note : this is a hack to avoid creating
// a dedicated thread).
long t0 = System.currentTimeMillis();
int selected = select(SELECT_TIMEOUT);
long t1 = System.currentTimeMillis();
long delta = t1 - t0;

if (!wakeupCalled.getAndSet(false) && (selected == 0) && (delta < 100)) {
    // Last chance : the select() may have been
    // interrupted because we have had an closed channel.
    if (isBrokenConnection()) {
        LOG.warn("Broken connection");
    } else {
        // Ok, we are hit by the nasty epoll spinning.
        // Basically, there is a race condition which causes a
        // closing file descriptor not to be considered as
        // available as a selected channel, but it stopped the
        // select. The next time we will call select(), it will
        // exit immediately for the same reason, and do so
        // forever, consuming 100% CPU.
        // We have to destroy the selector, and register all
        // the socket on a new one.
        if (nbTries == 0) {
            LOG.warn("Create a new selector. Selected is 0, delta = " + delta);
            registerNewSelector();
            nbTries = 10;
        } else {
            nbTries--;
        }
    }
} else {
    nbTries = 10;
}

/* (...snip...) */
{code}

!image-2019-05-21-11-37-41-931.png|thumbnail!
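For anyone who wants to double-check the "system, not user" observation without JProfiler, the split can be approximated from inside the affected JVM with {{java.lang.management.ThreadMXBean}}. The sketch below is a hypothetical diagnostic, not part of MINA: the class and method names are made up, the {{NioProcessor}} name filter simply matches the thread naming visible in the stack trace further down, and it would have to run from a monitoring thread inside the same JVM.

{code:title=Hypothetical per-thread CPU probe (not part of MINA)|java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class NioProcessorCpuProbe {

    /** Prints total/user/system CPU time for every NioProcessor thread. */
    public static void dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.err.println("Per-thread CPU timing not supported on this JVM");
            return;
        }
        bean.setThreadCpuTimeEnabled(true);

        for (long id : bean.getAllThreadIds()) {
            ThreadInfo info = bean.getThreadInfo(id);
            if (info == null || !info.getThreadName().startsWith("NioProcessor")) {
                continue;
            }
            long totalNs = bean.getThreadCpuTime(id); // user + system, nanoseconds
            long userNs = bean.getThreadUserTime(id); // user-mode only
            if (totalNs < 0 || userNs < 0) {
                continue; // timing unavailable for this thread
            }
            System.out.printf("%s: total=%dms user=%dms system=%dms%n",
                    info.getThreadName(),
                    totalNs / 1_000_000,
                    userNs / 1_000_000,
                    (totalNs - userNs) / 1_000_000);
        }
    }
}
{code}

Sampling this twice a few seconds apart and diffing the numbers should show the system component growing much faster than the user component for a thread spinning inside the {{epoll_wait}} syscall, consistent with the JProfiler report above.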
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
>                 Key: DIRMINA-1111
>                 URL: https://issues.apache.org/jira/browse/DIRMINA-1111
>             Project: MINA
>          Issue Type: Bug
>    Affects Versions: 2.1.2
>            Reporter: Guus der Kinderen
>            Priority: Major
>         Attachments: image-2019-05-21-11-37-41-931.png
>
> We're getting [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13] of a bug that causes 100% CPU usage on Linux (the problem does not occur on Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}
> "NioProcessor-3" #55 prio=5 os_prio=0 tid=0x00007f0408002000 nid=0x2a6a runnable [0x00007f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>         - locked <0x00000004c486b748> (a sun.nio.ch.Util$3)
>         - locked <0x00000004c486b738> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>         at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>         at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>         at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>         - <0x00000004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> More info is available at https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
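For reference, the essence of the DIRMINA-678 workaround that {{registerNewSelector()}} triggers is to abandon the broken selector and re-register every channel on a freshly opened one. The following is a minimal sketch of that idea in plain NIO, not MINA's actual implementation; the class and method names are made up for illustration.

{code:title=Hypothetical sketch of a selector rebuild (not MINA's actual code)|java}
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public final class SelectorRebuild {

    /**
     * Moves every valid registration from the (presumed broken) old
     * selector to a freshly opened one, then closes the old selector.
     */
    static Selector rebuildSelector(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            if (!key.isValid()) {
                continue; // channel already closed; nothing to migrate
            }
            SelectableChannel channel = key.channel();
            int interestOps = key.interestOps(); // must be read before cancel()
            Object attachment = key.attachment();
            key.cancel();
            channel.register(newSelector, interestOps, attachment);
        }
        oldSelector.close(); // also discards any remaining cancelled keys
        return newSelector;
    }
}
{code}

If the spin reported here were the same one DIRMINA-678 guards against, the {{delta < 100}} check in the snippet above would eventually fire, a rebuild along these lines would run, and the "Create a new selector" warning would appear in the logs. Since it never does, the 2.1.2 regression apparently burns CPU on a path that the existing guard does not detect.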