[jira] [Commented] (FTPSERVER-491) SSLConfigurationFactory.setSslProtocol never actually work
[ https://issues.apache.org/jira/browse/FTPSERVER-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845438#comment-16845438 ] Roy Lu commented on FTPSERVER-491: -- Hi [~johnnyv], how may I verify this? I was using Apache FTP through Maven. Or could you provide me the jar file built on your side?
> SSLConfigurationFactory.setSslProtocol never actually work
> --
>
> Key: FTPSERVER-491
> URL: https://issues.apache.org/jira/browse/FTPSERVER-491
> Project: FtpServer
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.1.1
> Reporter: Roy Lu
> Assignee: Jonathan Valliere
> Priority: Critical
> Labels: easyfix
> Fix For: 1.1.2
>
> The documentation says: Set the SSL protocol used for this channel. Supported values are "SSL" and "TLS". Defaults to "TLS".
> In fact, the available values are TLSv1, TLSv1.1, TLSv1.2, and SSLv3. This is mentioned at the bottom of
> [https://mina.apache.org/mina-project/userguide/ch11-ssl-filter/ch11-ssl-filter.html].
> But the thing is, the +setSslProtocol+ method actually doesn't work, because the SSL protocol set in the +SSLConfiguration+ is never used. Check +NioListener+ and you will see this: the cipher suites are copied into +sslFilter+, but no protocols. The protocols are simply missing.
> {code:java}
> if (ssl.getEnabledCipherSuites() != null) {
>     sslFilter.setEnabledCipherSuites(ssl.getEnabledCipherSuites());
> }
> {code}
> This leads to a problem: in +SSLHandler+, the protocols are copied from the filter into +sslEngine+. Because the protocols were lost when the filter was built, the protocol setting never takes effect.
> {code:java}
> if (this.sslFilter.getEnabledCipherSuites() != null) {
>     this.sslEngine.setEnabledCipherSuites(this.sslFilter.getEnabledCipherSuites());
> }
> if (this.sslFilter.getEnabledProtocols() != null) {
>     this.sslEngine.setEnabledProtocols(this.sslFilter.getEnabledProtocols());
> }
> {code}
> I found this because I scanned FTP with Nmap.
> I set it to critical because it's a security issue and hope it can be fixed soon.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
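The missing propagation described in the report can be sketched as follows. `SslConfig`, `Filter`, `buildFilterBuggy` and `buildFilterFixed` are hypothetical stand-ins for `SSLConfiguration`, `SslFilter` and the `NioListener` wiring, not the real FtpServer API; only the copying logic mirrors the issue.

```java
// Sketch of the propagation bug reported above: the listener copies the
// cipher suites from the SSL configuration into the filter, but silently
// drops the protocols, so setSslProtocol never reaches the engine.
public class ProtocolPropagation {
    static class SslConfig {
        String[] enabledCipherSuites;
        String[] enabledProtocols; // what setSslProtocol would populate
    }

    static class Filter {
        String[] enabledCipherSuites;
        String[] enabledProtocols;
    }

    // Buggy variant, as quoted from NioListener: only cipher suites copied.
    static Filter buildFilterBuggy(SslConfig ssl) {
        Filter f = new Filter();
        if (ssl.enabledCipherSuites != null) {
            f.enabledCipherSuites = ssl.enabledCipherSuites;
        }
        return f; // protocols silently dropped
    }

    // Fixed variant: propagate the protocols as well.
    static Filter buildFilterFixed(SslConfig ssl) {
        Filter f = buildFilterBuggy(ssl);
        if (ssl.enabledProtocols != null) {
            f.enabledProtocols = ssl.enabledProtocols;
        }
        return f;
    }

    public static void main(String[] args) {
        SslConfig ssl = new SslConfig();
        ssl.enabledProtocols = new String[] { "TLSv1.2" };
        System.out.println(buildFilterBuggy(ssl).enabledProtocols == null); // true
        System.out.println(buildFilterFixed(ssl).enabledProtocols[0]);      // TLSv1.2
    }
}
```

Because `SSLHandler` only forwards what the filter holds, a `null` protocol array in the filter means the engine keeps its JVM defaults, which is exactly what an Nmap scan would reveal.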
[jira] [Assigned] (DIRMINA-1112) Backport DIRMINA-1107 to 2.0.X
[ https://issues.apache.org/jira/browse/DIRMINA-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Valliere reassigned DIRMINA-1112: -- Assignee: Jonathan Valliere > Backport DIRMINA-1107 to 2.0.X > -- > > Key: DIRMINA-1112 > URL: https://issues.apache.org/jira/browse/DIRMINA-1112 > Project: MINA > Issue Type: Bug > Affects Versions: 2.0.22 > Reporter: Jonathan Valliere > Assignee: Jonathan Valliere > Priority: Major > > Backport DIRMINA-1107 to 2.0.X branch
[jira] [Resolved] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Valliere resolved DIRMINA-1107. Resolution: Fixed
> SslHandler flushScheduledEvents race condition, redux
> -
>
> Key: DIRMINA-1107
> URL: https://issues.apache.org/jira/browse/DIRMINA-1107
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Fix For: 2.1.3
>
> DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally replaces it with another multithreading issue.
> The fix for DIRMINA-1019 introduces a counter that contains the number of events to be processed. A simplified version of the code is included below.
> {code:java}
> private final AtomicInteger scheduledEvents = new AtomicInteger(0);
>
> void flushScheduledEvents() {
>     scheduledEvents.incrementAndGet();
>     if (sslLock.tryLock()) {
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (scheduledEvents.decrementAndGet() > 0);
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }{code}
> We have observed occasions where the value of {{scheduledEvents}} becomes negative, while at the same time {{filterWriteEventQueue}} goes unprocessed.
> We suspect that this issue is triggered by a concurrency issue caused by the first thread decrementing the counter after a second thread incremented it, but before it attempted to acquire the lock.
> This allows the first thread to empty the queues, decrement the counter to zero and release the lock, after which the second thread acquires the lock successfully. Now, the second thread processes any elements in {{filterWriteEventQueue}}, and then processes any elements in {{messageReceivedEventQueue}}. If in between these two checks yet another thread adds a new element to {{filterWriteEventQueue}}, this element can go unprocessed (as the second thread does not loop, since the counter is zero or negative, and the third thread can fail to acquire the lock).
> It's a seemingly unlikely scenario, but we are observing the behavior when our systems are under high load.
> We've applied a code change after which this problem is no longer observed. We've removed the counter, and check the size of the queues instead:
> {code:java}
> void flushScheduledEvents() {
>     if (sslLock.tryLock()) {
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty());
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }{code}
> This code change, as illustrated above, does introduce a new potential problem. Theoretically, an event could be added to the queues and {{flushScheduledEvents}} be called, with {{sslLock.tryLock()}} returning {{false}}, exactly after another thread just finished the {{while}} loop, but before releasing the lock. This again would cause events to go unprocessed.
> We've not observed this problem in the wild yet, but we're uncomfortable applying this change as-is.
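The queue-size variant described in the report can be reduced to a runnable fragment. The names follow the quoted snippet; the event-handling bodies are stand-ins (a counter), and this illustrates only the locking and looping structure, not the real SslHandler.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the queue-size flush loop proposed in DIRMINA-1107.
public class FlushSketch {
    final ReentrantLock sslLock = new ReentrantLock();
    final Queue<Object> filterWriteEventQueue = new ConcurrentLinkedQueue<>();
    final Queue<Object> messageReceivedEventQueue = new ConcurrentLinkedQueue<>();
    int processed; // counts handled events, for illustration only

    void flushScheduledEvents() {
        // tryLock: a thread that loses the race relies on the lock holder
        // re-checking the queue sizes to drain its events as well.
        if (sslLock.tryLock()) {
            try {
                Object event;
                do {
                    while ((event = filterWriteEventQueue.poll()) != null) {
                        processed++; // ... handle filter-write event
                    }
                    while ((event = messageReceivedEventQueue.poll()) != null) {
                        processed++; // ... handle message-received event
                    }
                    // Re-check sizes: events enqueued between the two inner
                    // loops trigger another pass instead of being stranded.
                } while (!filterWriteEventQueue.isEmpty()
                        || !messageReceivedEventQueue.isEmpty());
            } finally {
                sslLock.unlock();
            }
        }
    }
}
```

The residual race the reporter worries about sits in the gap between the final `isEmpty()` checks and `unlock()`: an enqueue plus a failed `tryLock()` in that window leaves an event behind until the next flush.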
[jira] [Updated] (DIRMINA-1112) Backport DIRMINA-1107 to 2.0.X
[ https://issues.apache.org/jira/browse/DIRMINA-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Valliere updated DIRMINA-1112: --- Description: Backport DIRMINA-1107 to 2.0.X branch > Backport DIRMINA-1107 to 2.0.X > -- > > Key: DIRMINA-1112 > URL: https://issues.apache.org/jira/browse/DIRMINA-1112 > Project: MINA > Issue Type: Bug > Affects Versions: 2.0.22 > Reporter: Jonathan Valliere > Priority: Major > > Backport DIRMINA-1107 to 2.0.X branch
[jira] [Created] (DIRMINA-1112) Backport DIRMINA-1107 to 2.0.X
Jonathan Valliere created DIRMINA-1112: -- Summary: Backport DIRMINA-1107 to 2.0.X Key: DIRMINA-1112 URL: https://issues.apache.org/jira/browse/DIRMINA-1112 Project: MINA Issue Type: Bug Affects Versions: 2.0.22 Reporter: Jonathan Valliere
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845261#comment-16845261 ] Emmanuel Lecharny commented on DIRMINA-1107: Can we backport the fix to 2.0?
> SslHandler flushScheduledEvents race condition, redux
[jira] [Resolved] (SSHD-920) Scp Subsystem does not support java.nio.file.FileSystem
[ https://issues.apache.org/jira/browse/SSHD-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Goldstein Lyor resolved SSHD-920. - Resolution: Implemented Assignee: Goldstein Lyor Fix Version/s: 2.3.0 Please try the latest (snapshot) version - fixed by SSHD-893
> Scp Subsystem does not support java.nio.file.FileSystem
> ---
>
> Key: SSHD-920
> URL: https://issues.apache.org/jira/browse/SSHD-920
> Project: MINA SSHD
> Issue Type: Bug
> Affects Versions: 2.2.0
> Reporter: Fulvio Cavarretta
> Assignee: Goldstein Lyor
> Priority: Blocker
> Fix For: 2.3.0
>
> If you plug a custom java.nio.file.FileSystem into Apache SSHD, some scp use cases are not supported. Specifically, a download from apache-sshd using a wildcard, such as:
> scp user1@apache-sshd:/home/user1/* /tmp
> does not succeed. This is because in *ScpFileOpener.getMatchingFilesToSend* the class *DirectoryScanner*, used to retrieve the list of files to be transferred, does not rely on the java.nio.file.FileSystem plugged into SSHD, but leverages the local file system.
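The fix direction implied by the report is to resolve wildcards through whatever `java.nio.file.FileSystem` the session's paths belong to, rather than through `java.io` and the local disk. `GlobScanner` below is a hypothetical illustration, not the `DirectoryScanner` API: the key point is that the `PathMatcher` comes from `fs.getPathMatcher()`, so a plugged-in FileSystem is honoured automatically.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;

// Sketch: match a scp-style wildcard against file names using the
// PathMatcher of the FileSystem the paths belong to.
public class GlobScanner {
    static List<Path> matching(FileSystem fs, List<Path> candidates, String glob) {
        // "glob:" syntax is defined by FileSystem.getPathMatcher; a custom
        // FileSystem supplies its own matcher here.
        PathMatcher matcher = fs.getPathMatcher("glob:" + glob);
        List<Path> result = new ArrayList<>();
        for (Path p : candidates) {
            if (p.getFileName() != null && matcher.matches(p.getFileName())) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FileSystem fs = FileSystems.getDefault();
        List<Path> files = List.of(fs.getPath("/home/user1/a.txt"),
                                   fs.getPath("/home/user1/b.log"));
        System.out.println(matching(fs, files, "*.txt").size()); // 1
    }
}
```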
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844936#comment-16844936 ] Jonathan Valliere commented on DIRMINA-1111: {{increaseWrittenBytes}} automatically decrements {{scheduledWriteBytes}} (which is a horrible idea, but that is how it works). We could just force-set {{scheduledWriteBytes}} to 0 when {{clearWriteRequestQueue()}} is called.
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
> We're getting [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13] of a bug that causes 100% CPU usage on Linux (the problem does not occur on Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}
> "NioProcessor-3" #55 prio=5 os_prio=0 tid=0x7f0408002000 nid=0x2a6a runnable [0x7f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>     - locked <0x0004c486b748> (a sun.nio.ch.Util$3)
>     - locked <0x0004c486b738> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x0004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>     at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>     at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>     at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>     - <0x0004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> More info is available at https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
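The counter behaviour discussed in this thread can be sketched in isolation. The field and method names echo the MINA ones, but this is an illustration, not the real `AbstractIoSession`: because `increaseWrittenBytes` also decrements `scheduledWriteBytes`, a session disposed with unwritten data leaves a non-zero residue unless the clear path zeroes the counter explicitly, as suggested in the comment above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the scheduled-write-bytes accounting and the proposed
// force-set-to-zero on queue clearing. Hypothetical stand-in class.
public class WriteStatsSketch {
    final AtomicLong scheduledWriteBytes = new AtomicLong();
    final AtomicLong writtenBytes = new AtomicLong();

    void scheduleWrite(long bytes) {
        scheduledWriteBytes.addAndGet(bytes);
    }

    void increaseWrittenBytes(long bytes) {
        writtenBytes.addAndGet(bytes);
        scheduledWriteBytes.addAndGet(-bytes); // the coupled decrement
    }

    // Suggested fix: drop whatever is still scheduled when the
    // write-request queue is cleared on session disposal.
    void clearWriteRequestQueue() {
        scheduledWriteBytes.set(0);
    }
}
```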
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844890#comment-16844890 ] Emmanuel Lecharny commented on DIRMINA-1111: We might update the written bytes counter directly in the {{clearWriteRequestQueue()}} method, assuming we know how many bytes have already been written. In any case, if the stats are wrong but the code works (i.e. does not eat 100% CPU), I guess it's acceptable for the moment, until we come to a rational view on how to correct the stats.
> 100% CPU (epoll bug) on 2.1.x, Linux only
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844887#comment-16844887 ] Jonathan Valliere edited comment on DIRMINA-1111 at 5/21/19 2:27 PM: - Did some more checking: the scheduledWriteBytes is always going to get messed up because it is being decremented as bytes are written. My vote is to just remove the {{buf.reset()}} line entirely.
was (Author: johnnyv): Did some more checking, the scheduledWriteBytes is always going to get messed up because it is being decrement as bytes are written. My vote it to just remove {{buf.reset()}} line entirely.
> 100% CPU (epoll bug) on 2.1.x, Linux only
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844887#comment-16844887 ] Jonathan Valliere commented on DIRMINA-1111: Did some more checking: the scheduledWriteBytes is always going to get messed up because it is being decremented as bytes are written. My vote is to just remove the {{buf.reset()}} line entirely.
> 100% CPU (epoll bug) on 2.1.x, Linux only
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844880#comment-16844880 ] Guus der Kinderen commented on DIRMINA-1111: Maybe bike-shedding, but wouldn't {{clear}} potentially cause some bytes to be 'uncounted', even if they were processed? Even if you're not using {{mark}} now, at some point in the future, that might be introduced again. Checking for a mark, and repositioning to the mark if present, otherwise to 0, seems like a desirable defensive construct.
> 100% CPU (epoll bug) on 2.1.x, Linux only
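The defensive construct suggested in the comment (reposition to the mark if present, otherwise to 0) might look like the sketch below. `ByteBuffer` exposes no query for whether a mark is set, so catching `InvalidMarkException` is one way to express it; the helper name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

// Sketch of the defensive rewind: go back to the mark when one was set,
// otherwise to position 0, without disturbing the limit the way clear() does.
public class DefensiveRewind {
    static void rewindToMarkOrStart(ByteBuffer buf) {
        try {
            buf.reset(); // succeeds only when a mark exists
        } catch (InvalidMarkException e) {
            buf.position(0); // no mark: fall back to the start
        }
    }
}
```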
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844880#comment-16844880 ] Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 2:17 PM: - Maybe bike-shedding, but wouldn't {{clear}} potentially cause some bytes to be 'uncounted', even if they were processed? Even if you're not using {{mark}} now, at some point in the future, that might be introduced again. Checking for a mark, and repositioning to the mark if present, otherwise to 0, seems like a desirable defensive construct.
was (Author: guus.der.kinderen): Maybe bike-shidding, but wouldn't {{clear}} potentially cause some bytes to be 'uncounted', even if they were processed? Even if you're not using {{mark}} now, at some point in the future, that might be introduced again. Checking for a mark, and repositioning to the mark if present, otherwise to 0, seems like a desirable defensive construct.
> 100% CPU (epoll bug) on 2.1.x, Linux only
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844847#comment-16844847 ] Emmanuel Lecharny commented on DIRMINA-1111: Ok, after some history digging, I found out why {{buff.reset()}} was called. This is an 11-year-old addition: [http://svn.apache.org/viewvc?view=revision=592218]. The rationale, AFAICT, was to be sure the stats about the number of written bytes are correct: we should not count the non-written bytes still present in a buffer when the session is disposed. Calling {{clear()}} instead of {{reset()}} - which only works when a {{mark()}} has been set - seems the right thing to do. Note that with {{2.0.X}} we don't have this issue because we call {{mark}}, something we don't do anymore with {{2.1.X}}.
> 100% CPU (epoll bug) on 2.1.x, Linux only
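The distinction drawn in the comment above can be demonstrated directly: `reset()` throws unless a `mark()` was set earlier (which 2.1.X no longer does), while `clear()` always rewinds the position to 0 and the limit to the capacity, regardless of any mark.

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

// Demonstrates why buf.reset() blows up on 2.1.X buffers while
// buf.clear() does not.
public class ClearVsReset {
    static boolean resetThrows(ByteBuffer buf) {
        try {
            buf.reset();
            return false;
        } catch (InvalidMarkException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.position(8); // simulate partially consumed data, no mark set
        System.out.println(resetThrows(buf)); // true: no mark to reset to
        buf.clear();
        System.out.println(buf.position() + " " + buf.limit()); // 0 16
    }
}
```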
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844838#comment-16844838 ] Jonathan Valliere commented on DIRMINA-: Slice ensures that the bounds of the ByteBuffer View are correct. The {{scheduledWriteBytes}} counter is incremented and decremented atomically based on {{remaining()}}. The only purpose of {{clear()}} is to reset the position so when the {{scheduledWriteBytes}} is decremented as the {{IoSession}} closes that the counter correctly goes to zero. It should not adversely affect any functionality. > 100% CPU (epoll bug) on 2.1.x, Linux only > - > > Key: DIRMINA- > URL: https://issues.apache.org/jira/browse/DIRMINA- > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Attachments: image-2019-05-21-11-37-41-931.png > > > We're getting reports > [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13] > of a bug that causes 100% CPU usage on Linux (the problem does not occur on > Windows). > This problem occurs in 2.1.2. but does _not_ occur in 2.0.21. > Is this a regression of the epoll selector bug in DIRMINA-678 ? 
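The {{slice()}}-versus-{{duplicate()}} distinction behind this fix can be demonstrated with plain {{java.nio}} buffers (a standalone sketch; class and method names here are illustrative, not MINA API). A slice's capacity equals the parent's remaining bytes at slice time, so a later {{clear()}} restores exactly the unwritten region; a duplicate's capacity is the whole backing buffer, so {{clear()}} exposes more than was scheduled.

```java
import java.nio.ByteBuffer;

public class SliceVsDuplicate {
    static int remainingAfterClear(boolean useSlice) {
        ByteBuffer original = ByteBuffer.allocate(16);
        original.put(new byte[] {1, 2, 3, 4});   // position = 4
        original.flip();                         // position = 0, limit = 4
        original.get();
        original.get();                          // position = 2, two bytes remain
        ByteBuffer view = useSlice ? original.slice() : original.duplicate();
        view.clear();                            // position = 0, limit = capacity
        return view.remaining();
    }

    public static void main(String[] args) {
        // slice(): capacity == parent's remaining at slice time
        System.out.println(remainingAfterClear(true));
        // duplicate(): capacity == parent's full capacity
        System.out.println(remainingAfterClear(false));
    }
}
```

With {{slice()}}, clearing the view yields 2 remaining bytes (the scheduled-but-unwritten data); with {{duplicate()}}, it yields all 16, which is why the counter bookkeeping only balances with a slice.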
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844834#comment-16844834 ] Guus der Kinderen commented on DIRMINA-1111: Thanks for both your efforts on this! I still don't quite understand the fix, but am happy to accept it. {{clear()}}, unlike {{reset()}}, will prevent the {{InvalidMarkException}}. I'm not confident that using {{slice()}} instead of {{duplicate()}} sufficiently prevents the buffer from being in an unpredictable state in {{clearWriteRequestQueue}}. Isn't it possible that, after construction (slicing) of the WriteRequest and during the processing of that instance, an exception occurs that causes it to be handled by {{clearWriteRequestQueue}}? We'd still be in potential trouble there.
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844815#comment-16844815 ] Jonathan Valliere commented on DIRMINA-1111: Emmanuel is trying to figure out why {{reset()}} was in there at all. We might end up just removing that call. For the time being, this fixes the problem:
{code:java}
diff --git a/mina-core/src/main/java/org/apache/mina/core/polling/AbstractPollingIoProcessor.java b/mina-core/src/main/java/org/apache/mina/core/polling/AbstractPollingIoProcessor.java
index 1553a99..9501074 100644
--- a/mina-core/src/main/java/org/apache/mina/core/polling/AbstractPollingIoProcessor.java
+++ b/mina-core/src/main/java/org/apache/mina/core/polling/AbstractPollingIoProcessor.java
@@ -1170,7 +1170,7 @@
                 // The first unwritten empty buffer must be
                 // forwarded to the filter chain.
                 if (buf.hasRemaining()) {
-                    buf.reset();
+                    buf.clear();
                     failedRequests.add(req);
                 } else {
                     IoFilterChain filterChain = session.getFilterChain();
diff --git a/mina-core/src/main/java/org/apache/mina/core/write/DefaultWriteRequest.java b/mina-core/src/main/java/org/apache/mina/core/write/DefaultWriteRequest.java
index 8324c6a..23604fa 100644
--- a/mina-core/src/main/java/org/apache/mina/core/write/DefaultWriteRequest.java
+++ b/mina-core/src/main/java/org/apache/mina/core/write/DefaultWriteRequest.java
@@ -226,7 +226,7 @@
         if (message instanceof IoBuffer) {
             // duplicate it, so that any modification made on it
             // won't change the original message
-            this.message = ((IoBuffer)message).duplicate();
+            this.message = ((IoBuffer)message).slice();
         }
{code}
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844812#comment-16844812 ] Guus der Kinderen commented on DIRMINA-1111: I don't quite understand your suggestion. Could you illustrate it with a code snippet, please?
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844809#comment-16844809 ] Jonathan Valliere commented on DIRMINA-1107: Merged into 2.1.X [99fbf4fe64d72409b19f8835c0a568022916d840|https://gitbox.apache.org/repos/asf?p=mina.git;a=commit;h=99fbf4fe64d72409b19f8835c0a568022916d840]
> SslHandler flushScheduledEvents race condition, redux
> -----------------------------------------------------
>
> Key: DIRMINA-1107
> URL: https://issues.apache.org/jira/browse/DIRMINA-1107
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Fix For: 2.1.3
>
> DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally replaces it with another multithreading issue.
> The fix for DIRMINA-1019 introduces a counter that contains the number of events to be processed. A simplified version of the code is included below.
> {code:java}
> private final AtomicInteger scheduledEvents = new AtomicInteger(0);
>
> void flushScheduledEvents() {
>     scheduledEvents.incrementAndGet();
>
>     if (sslLock.tryLock()) {
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (scheduledEvents.decrementAndGet() > 0);
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }{code}
> We have observed occasions where the value of {{scheduledEvents}} becomes negative, while at the same time {{filterWriteEventQueue}} goes unprocessed.
> We suspect that this is triggered by the first thread decrementing the counter after a second thread incremented it, but before that second thread attempted to acquire the lock.
> This allows the first thread to empty the queues, decrement the counter to zero, and release the lock, after which the second thread acquires the lock successfully.
Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
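The queue-emptiness variant described above can be exercised as a self-contained sketch. The class name {{EventDrainer}}, the {{schedule}} method, and the {{processed}} counter are illustrative stand-ins, not MINA's actual API; real event handling replaces the counter increments. Note that the residual race the reporter describes (a producer enqueueing just as the draining thread releases the lock, after its tryLock already failed) still exists in this sketch.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class EventDrainer {
    private final ReentrantLock sslLock = new ReentrantLock();
    private final ConcurrentLinkedQueue<String> filterWriteEventQueue = new ConcurrentLinkedQueue<>();
    private final ConcurrentLinkedQueue<String> messageReceivedEventQueue = new ConcurrentLinkedQueue<>();
    final AtomicInteger processed = new AtomicInteger();  // stand-in for real event processing

    void schedule(String event, boolean isWrite) {
        (isWrite ? filterWriteEventQueue : messageReceivedEventQueue).offer(event);
        flushScheduledEvents();
    }

    void flushScheduledEvents() {
        if (sslLock.tryLock()) {
            try {
                String event;
                do {
                    while ((event = filterWriteEventQueue.poll()) != null) {
                        processed.incrementAndGet();
                    }
                    while ((event = messageReceivedEventQueue.poll()) != null) {
                        processed.incrementAndGet();
                    }
                    // Loop again if another producer enqueued while we were draining;
                    // this replaces the counter from the original code.
                } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty());
            } finally {
                sslLock.unlock();
            }
        }
    }

    public static void main(String[] args) {
        EventDrainer d = new EventDrainer();
        d.schedule("write-1", true);
        d.schedule("read-1", false);
        System.out.println(d.processed.get());
    }
}
```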
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844795#comment-16844795 ] Jonathan Valliere commented on DIRMINA-1111: Okay, I checked it. {{DefaultWriteRequest}} should use {{slice()}} to duplicate the buffer. That ensures that index 0 is the current position of the input buffer. That way, we can call {{clear()}} and fully reset the message to the correct start and end offsets.
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844793#comment-16844793 ] Guus der Kinderen commented on DIRMINA-1111: {{rewind()}} looks like a simpler solution than resetting-if-marked, but there's a small difference: the position would always revert to 0, instead of to the last marked position. I'm not sure whether that makes any difference to code that depends on this. In any case, preventing the {{InvalidMarkException}} would be achieved by both, and that is what seems to prevent the CPU churning. I'd be happy with either solution.
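The difference between {{rewind()}} and {{reset()}} discussed in this thread is easy to demonstrate with {{java.nio}} directly (standalone sketch; names are illustrative): {{reset()}} requires a prior {{mark()}} and throws {{InvalidMarkException}} on a fresh buffer, while {{rewind()}} is always legal and simply sets the position to 0.

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class ResetVsRewind {
    static boolean resetThrowsWithoutMark() {
        ByteBuffer buf = ByteBuffer.allocate(8);   // fresh buffer: no mark set
        try {
            buf.reset();                           // requires a prior mark()
            return false;
        } catch (InvalidMarkException e) {
            return true;
        }
    }

    static int positionAfterRewind() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.put((byte) 1).put((byte) 2);           // position = 2, still no mark
        buf.rewind();                              // always legal: position -> 0
        return buf.position();
    }

    public static void main(String[] args) {
        System.out.println(resetThrowsWithoutMark());
        System.out.println(positionAfterRewind());
    }
}
```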
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844786#comment-16844786 ] Jonathan Valliere edited comment on DIRMINA-1111 at 5/21/19 12:25 PM: Sorry, {{buf.rewind()}}. Honestly, it's not necessary at all, because the duplicated buffer is the one being modified; the original one is left intact. I'm not sure why {{failedRequests}} is a thing, because any failed writes should mark the {{IoSession}} as invalid and close everything.
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844783#comment-16844783 ] Guus der Kinderen commented on DIRMINA-1111: That would change the behavior - instead of rewinding the buffer to be re-read, it would effectively discard information. I'm not sure why we're resetting in the first place here, but I'm guessing it serves a purpose (to later process the remainder of the buffer?). With a reset, the data will likely be lost (although the data is not actually wiped from the buffer, recovering it would be harder).
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844777#comment-16844777 ] Jonathan Valliere commented on DIRMINA-: Replace {{buf.reset()}} with {{buf.clear()}}.

> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
> Key: DIRMINA-
> URL: https://issues.apache.org/jira/browse/DIRMINA-
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
> We're getting [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13] of a bug that causes 100% CPU usage on Linux (the problem does not occur on Windows).
> This problem occurs in 2.1.2 but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}
> "NioProcessor-3" #55 prio=5 os_prio=0 tid=0x7f0408002000 nid=0x2a6a runnable [0x7f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>       at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>       at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>       at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>       - locked <0x0004c486b748> (a sun.nio.ch.Util$3)
>       - locked <0x0004c486b738> (a java.util.Collections$UnmodifiableSet)
>       - locked <0x0004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>       at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>       at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>       at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>       at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - <0x0004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> More info is available at https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
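The suggested swap works because {{clear()}} never depends on the mark, whereas {{reset()}} throws when no mark was ever set. A minimal sketch of the difference using plain {{java.nio.ByteBuffer}} (MINA's {{IoBuffer}} follows the same mark semantics; treat that as an assumption to verify against the {{IoBuffer}} javadoc):

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class MarkDemo {
    public static void main(String[] args) {
        ByteBuffer fresh = ByteBuffer.allocate(16);

        boolean threw = false;
        try {
            fresh.reset();          // no mark has ever been set on this buffer
        } catch (InvalidMarkException e) {
            threw = true;           // reset() requires a prior mark()
        }

        fresh.clear();              // always legal: position=0, limit=capacity, mark discarded

        System.out.println("reset threw: " + threw);
        System.out.println("position after clear: " + fresh.position());
    }
}
```

Note that {{clear()}} also rewinds the position, so it is not a drop-in semantic replacement for {{reset()}} in every caller, only in places where the buffer is being discarded anyway.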
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844762#comment-16844762 ] Guus der Kinderen edited comment on DIRMINA- at 5/21/19 12:05 PM: -- Although we're not exactly sure how, we've "resolved" the issue by modifying the snippet below:

{code:java|title=Original code for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}
{code}

As said, this triggered {{InvalidMarkException}}s. After we added a simple {{try/catch}} around the reset, the CPU spikes went away. We have not investigated why this resolves the CPU spikes, but we assume that the thrown exception prevents keys from being consumed. We're still suffering from a different issue (all clients unexpectedly reconnect periodically), but we're now working on the assumption that the CPU spike is a result, not the cause, of this.

Looking further into the {{clearWriteRequestQueue}} snippet, we noticed that it is often called from exception handlers. As a consequence, the state of the buffer it operates on is likely unpredictable. The call to {{reset()}} assumes that a mark is set, but there are various scenarios where this is untrue:
* the buffer could be completely unused (a buffer fresh from the constructor will throw an {{InvalidMarkException}} here)
* the buffer might have been flipped, but not yet read.

We're uncertain whether the reset is needed at all, but if it is, we suggest explicitly checking whether the mark has been set; if it has not, we don't believe a reset is needed.

{code:java|title=Proposed fix for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    if (buf.markValue() > -1) {
        buf.reset();
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}
{code}
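For illustration, the guard can also be approximated without MINA's {{markValue()}} accessor, which plain {{java.nio.ByteBuffer}} does not expose; this sketch falls back to the try/catch workaround described above (the {{safeReset}} helper is illustrative, not MINA code):

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class GuardedReset {
    // Plain java.nio.ByteBuffer has no markValue() accessor, so the guard
    // from the proposed fix can only be approximated with a try/catch,
    // which is exactly the workaround the reporter applied.
    static void safeReset(ByteBuffer buf) {
        try {
            buf.reset();            // only valid if mark() was called earlier
        } catch (InvalidMarkException e) {
            // No mark set (e.g. a fresh or freshly-flipped buffer): nothing to rewind.
        }
    }

    public static void main(String[] args) {
        ByteBuffer fresh = ByteBuffer.allocate(8);
        safeReset(fresh);           // an unguarded reset() would have thrown here

        ByteBuffer marked = ByteBuffer.allocate(8);
        marked.position(4).mark().position(6);
        safeReset(marked);          // rewinds to the marked position, 4
        System.out.println("position: " + marked.position());
    }
}
```

Note that {{markValue() >= -1}} (as in earlier revisions of the proposal) would be a no-op guard, since the mark value is {{-1}} exactly when no mark is set; the comparison has to be strict.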
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844687#comment-16844687 ] Guus der Kinderen edited comment on DIRMINA- at 5/21/19 10:01 AM: -- The CPU cycles that get burned are reported to be "system", not "user", cycles. We've obtained a Call Tree graph from JProfiler, from which I included a screenshot (the entire thing is 18MB). Just under the section from which I created this screenshot, there's another 10.2% native CPU usage for {{java.lang.System.currentTimeMillis}}, called from within {{org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.process}}.

From what I can deduce, it looks like this snippet is iterated over a lot. We don't see {{"Broken connection"}} or {{"Create a new selector. Selected is 0, delta ="}} being logged, which suggests that the workaround for the EPOLL / spinning selector bug is not even being considered.

{code:java|title=org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#run}
/* (...snip...) */
// This select has a timeout so that we can manage
// idle session when we get out of the select every
// second. (note : this is a hack to avoid creating
// a dedicated thread).
long t0 = System.currentTimeMillis();
int selected = select(SELECT_TIMEOUT);
long t1 = System.currentTimeMillis();
long delta = t1 - t0;

if (!wakeupCalled.getAndSet(false) && (selected == 0) && (delta < 100)) {
    // Last chance : the select() may have been
    // interrupted because we have had an closed channel.
    if (isBrokenConnection()) {
        LOG.warn("Broken connection");
    } else {
        // Ok, we are hit by the nasty epoll spinning.
        // Basically, there is a race condition which causes a closing
        // file descriptor not to be considered as available as a
        // selected channel, but it stopped the select. The next time
        // we will call select(), it will exit immediately for the
        // same reason, and do so forever, consuming 100% CPU.
        // We have to destroy the selector, and register all the
        // sockets on a new one.
        if (nbTries == 0) {
            LOG.warn("Create a new selector. Selected is 0, delta = " + delta);
            registerNewSelector();
            nbTries = 10;
        } else {
            nbTries--;
        }
    }
} else {
    nbTries = 10;
}
/* (...snip...) */
{code}

!image-2019-05-21-11-37-41-931.png!
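To see how the countdown in that snippet behaves once the spin condition is permanently true, here is a stand-alone simulation (plain Java, no MINA dependency; the {{rebuilds}} counter stands in for {{registerNewSelector()}}):

```java
public class SpinCountdown {
    public static void main(String[] args) {
        // Simulate the nbTries countdown from the snippet above: each
        // instant zero-result select() decrements nbTries; after ten in a
        // row the selector is rebuilt and the counter restarts at 10.
        int nbTries = 10;
        int rebuilds = 0;

        for (int i = 0; i < 25; i++) {
            int selected = 0;   // simulated epoll spin: select() returns 0 ...
            long delta = 0;     // ... in well under 100 ms
            if ((selected == 0) && (delta < 100)) {
                if (nbTries == 0) {
                    rebuilds++;     // stands in for registerNewSelector()
                    nbTries = 10;
                } else {
                    nbTries--;
                }
            } else {
                nbTries = 10;
            }
        }
        System.out.println("rebuilds: " + rebuilds);
    }
}
```

So under a sustained spin the detector should fire roughly once per eleven loop passes; the reporter's point is that in their traces it apparently never fires at all, which suggests one of the guarding conditions ({{wakeupCalled}}, {{selected}}, or {{delta}}) is not matching.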
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844690#comment-16844690 ] Guus der Kinderen commented on DIRMINA-: {quote}Do you have a link to the OpenFire issue?{quote} No, this is being reported in a non-public project that uses Openfire.
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844687#comment-16844687 ] Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 9:58 AM:

The CPU cycles that get burned are reported as "system", not "user", cycles. We've obtained a Call Tree graph from JProfiler, from which I included a screenshot (the entire graph is 18 MB). Just under the section from which I created this screenshot, there's another 10.2% native CPU usage for {{java.lang.System.currentTimeMillis}}, called from within {{org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.process}}. From what I can deduce, execution iterates over the snippet below a lot. We don't see {{Broken connection}} or {{Create a new selector. Selected is 0, delta =}} being logged, which suggests that the workaround for the epoll spinning-selector bug is never even reached.

{code:title=org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#run|java}
/* (...snip...) */
// This select has a timeout so that we can manage
// idle sessions when we get out of the select every
// second. (Note: this is a hack to avoid creating
// a dedicated thread.)
long t0 = System.currentTimeMillis();
int selected = select(SELECT_TIMEOUT);
long t1 = System.currentTimeMillis();
long delta = t1 - t0;

if (!wakeupCalled.getAndSet(false) && (selected == 0) && (delta < 100)) {
    // Last chance: the select() may have been
    // interrupted because we have had a closed channel.
    if (isBrokenConnection()) {
        LOG.warn("Broken connection");
    } else {
        // Ok, we are hit by the nasty epoll spinning.
        // Basically, there is a race condition which causes a
        // closing file descriptor not to be considered as an
        // available selected channel, but it stopped the select.
        // The next time we call select(), it will exit immediately
        // for the same reason, and do so forever, consuming 100% CPU.
        // We have to destroy the selector and register all the
        // sockets on a new one.
        if (nbTries == 0) {
            LOG.warn("Create a new selector. Selected is 0, delta = " + delta);
            registerNewSelector();
            nbTries = 10;
        } else {
            nbTries--;
        }
    }
} else {
    nbTries = 10;
}
/* (...snip...) */
{code}

!image-2019-05-21-11-37-41-931.png!
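The {{registerNewSelector()}} workaround referenced in the snippet above boils down to re-homing every registered channel onto a freshly opened selector. Below is a minimal, self-contained sketch of that idea; it is not MINA's actual implementation, and the class and variable names are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class SelectorRebuildSketch {
    public static void main(String[] args) throws IOException {
        Selector oldSelector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress(0)); // ephemeral port
        server.register(oldSelector, SelectionKey.OP_ACCEPT);

        // Workaround for a spinning selector: move every key onto a new selector.
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            int ops = key.interestOps();     // must be read BEFORE cancel()
            key.cancel();                    // deregister from the broken selector
            key.channel().register(newSelector, ops, key.attachment());
        }
        oldSelector.close();

        System.out.println("keys on new selector: " + newSelector.keys().size());
        newSelector.close();
        server.close();
    }
}
```

Reading {{interestOps()}} before {{cancel()}} matters: most accessors on a cancelled key throw {{CancelledKeyException}}.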
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844683#comment-16844683 ] Emmanuel Lecharny commented on DIRMINA-1111:

Do you have a link to the OpenFire issue?

> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
> We're getting [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13] of a bug that causes 100% CPU usage on Linux (the problem does not occur on Windows).
> This problem occurs in 2.1.2 but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}
> "NioProcessor-3" #55 prio=5 os_prio=0 tid=0x7f0408002000 nid=0x2a6a runnable [0x7f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>         - locked <0x0004c486b748> (a sun.nio.ch.Util$3)
>         - locked <0x0004c486b738> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x0004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>         at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>         at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>         at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>         - <0x0004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> More info is available at https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guus der Kinderen updated DIRMINA-1111:
Attachment: image-2019-05-21-11-37-41-931.png
[jira] [Comment Edited] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844639#comment-16844639 ] Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 8:50 AM:

The report claims that the CPU increase coincides with {{InvalidMarkException}} instances being thrown by MINA 2.1.2's implementation of {{org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}}. Specifically, the {{InvalidMarkException}} is thrown when {{reset}} is called in this snippet:

{code:java}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}
{code}
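For context on the exception itself: {{java.nio.Buffer#reset}} throws {{InvalidMarkException}} whenever no mark has been set, or the mark was discarded (for example by {{rewind()}} or {{clear()}}). A tiny stand-alone JDK demonstration, unrelated to MINA's own buffer handling:

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class InvalidMarkDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);

        // A freshly allocated buffer has no mark, so reset() must fail.
        try {
            buf.reset();
            System.out.println("no exception");
        } catch (InvalidMarkException e) {
            System.out.println("InvalidMarkException as expected");
        }

        // With mark() called first, reset() rewinds the position to the mark.
        buf.put((byte) 1);
        buf.mark();          // mark at position 1
        buf.put((byte) 2);   // position 2
        buf.reset();         // back to position 1
        System.out.println("position after reset = " + buf.position()); // prints 1
    }
}
```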
[jira] [Commented] (DIRMINA-1111) 100% CPU (epoll bug) on 2.1.x, Linux only
[ https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844614#comment-16844614 ] Guus der Kinderen commented on DIRMINA-1111:

For what it's worth, I've received another report (unrelated to the first one) that appears to be related to this issue. This new environment also recently upgraded to MINA 2.1.2. I'm working on obtaining the profiling information.
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844611#comment-16844611 ] Emmanuel Lecharny commented on DIRMINA-1107:

Great! Jonathan, I'll let you merge the branch. We probably want to backport the change to the 2.0.X branch too. Let me know and I'll kick off a release when done. Thanks a lot!

> SslHandler flushScheduledEvents race condition, redux
> -----
>
> Key: DIRMINA-1107
> URL: https://issues.apache.org/jira/browse/DIRMINA-1107
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Fix For: 2.1.3
>
> DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally replaces it with another multithreading issue.
> The fix for DIRMINA-1019 introduces a counter that contains the number of events to be processed. A simplified version of the code is included below.
> {code:java}
> private final AtomicInteger scheduledEvents = new AtomicInteger(0);
>
> void flushScheduledEvents() {
>     scheduledEvents.incrementAndGet();
>
>     if (sslLock.tryLock()) {
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (scheduledEvents.decrementAndGet() > 0);
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }
> {code}
> We have observed occasions where the value of {{scheduledEvents}} becomes negative, while at the same time {{filterWriteEventQueue}} goes unprocessed.
> We suspect that this issue is triggered by the first thread decrementing the counter after a second thread incremented it, but before that second thread attempted to acquire the lock. This allows the first thread to empty the queues, decrement the counter to zero, and release the lock, after which the second thread acquires the lock successfully. Now, the second thread processes any elements in {{filterWriteEventQueue}}, and then processes any elements in {{messageReceivedEventQueue}}. If, in between these two checks, yet another thread adds a new element to {{filterWriteEventQueue}}, this element can go unprocessed (as the second thread does not loop, since the counter is zero or negative, and the third thread can fail to acquire the lock).
> It's a seemingly unlikely scenario, but we are observing the behavior when our systems are under high load.
> We've applied a code change after which this problem is no longer observed. We've removed the counter, and check the size of the queues instead:
> {code:java}
> void flushScheduledEvents() {
>     if (sslLock.tryLock()) {
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty());
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }
> {code}
> This code change, as illustrated above, does introduce a new potential problem. Theoretically, an event could be added to the queues, and {{flushScheduledEvents}} could see {{sslLock.tryLock()}} return {{false}}, exactly after another thread has just finished the {{while}} loop but before it releases the lock. This again would cause events to go unprocessed.
> We've not observed this problem in the wild yet, but we're uncomfortable applying this change as-is.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
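One way to close the remaining window described in the issue, sketched below as a hypothetical alternative (this is not the fix that was eventually merged): drain under the lock, release it, then re-check the queues and loop if a producer enqueued while the lock was held. The field names mirror the simplified snippets above but are otherwise illustrative:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class FlushSketch {
    private final ConcurrentLinkedQueue<String> filterWriteEventQueue = new ConcurrentLinkedQueue<>();
    private final ConcurrentLinkedQueue<String> messageReceivedEventQueue = new ConcurrentLinkedQueue<>();
    private final ReentrantLock sslLock = new ReentrantLock();
    final AtomicInteger processed = new AtomicInteger();

    void schedule(String event) {
        filterWriteEventQueue.add(event);
        flushScheduledEvents();
    }

    void flushScheduledEvents() {
        for (;;) {
            if (!sslLock.tryLock()) {
                return; // the lock holder re-checks the queues after unlocking
            }
            try {
                String event;
                while ((event = filterWriteEventQueue.poll()) != null) {
                    processed.incrementAndGet(); // stand-in for real event handling
                }
                while ((event = messageReceivedEventQueue.poll()) != null) {
                    processed.incrementAndGet();
                }
            } finally {
                sslLock.unlock();
            }
            // Re-check AFTER releasing the lock: a producer that failed
            // tryLock() while we held it may have enqueued meanwhile.
            if (filterWriteEventQueue.isEmpty() && messageReceivedEventQueue.isEmpty()) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        FlushSketch s = new FlushSketch();
        for (int i = 0; i < 5; i++) {
            s.schedule("evt-" + i);
        }
        System.out.println("processed = " + s.processed.get()); // prints "processed = 5"
    }
}
```

An event enqueued by a producer that loses {{tryLock()}} is always still visible when the winner re-checks after unlocking, so no enqueue can be stranded; the cost is a possible extra lock round-trip per contended flush.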
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844606#comment-16844606 ] Guus der Kinderen commented on DIRMINA-1107:

We've replaced our initial fix with what Jonathan provided. His fix appears to prevent the original issue from occurring. We're happy to close this issue with that fix. Thanks!