Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Thanks for looking further into this, Mark. We are running:

  java version "1.7.0_65"
  OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.12.04.2)
  OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
  Linux 3.11.0-15-generic #25~precise1-Ubuntu SMP Thu Jan 30 17:39:31 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
  Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz

The bug happens very seldom compared to the 100M text WebSocket messages
we handle a day - about once a week - so I guess it could be an external
problem such as the JVM. I have already added a temporary fix to our
Tomcat installation as mentioned earlier, but other users could also
suffer from this. How do we proceed?

Best regards,
Lars Engholm Johansen

On Wed, Nov 12, 2014 at 12:15 PM, Mark Thomas wrote:
> On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> > Hi Mark,
> >
> > I looked into our javax.websocket.Endpoint implementation and found
> > the following suspicious code:
> >
> > When we need to close the WebSocket session already in the .onOpen()
> > method (rejecting a connection), we are calling session.close()
> > asynchronously after 1 second via a java.util.Timer task.
> > This was due to bug
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> > which I can see was fixed a long time ago (thanks).
> >
> > Can this cause the selector's keyset to be accessed by more than one
> > thread?
>
> I don't see how.
>
> I've just double-checked the NIO Poller code and the only places the
> keyset is used are in Poller.run() and Poller.timeout() - both of which
> are only ever accessed from the Poller thread.
>
> I've also looked over the run() and timeout() methods and haven't yet
> found anything that could trigger this.
>
> There are multiple Pollers, but each Poller has a distinct set of
> sockets to manage.
>
> I'm beginning to wonder if there is a JVM bug here. Which JVM are you
> using?
>
> Mark
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> Hi Mark,
>
> I looked into our javax.websocket.Endpoint implementation and found the
> following suspicious code:
>
> When we need to close the WebSocket session already in the .onOpen()
> method (rejecting a connection), we are calling session.close()
> asynchronously after 1 second via a java.util.Timer task.
> This was due to bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> which I can see was fixed a long time ago (thanks).
>
> Can this cause the selector's keyset to be accessed by more than one
> thread?

I don't see how.

I've just double-checked the NIO Poller code and the only places the
keyset is used are in Poller.run() and Poller.timeout() - both of which
are only ever accessed from the Poller thread.

I've also looked over the run() and timeout() methods and haven't yet
found anything that could trigger this.

There are multiple Pollers, but each Poller has a distinct set of
sockets to manage.

I'm beginning to wonder if there is a JVM bug here. Which JVM are you
using?

Mark
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Hi Mark,

I looked into our javax.websocket.Endpoint implementation and found the
following suspicious code:

When we need to close the WebSocket session already in the .onOpen()
method (rejecting a connection), we are calling session.close()
asynchronously after 1 second via a java.util.Timer task.
This was due to bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
which I can see was fixed a long time ago (thanks).

Can this cause the selector's keyset to be accessed by more than one
thread?

Best regards,
Lars Engholm Johansen

On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas wrote:
> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> > Hi all,
> >
> > I have good news, as I have identified the reason for the devastating
> > NioEndpoint.Poller thread death:
> >
> > In rare circumstances a ConcurrentModificationException can occur in
> > the Poller's connection timeout handling, which is called from
> > OUTSIDE the try-catch(Throwable) of Poller.run().
>
> Any idea how? I've been looking through that code for some time now
> (this stack trace appears to be from 7.0.55, for those that want to
> look at this themselves) and I can't see anywhere where the selector's
> keyset is accessed by more than one thread.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
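The deferred close described in this message boils down to handing session.close() off to a java.util.Timer thread. A minimal sketch of that pattern, assuming a hypothetical CloseableSession stand-in for javax.websocket.Session so the example stays self-contained (the real WebSocket API is not used here):

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the pattern described above: instead of calling
// session.close() directly inside onOpen() (which used to trigger
// Tomcat bug 54716), the close is deferred onto a java.util.Timer
// thread. CloseableSession is a hypothetical stand-in interface.
public class DeferredCloseSketch {
    interface CloseableSession { void close(); }

    static Timer scheduleClose(CloseableSession session, long delayMillis) {
        Timer timer = new Timer(true); // daemon background thread
        timer.schedule(new TimerTask() {
            @Override public void run() {
                // Runs on the Timer thread, NOT the container thread -
                // which is why cross-thread selector access was suspected.
                session.close();
            }
        }, delayMillis);
        return timer;
    }

    // Demonstration helper: returns true once the deferred close fired.
    static boolean demo(long delayMillis) {
        CountDownLatch closed = new CountDownLatch(1);
        scheduleClose(closed::countDown, delayMillis);
        try {
            return closed.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }
}
```

The key observation for the thread is simply that the Timer task executes on its own thread, so any state it touches must be thread-safe.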
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> Hi all,
>
> I have good news, as I have identified the reason for the devastating
> NioEndpoint.Poller thread death:
>
> In rare circumstances a ConcurrentModificationException can occur in
> the Poller's connection timeout handling, which is called from OUTSIDE
> the try-catch(Throwable) of Poller.run():
>
> java.util.ConcurrentModificationException
>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
>         at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
>         at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
>         at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
>         at java.lang.Thread.run(Thread.java:745)
>
> Somehow the Poller's Selector object gets modified from another thread.

Any idea how? I've been looking through that code for some time now
(this stack trace appears to be from 7.0.55, for those that want to look
at this themselves) and I can't see anywhere where the selector's keyset
is accessed by more than one thread.

> As a remedy until this is fixed properly by the Tomcat team, I have
> added a try-catch(ConcurrentModificationException) surrounding the for
> loop in Poller.timeout().
> That way, in case of the rare problem, a full iteration of the Selector
> will be retried in the next call to Poller.timeout().

That seems like a reasonable work-around, but before we start making
changes to the Tomcat code I'd really like to understand the root
cause(s) of the issue, else we might not be fixing the actual issue and
could make it worse for some folks.

Mark

> I am really happy now, as all our production servers have been rock
> stable for two weeks now.
>
> Best regards to all,
> Lars Engholm Johansen
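The exception class in the stack trace above is easy to reproduce in isolation: Collections.unmodifiableMap() only blocks writes made through the wrapper, it does not make iteration safe against writes to the backing map. A minimal, illustrative demo (not Tomcat code):

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Minimal reproduction of the failure class seen in the stack trace:
// iterating an unmodifiable *view* of a HashMap's key set still trips
// the fail-fast iterator if the backing map is mutated mid-iteration.
public class CmeDemo {
    public static boolean triggersCme() {
        Map<String, Integer> backing = new HashMap<>();
        backing.put("a", 1);
        backing.put("b", 2);
        try {
            for (String key : Collections.unmodifiableMap(backing).keySet()) {
                backing.remove(key); // mutate the backing map mid-iteration
            }
            return false; // iteration completed without the exception
        } catch (ConcurrentModificationException expected) {
            return true;  // the fail-fast iterator detected the mutation
        }
    }
}
```

In the Tomcat case the mutation came from another thread rather than the iterating loop itself, but the iterator's fail-fast check is the same.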
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Hi all,

I have good news, as I have identified the reason for the devastating
NioEndpoint.Poller thread death:

In rare circumstances a ConcurrentModificationException can occur in the
Poller's connection timeout handling, which is called from OUTSIDE the
try-catch(Throwable) of Poller.run():

java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
        at java.util.HashMap$KeyIterator.next(HashMap.java:956)
        at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
        at java.lang.Thread.run(Thread.java:745)

Somehow the Poller's Selector object gets modified from another thread.

As a remedy until this is fixed properly by the Tomcat team, I have
added a try-catch(ConcurrentModificationException) surrounding the for
loop in Poller.timeout().
That way, in case of the rare problem, a full iteration of the Selector
will be retried in the next call to Poller.timeout().

I am really happy now, as all our production servers have been rock
stable for two weeks now.

Best regards to all,
Lars Engholm Johansen

On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik wrote:
> Thanks Lars, if you are indeed experiencing a non-caught error, let us
> know what it is.
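The work-around described above can be sketched roughly as follows. This is a simplified stand-in, not the real NioEndpoint.Poller code: the keys map, the expiry values, and the register() method are hypothetical placeholders that only model the shape of the timeout sweep.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the Poller.timeout() work-around: if the key set
// is mutated mid-iteration, abandon this sweep instead of letting the
// exception propagate and kill the poller thread; the next timeout()
// call re-iterates everything from scratch.
public class PollerTimeoutSketch {
    private final Map<String, Long> keys = new HashMap<>(); // hypothetical "selector keys"

    public void register(String socket, long expiryMillis) {
        keys.put(socket, expiryMillis);
    }

    /** Returns true if the sweep completed, false if aborted for retry. */
    public boolean timeout(long now) {
        try {
            for (String socket : Collections.unmodifiableMap(keys).keySet()) {
                if (keys.get(socket) <= now) {
                    // the real code would cancel the key / close the socket here
                }
            }
            return true;
        } catch (ConcurrentModificationException cme) {
            // Rare concurrent mutation: skip this sweep and retry next call.
            return false;
        }
    }
}
```

As Mark notes in his reply, this treats the symptom (a dead poller thread) rather than the root cause (whoever is mutating the key set), which is why it was proposed only as a stop-gap.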
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Thanks Lars, if you are indeed experiencing a non-caught error, let us
know what it is.

On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen wrote:
> Thanks guys for all the feedback.
>
> I have tried the following suggested tasks:
>
>  - Upgrading Tomcat to the newest 7.0.55 on all our servers -> problem
>    still persists
>  - Forcing a System.gc() when the connection count is on the loose ->
>    connection count is not dropping
>  - Lowering the log level of the NioEndpoint class that contains the
>    Poller code -> no info about why the poller thread exits in any
>    Tomcat logs
>  - Reverting the JVM stack size per thread to the default, as discussed
>    previously -> problem still persists
>
> I have now checked out the NioEndpoint source code and recompiled it
> with a logging try-catch surrounding the whole of the Poller.run()
> implementation, as I noticed that the outer try-catch here only catches
> OOME.
> I will report back with my findings as soon as the problem arises
> again.
>
> /Lars
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Thanks guys for all the feedback.

I have tried the following suggested tasks:

 - Upgrading Tomcat to the newest 7.0.55 on all our servers -> problem
   still persists
 - Forcing a System.gc() when the connection count is on the loose ->
   connection count is not dropping
 - Lowering the log level of the NioEndpoint class that contains the
   Poller code -> no info about why the poller thread exits in any
   Tomcat logs
 - Reverting the JVM stack size per thread to the default, as discussed
   previously -> problem still persists

I have now checked out the NioEndpoint source code and recompiled it
with a logging try-catch surrounding the whole of the Poller.run()
implementation, as I noticed that the outer try-catch here only catches
OOME.
I will report back with my findings as soon as the problem arises again.

/Lars

On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz wrote:
> Filip,
>
> On 6/27/14, 11:36 AM, Filip Hanik wrote:
> > Are there any log entries that would indicate that the poller
> > thread has died? This/these thread/s start when Tomcat starts, and
> > a stack overflow on a processing thread should never affect the
> > poller thread.
>
> The OP reported in the initial post that the thread had disappeared:
>
> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> > We have no output in Tomcat or our logs at the time when this event
> > occurs. The only sign is when comparing a full Java thread dump with
> > a dump from a newly launched Tomcat:
> >
> > One of http-nio-80-ClientPoller-0 or http-nio-80-ClientPoller-1
> > is missing/has died.
>
> -chris
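The diagnostic recompile described above amounts to wrapping the whole poller loop in a try-catch(Throwable) so that anything escaping the loop body is at least logged before the thread dies, instead of vanishing silently. A rough, generic sketch of that change - the loop body here is a placeholder, not the real NioEndpoint.Poller logic:

```java
// Generic model of a poller thread whose outer catch handles any
// Throwable (not just OutOfMemoryError, as the stock code did at the
// time), so a fatal error is recorded instead of disappearing with
// the thread.
public class LoggingPollerSketch implements Runnable {
    private volatile boolean running = true;
    volatile Throwable lastFatal; // captured so the cause can be inspected

    void stop() { running = false; }

    protected void pollOnce() { /* placeholder for select()/timeout() work */ }

    @Override public void run() {
        try {
            while (running) {
                pollOnce();
            }
        } catch (Throwable t) {
            lastFatal = t;
            // A real server would use its logging framework here.
            System.err.println("Poller thread dying, cause: " + t);
        }
    }
}
```

With this in place, a previously silent ConcurrentModificationException (or anything else) leaves a trace in the logs, which is exactly how the root cause was eventually pinned down later in the thread.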
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Filip,

On 6/27/14, 11:36 AM, Filip Hanik wrote:
> Are there any log entries that would indicate that the poller
> thread has died? This/these thread/s start when Tomcat starts, and
> a stack overflow on a processing thread should never affect the
> poller thread.

The OP reported in the initial post that the thread had disappeared:

On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> We have no output in Tomcat or our logs at the time when this event
> occurs. The only sign is when comparing a full Java thread dump with
> a dump from a newly launched Tomcat:
>
> One of http-nio-80-ClientPoller-0 or http-nio-80-ClientPoller-1
> is missing/has died.

-chris
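The symptom described here - a ClientPoller thread silently missing from a thread dump - can also be watched for from inside the JVM rather than by diffing dumps by hand. A sketch of such a check; the "http-nio-80-ClientPoller" prefix matches the thread names quoted in this thread, but the expected count is deployment-specific (it follows the connector's poller thread configuration), so treat both as assumptions:

```java
// Sketch of an in-process watchdog for the missing-poller symptom:
// scan all live threads for a given name prefix and compare against
// the number of pollers the deployment is expected to have.
public class PollerWatchdog {
    static long countThreads(String namePrefix) {
        long n = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                n++;
            }
        }
        return n;
    }

    static boolean pollersAlive(String namePrefix, int expected) {
        return countThreads(namePrefix) >= expected;
    }

    // Demonstration helper: spawns a thread with a poller-like name and
    // verifies the watchdog sees it while it is alive.
    static boolean demo() {
        Thread t = new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
        }, "http-nio-80-ClientPoller-0");
        t.setDaemon(true);
        t.start();
        boolean alive = pollersAlive("http-nio-80-ClientPoller", 1);
        try { t.join(); } catch (InterruptedException ignored) {}
        return alive;
    }
}
```

Run periodically (e.g. from a monitoring thread), a check like this would have flagged the dead poller long before the connection count exploded.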
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Are there any log entries that would indicate that the poller thread has died? This/these thread/s start when Tomcat starts. and a stack over flow on a processing thread should never affect the poller thread. Filip On Thu, Jun 26, 2014 at 4:01 PM, Christopher Schultz < ch...@christopherschultz.net> wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > André, > > On 6/26/14, 11:09 AM, André Warnier wrote: > > Lars Engholm Johansen wrote: > >> Thanks for all the replies guys. > >> > >> Have you observed a performance increase by setting > >>> acceptorThreadCount to 4 instead of a lower number? I'm just > >>> curious. > >> > >> > >> No, but this was the consensus after elongated discussions in my > >> team. We have 12 cpu cores - better save than sorry. I know that > >> the official docs reads "although you would never really need > >> more than 2" :-) > >> > >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT > >>> connections in netstat output, in case if those are owned by > >>> some abandoned and non properly closed I/O classes that are > >>> still present in JVM memory. > >> > >> > >> Please check out the "open connections" graph at > >> http://imgur.com/s4fOUte As far as I interpret, we only have a > >> slight connection count growth during the days until the poller > >> thread die. These may or may not disappear by forcing a GC, but > >> the amount is not problematic until we hit the > >> http-nio-80-ClientPoller-x thread death. > > > > Just to make sure : what kind of connections does this graph > > actually show ? in which TCP state ? does it count only the > > "established", or also the "FIN_WAIT", "CLOSE_WAIT", "LISTEN" etc.. > > ? > > I think the state of the connections is a red herring: Tomcat will > hold those connections forever because the poller thread has died. > Nothing else matters. > > Even if the CLOSE_WAIT connections were somehow cleared, Tomcat would > never respond properly to another request, ever. 
A Tomcat restart is > required if the poller thread dies. > > One could argue that the poller threads should maybe "try harder" not > to die, but sometimes you can't stop thread death. > > -chris
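Since Filip asks whether any log entry flags the death, one way to make a dead poller visible without waiting for the connection count to explode is an in-process check. A minimal sketch, assuming it runs inside the Tomcat JVM; the watchdog class is illustrative and the thread-name prefix is the one quoted in this thread (adjust it to your connector):

```java
// Hypothetical watchdog: scan all live threads for the NIO poller by name.
// If no thread with the prefix is alive, the poller has died and the
// instance should be alerted on or restarted.
public class PollerWatchdog {
    static boolean pollerAlive(String namePrefix) {
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(namePrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Run this from a scheduled task inside the Tomcat JVM so the scan
        // sees Tomcat's own threads; alert when it turns false.
        System.out.println("poller alive: "
                + pollerAlive("http-nio-80-ClientPoller-"));
    }
}
```

Scheduled every minute alongside the existing health metrics, this would pinpoint the moment of thread death instead of inferring it from the connection graph.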
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
André, On 6/26/14, 11:09 AM, André Warnier wrote: > Lars Engholm Johansen wrote: >> Thanks for all the replies guys. >> >> Have you observed a performance increase by setting >>> acceptorThreadCount to 4 instead of a lower number? I'm just >>> curious. >> >> >> No, but this was the consensus after elongated discussions in my >> team. We have 12 cpu cores - better safe than sorry. I know that >> the official docs read "although you would never really need >> more than 2" :-) >> >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT >>> connections in netstat output, in case if those are owned by >>> some abandoned and non properly closed I/O classes that are >>> still present in JVM memory. >> >> >> Please check out the "open connections" graph at >> http://imgur.com/s4fOUte As far as I interpret, we only have a >> slight connection count growth during the days until the poller >> thread dies. These may or may not disappear by forcing a GC, but >> the amount is not problematic until we hit the >> http-nio-80-ClientPoller-x thread death. > > Just to make sure : what kind of connections does this graph > actually show ? in which TCP state ? does it count only the > "established", or also the "FIN_WAIT", "CLOSE_WAIT", "LISTEN" etc.. > ? I think the state of the connections is a red herring: Tomcat will hold those connections forever because the poller thread has died. Nothing else matters. Even if the CLOSE_WAIT connections were somehow cleared, Tomcat would never respond properly to another request, ever. A Tomcat restart is required if the poller thread dies. One could argue that the poller threads should maybe "try harder" not to die, but sometimes you can't stop thread death. 
-chris
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Lars, On 6/26/14, 9:56 AM, Lars Engholm Johansen wrote: > Thanks for all the replies guys. > > Have you observed a performance increase by setting >> acceptorThreadCount to 4 instead of a lower number? I'm just >> curious. > > > No, but this was the consensus after elongated discussions in my > team. We have 12 cpu cores - better safe than sorry. I know that > the official docs read "although you would never really need more > than 2" :-) Okay. You might want to do some actual benchmarking. You may find that more contention for an exclusive lock actually /decreases/ performance. >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT >> connections in netstat output, in case if those are owned by >> some abandoned and non properly closed I/O classes that are still >> present in JVM memory. > > > Please check out the "open connections" graph at > http://imgur.com/s4fOUte As far as I interpret, we only have a > slight connection count growth during the days until the poller > thread dies. These may or may not disappear by forcing a GC, but the > amount is not problematic until we hit the > http-nio-80-ClientPoller-x thread death. Like I said, when the poller thread(s) die, you are totally screwed. > The insidious part is that everything may look fine for a long time > (apart >> from an occasional long list of CLOSE_WAIT connections). A GC >> will happen from time to time (*), which will get rid of these >> connections. And those CLOSE_WAIT connections do not consume a >> lot of resources, so you'll never notice. Until at some point, >> the number of these CLOSE_WAIT connections gets just at the point >> where the OS can't swallow any more of them, and then you have a >> big problem. (*) and this is the "insidious squared" part : the >> smaller the Heap, the more often a GC will happen, so the sooner >> these CLOSE_WAIT connections will disappear. 
Conversely, by >> increasing the Heap size, you leave more time between GCs, and >> make the problem more likely to happen. > > > You are correct. The bigger the Heap size the rarer a GC will > happen - and we have set aside 32GiB of RAM. But again, referring > to my "connection count" graph, a missing close in the code does > not seem to be the culprit. > > A critical error (java.lang.ThreadDeath, >> java.lang.VirtualMachineError) will cause death of a thread. A >> subtype of the latter is java.lang.OutOfMemoryError. > > > I just realized that StackOverflowError is also a subclass of > VirtualMachineError, and remembered that we due to company > historical reasons had configured the JVM stack size to 256KiB > (down from the default 1GiB on 64 bit machines). This was to > support a huge number of threads on limited memory in the past. I > have now removed the -Xss jvm parameter and am excited to see if this > solves our poller thread problems. Thanks for the hint, > Konstantin. Definitely let us know. A StackOverflowError should be relatively rare, but if you have set your stack size to something very low, this can happen. Remember, since you are using the NIO connector, you don't need a huge number of threads to support a huge number of connections. The stack size relates to the number of threads you want to have active. It looks like you haven't set the number of request-processor threads, so you'll get the default value of 200. The default stack size for Oracle's HotSpot JVM is 1MiB, not 1GiB. 200MiB in a 64-bit heap shouldn't be too much for your JVM, and will hopefully cut down on your stack problems. Do you have a lot of recursive algorithms, or anything that calls very-deep stacks? 
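The effect of a small stack can be reproduced without restarting with a different -Xss: the four-argument Thread constructor takes a per-thread stackSize hint (which the JVM is free to ignore). A rough sketch; the class name and the exact depths are illustrative:

```java
// Sketch: a smaller per-thread stack means a shallower recursion limit --
// the same way a low -Xss bounds every request-processing thread.
public class StackDepthDemo {
    static int depth;

    static void recurse() {
        depth++;
        recurse();  // recurses until the thread's stack is exhausted
    }

    static int maxDepth(long stackSizeBytes) throws InterruptedException {
        depth = 0;
        Thread t = new Thread(null, () -> {
            try {
                recurse();
            } catch (StackOverflowError expected) {
                // Without this catch the thread would simply die --
                // exactly the failure mode discussed for the poller.
            }
        }, "stack-demo", stackSizeBytes);
        t.start();
        t.join();
        return depth;
    }

    public static void main(String[] args) throws InterruptedException {
        // Exact numbers vary by JVM; the small stack reaches much less depth.
        System.out.println("256KiB stack: depth " + maxDepth(256 * 1024));
        System.out.println("1MiB stack:   depth " + maxDepth(1024 * 1024));
    }
}
```

Running this under the old 256KiB setting versus the default makes it easy to judge whether the application's deepest call chains were anywhere near the limit.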
-chris
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Lars Engholm Johansen wrote: Thanks for all the replies guys. Have you observed a performance increase by setting acceptorThreadCount to 4 instead of a lower number? I'm just curious. No, but this was the consensus after elongated discussions in my team. We have 12 cpu cores - better safe than sorry. I know that the official docs read "although you would never really need more than 2" :-) The GC that Andre suggested was to get rid of some of CLOSE_WAIT connections in netstat output, in case if those are owned by some abandoned and non properly closed I/O classes that are still present in JVM memory. Please check out the "open connections" graph at http://imgur.com/s4fOUte As far as I interpret, we only have a slight connection count growth during the days until the poller thread dies. These may or may not disappear by forcing a GC, but the amount is not problematic until we hit the http-nio-80-ClientPoller-x thread death. Just to make sure : what kind of connections does this graph actually show ? in which TCP state ? does it count only the "established", or also the "FIN_WAIT", "CLOSE_WAIT", "LISTEN" etc.. ? The insidious part is that everything may look fine for a long time (apart from an occasional long list of CLOSE_WAIT connections). A GC will happen from time to time (*), which will get rid of these connections. And those CLOSE_WAIT connections do not consume a lot of resources, so you'll never notice. Until at some point, the number of these CLOSE_WAIT connections gets just at the point where the OS can't swallow any more of them, and then you have a big problem. (*) and this is the "insidious squared" part : the smaller the Heap, the more often a GC will happen, so the sooner these CLOSE_WAIT connections will disappear. Conversely, by increasing the Heap size, you leave more time between GCs, and make the problem more likely to happen. You are correct. The bigger the Heap size the rarer a GC will happen - and we have set aside 32GiB of RAM. 
But again, referring to my "connection count" graph, a missing close in the code does not seem to be the culprit. A critical error (java.lang.ThreadDeath, java.lang.VirtualMachineError) will cause death of a thread. A subtype of the latter is java.lang.OutOfMemoryError. I just realized that StackOverflowError is also a subclass of VirtualMachineError, and remembered that we due to company historical reasons had configured the JVM stack size to 256KiB (down from the default 1GiB on 64 bit machines). This was to support a huge number of threads on limited memory in the past. I have now removed the -Xss jvm parameter and am excited to see if this solves our poller thread problems. Thanks for the hint, Konstantin. I promise to report back to you guys :-) On Fri, Jun 20, 2014 at 2:49 AM, Filip Hanik wrote: "Our sites still functions normally with no cpu spikes during this build up until around 60,000 connections, but then the server refuses further connections and a manual Tomcat restart is required." yes, the connection limit is a 16 bit short count minus some reserved addresses. So your system should become unresponsive, you've run out of ports (the 16 bit value in a TCP connection). netstat -na should give you your connection state when this happens, and that is helpful debug information. Filip On Thu, Jun 19, 2014 at 2:44 PM, André Warnier wrote: Konstantin Kolinko wrote: 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen : I will try to force a GC next time I am at the console about to restart a Tomcat where one of the http-nio-80-ClientPoller-x threads have died and connection count is exploding. But I do not see this as a solution - can you somehow deduce why this thread died from the outcome from a GC? Nobody said that a thread died because of GC. The GC that Andre suggested was to get rid of some of CLOSE_WAIT connections in netstat output, in case if those are owned by some abandoned and non properly closed I/O classes that are still present in JVM memory. 
Exactly, thanks Konstantin for clarifying. I was going per the following in the original post : "Our sites still functions normally with no cpu spikes during this build up until around 60,000 connections, but then the server refuses further connections and a manual Tomcat restart is required." CLOSE_WAIT is a normal state for a TCP connection, but it should not normally last long. It indicates basically that the other side has closed the connection, and that this side should do the same. But it doesn't, and as long as it doesn't the connection remains in the CLOSE_WAIT state. It's like "half-closed", but not entirely, and as long as it isn't, the OS cannot get rid of it. For a more precise explanation, Google for "TCP CLOSE_WAIT state". I have noticed in the past, with some Linux versions, that when the number of such CLOSE_WAIT connections goes above a certain level (several hundred), the TCP/IP stack can become totally unresponsive and not
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Thanks for all the replies guys. Have you observed a performance increase by setting > acceptorThreadCount to 4 instead of a lower number? I'm just curious. No, but this was the consensus after elongated discussions in my team. We have 12 cpu cores - better safe than sorry. I know that the official docs read "although you would never really need more than 2" :-) The GC that Andre suggested was to get rid of some of CLOSE_WAIT > connections in netstat output, in case if those are owned by some > abandoned and non properly closed I/O classes that are still present > in JVM memory. Please check out the "open connections" graph at http://imgur.com/s4fOUte As far as I interpret, we only have a slight connection count growth during the days until the poller thread dies. These may or may not disappear by forcing a GC, but the amount is not problematic until we hit the http-nio-80-ClientPoller-x thread death. The insidious part is that everything may look fine for a long time (apart > from an occasional long list of CLOSE_WAIT connections). A GC will happen > from time to time (*), which will get rid of these connections. And those > CLOSE_WAIT connections do not consume a lot of resources, so you'll never > notice. > Until at some point, the number of these CLOSE_WAIT connections gets just > at the point where the OS can't swallow any more of them, and then you have > a big problem. > (*) and this is the "insidious squared" part : the smaller the Heap, the > more often a GC will happen, so the sooner these CLOSE_WAIT connections > will disappear. Conversely, by increasing the Heap size, you leave more > time between GCs, and make the problem more likely to happen. You are correct. The bigger the Heap size the rarer a GC will happen - and we have set aside 32GiB of RAM. But again, referring to my "connection count" graph, a missing close in the code does not seem to be the culprit. 
A critical error (java.lang.ThreadDeath, > java.lang.VirtualMachineError) will cause death of a thread. > A subtype of the latter is java.lang.OutOfMemoryError. I just realized that StackOverflowError is also a subclass of VirtualMachineError, and remembered that we due to company historical reasons had configured the JVM stack size to 256KiB (down from the default 1GiB on 64 bit machines). This was to support a huge number of threads on limited memory in the past. I have now removed the -Xss jvm parameter and am excited to see if this solves our poller thread problems. Thanks for the hint, Konstantin. I promise to report back to you guys :-) On Fri, Jun 20, 2014 at 2:49 AM, Filip Hanik wrote: > "Our sites still functions normally with no cpu spikes during this build up > until around 60,000 connections, but then the server refuses further > connections and a manual Tomcat restart is required." > > yes, the connection limit is a 16 bit short count minus some reserved > addresses. So your system should become unresponsive, you've run out of > ports (the 16 bit value in a TCP connection). > > netstat -na should give you your connection state when this happens, and > that is helpful debug information. > > Filip > > > > > On Thu, Jun 19, 2014 at 2:44 PM, André Warnier wrote: > > > Konstantin Kolinko wrote: > > > >> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen : > >> > >>> I will try to force a GC next time I am at the console about to > restart a > >>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died > and > >>> connection count is exploding. > >>> > >>> But I do not see this as a solution - can you somehow deduce why this > >>> thread died from the outcome from a GC? > >>> > >> > >> Nobody said that a thread died because of GC. 
> >> > >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT > >> connections in netstat output, in case if those are owned by some > >> abandoned and non properly closed I/O classes that are still present > >> in JVM memory. > >> > > > > Exactly, thanks Konstantin for clarifying. > > > > I was going per the following in the original post : > > > > "Our sites still functions normally with no cpu spikes during this build > up > > until around 60,000 connections, but then the server refuses further > > connections and a manual Tomcat restart is required." > > > > CLOSE_WAIT is a normal state for a TCP connection, but it should not > > normally last long. > > It indicates basically that the other side has closed the connection, and > > that this side should do the same. But it doesn't, and as long as it > > doesn't the connection remains in the CLOSE_WAIT state. It's like > > "half-closed", but not entirely, and as long as it isn't, the OS cannot > get > > rid of it. > > For a more precise explanation, Google for "TCP CLOSE_WAIT state". > > > > I have noticed in the past, with some Linux versions, that when the > number > > of such CLOSE_WAIT connections goes above a certain level (several > > hundred), the TCP/IP stack can become totally unresponsive and not accept
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
"Our sites still functions normally with no cpu spikes during this build up until around 60,000 connections, but then the server refuses further connections and a manual Tomcat restart is required." yes, the connection limit is a 16 bit short count minus some reserved addresses. So your system should become unresponsive, you've run out of ports (the 16 bit value in a TCP connection). netstat -na should give you your connection state when this happens, and that is helpful debug information. Filip On Thu, Jun 19, 2014 at 2:44 PM, André Warnier wrote: > Konstantin Kolinko wrote: > >> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen : >> >>> I will try to force a GC next time I am at the console about to restart a >>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died and >>> connection count is exploding. >>> >>> But I do not see this as a solution - can you somehow deduct why this >>> thread died from the outcome from a GC? >>> >> >> Nobody said that a thread died because of GC. >> >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT >> connections in netstat output, in case if those are owned by some >> abandoned and non properly closed I/O classes that are still present >> in JVM memory. >> > > Exactly, thanks Konstantin for clarifying. > > I was going per the following in the original post : > > "Our sites still functions normally with no cpu spikes during this build up > until around 60,000 connections, but then the server refuses further > connections and a manual Tomcat restart is required." > > CLOSE_WAIT is a normal state for a TCP connection, but it should not > normally last long. > It indicates basically that the other side has closed the connection, and > that this side should do the same. But it doesn't, and as long as it > doesn't the connection remains in the CLOSE_WAIT state. It's like > "half-closed", but not entirely, and as long as it isn't, the OS cannot get > rid of it. 
> For a more precise explanation, Google for "TCP CLOSE_WAIT state". > > I have noticed in the past, with some Linux versions, that when the number > of such CLOSE_WAIT connections goes above a certain level (several > hundred), the TCP/IP stack can become totally unresponsive and not accept > any new connections at all, on any port. > In my case, this was due to the following kind of scenario : > Some class Xconnection instantiates an object, and upon creation this > object opens a TCP connection to something. This object is now used as an > "alias" for this connection. Time passes, and finally the object goes out > of scope (e.g. the reference to it is set to "null"), and one may believe > that the underlying connection gets closed as a side-effect. But it > doesn't, not as long as this object is not actually garbage-collected, > which triggers the actual object destruction and the closing of the > underlying connection. > Forcing a GC is a way to provoke this (and restarting Tomcat another, but > more drastic). > > If a forced GC gets rid of your many CLOSE_WAIT connections and makes your > Tomcat operative again, that would be a sign that something similar to the > above is occurring; and then you would need to look in your application for > the oversight. (e.g. the class should have a "close" method (closing the > underlying connection), which should be invoked before letting the object > go out of scope). > > The insidious part is that everything may look fine for a long time (apart > from an occasional long list of CLOSE_WAIT connections). A GC will happen > from time to time (*), which will get rid of these connections. And those > CLOSE_WAIT connections do not consume a lot of resources, so you'll never > notice. > Until at some point, the number of these CLOSE_WAIT connections gets just > at the point where the OS can't swallow any more of them, and then you have > a big problem. > > That sounds a bit like your case, doesn't it ? 
> > (*) and this is the "insidious squared" part : the smaller the Heap, the > more often a GC will happen, so the sooner these CLOSE_WAIT connections > will disappear. Conversely, by increasing the Heap size, you leave more > time between GCs, and make the problem more likely to happen. > > > I believe that the rest below may be either a consequence, or a red > herring, and I would first eliminate the above as a cause. > > > >> And could an Exception/Error in Tomcat thread http-nio-80-ClientPoller-0 >>> or http-nio-80-ClientPoller-1 make the thread die with no Stacktrace >>> in >>> the Tomcat logs? >>> >>> >> A critical error (java.lang.ThreadDeath, >> java.lang.VirtualMachineError) will cause death of a thread. >> >> A subtype of the latter is java.lang.OutOfMemoryError. >> >> As of now, such errors are passed through and are not logged by >> Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException(). >> ThreadGroup prints them to System.err (catalina.out). >> >> >> Best regards, >> Konstantin Kolinko >> >>
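The leak pattern André describes can be sketched in a few lines. The Xconnection name comes from his message; the wrapper shape and the try-with-resources fix are illustrative, not taken from the application under discussion:

```java
import java.io.IOException;
import java.net.Socket;

// A wrapper that "aliases" a TCP connection. If close() is never invoked,
// the socket is only released when the object is garbage-collected, so a
// peer-closed connection can sit in CLOSE_WAIT until the next GC.
class Xconnection implements AutoCloseable {
    final Socket socket;

    Xconnection(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void close() throws IOException {
        socket.close();  // deterministic release; no GC required
    }
}

public class CloseWaitDemo {
    public static void main(String[] args) throws IOException {
        // Leaky pattern: drop the reference and hope the collector cleans up.
        //   Xconnection leaked = new Xconnection(openConnection());
        //   leaked = null;  // socket lingers until finalization

        // Fix: try-with-resources closes the socket when the scope ends.
        Socket s = new Socket();  // unconnected socket, just for the demo
        try (Xconnection conn = new Xconnection(s)) {
            // ... use the connection ...
        }
        System.out.println("socket closed: " + s.isClosed());  // true
    }
}
```

If a forced GC clears the CLOSE_WAIT backlog, the application almost certainly has a wrapper like the leaky variant somewhere, and the fix is to close it explicitly before the reference goes out of scope.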
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Konstantin Kolinko wrote: 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen : I will try to force a GC next time I am at the console about to restart a Tomcat where one of the http-nio-80-ClientPoller-x threads have died and connection count is exploding. But I do not see this as a solution - can you somehow deduce why this thread died from the outcome from a GC? Nobody said that a thread died because of GC. The GC that Andre suggested was to get rid of some of CLOSE_WAIT connections in netstat output, in case if those are owned by some abandoned and non properly closed I/O classes that are still present in JVM memory. Exactly, thanks Konstantin for clarifying. I was going per the following in the original post : "Our sites still functions normally with no cpu spikes during this build up until around 60,000 connections, but then the server refuses further connections and a manual Tomcat restart is required." CLOSE_WAIT is a normal state for a TCP connection, but it should not normally last long. It indicates basically that the other side has closed the connection, and that this side should do the same. But it doesn't, and as long as it doesn't the connection remains in the CLOSE_WAIT state. It's like "half-closed", but not entirely, and as long as it isn't, the OS cannot get rid of it. For a more precise explanation, Google for "TCP CLOSE_WAIT state". I have noticed in the past, with some Linux versions, that when the number of such CLOSE_WAIT connections goes above a certain level (several hundred), the TCP/IP stack can become totally unresponsive and not accept any new connections at all, on any port. In my case, this was due to the following kind of scenario : Some class Xconnection instantiates an object, and upon creation this object opens a TCP connection to something. This object is now used as an "alias" for this connection. Time passes, and finally the object goes out of scope (e.g. 
the reference to it is set to "null"), and one may believe that the underlying connection gets closed as a side-effect. But it doesn't, not as long as this object is not actually garbage-collected, which triggers the actual object destruction and the closing of the underlying connection. Forcing a GC is a way to provoke this (and restarting Tomcat another, but more drastic). If a forced GC gets rid of your many CLOSE_WAIT connections and makes your Tomcat operative again, that would be a sign that something similar to the above is occurring; and then you would need to look in your application for the oversight. (e.g. the class should have a "close" method (closing the underlying connection), which should be invoked before letting the object go out of scope). The insidious part is that everything may look fine for a long time (apart from an occasional long list of CLOSE_WAIT connections). A GC will happen from time to time (*), which will get rid of these connections. And those CLOSE_WAIT connections do not consume a lot of resources, so you'll never notice. Until at some point, the number of these CLOSE_WAIT connections gets just at the point where the OS can't swallow any more of them, and then you have a big problem. That sounds a bit like your case, doesn't it ? (*) and this is the "insidious squared" part : the smaller the Heap, the more often a GC will happen, so the sooner these CLOSE_WAIT connections will disappear. Conversely, by increasing the Heap size, you leave more time between GCs, and make the problem more likely to happen. I believe that the rest below may be either a consequence, or a red herring, and I would first eliminate the above as a cause. And could an Exception/Error in Tomcat thread http-nio-80-ClientPoller-0 or http-nio-80-ClientPoller-1 make the thread die with no Stacktrace in the Tomcat logs? A critical error (java.lang.ThreadDeath, java.lang.VirtualMachineError) will cause death of a thread. 
A subtype of the latter is java.lang.OutOfMemoryError. As of now, such errors are passed through and are not logged by Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException(). ThreadGroup prints them to System.err (catalina.out). Best regards, Konstantin Kolinko
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen : > I will try to force a GC next time I am at the console about to restart a > Tomcat where one of the http-nio-80-ClientPoller-x threads have died and > connection count is exploding. > > But I do not see this as a solution - can you somehow deduce why this > thread died from the outcome from a GC? Nobody said that a thread died because of GC. The GC that Andre suggested was to get rid of some of CLOSE_WAIT connections in netstat output, in case if those are owned by some abandoned and non properly closed I/O classes that are still present in JVM memory. > And could an Exception/Error in Tomcat thread http-nio-80-ClientPoller-0 > or http-nio-80-ClientPoller-1 make the thread die with no Stacktrace in > the Tomcat logs? > A critical error (java.lang.ThreadDeath, java.lang.VirtualMachineError) will cause death of a thread. A subtype of the latter is java.lang.OutOfMemoryError. As of now, such errors are passed through and are not logged by Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException(). ThreadGroup prints them to System.err (catalina.out). Best regards, Konstantin Kolinko
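Since such Errors only surface through ThreadGroup's default print to System.err, one way to capture them more visibly is a default uncaught-exception handler installed by the application. A sketch; the handler format and thread names are illustrative:

```java
// Sketch: log Errors that kill a thread explicitly, instead of relying on
// the default ThreadGroup.uncaughtException() print to System.err, which
// is easy to miss in catalina.out.
public class ThreadDeathLogger {
    public static void main(String[] args) throws InterruptedException {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
                System.err.println("[FATAL] thread " + thread.getName()
                        + " died: " + error));

        // Simulate the kind of critical Error that silently kills a thread.
        Thread fakePoller = new Thread(() -> {
            throw new OutOfMemoryError("simulated");
        }, "fake-ClientPoller-0");
        fakePoller.start();
        fakePoller.join();
    }
}
```

The handler only fires for throwables that escape run(); it does not keep the thread alive, so it helps diagnosis, not recovery.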
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Lars, On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote: > Our sites can run for days without problems, but once in a while > the tomcat connection count suddenly starts growing abnormally > fast. See this graph: http://imgur.com/s4fOUte netstat shows these > TCP connections to be mostly in CLOSE_WAIT state. > > Our sites still function normally with no cpu spikes during this > build up until around 60,000 connections, but then the server > refuses further connections and a manual Tomcat restart is > required. > > We have no output in tomcat or our logs at the time when this event > occurs. The only sign is when comparing full java thread dump with > a dump from a newly launched Tomcat: One of > http-nio-80-ClientPoller-0 or http-nio-80-ClientPoller-1 is > missing/has died. If the poller thread has died, you are completely SOL. Is there no indication in any log why the poller thread died? Particularly, check catalina.out for OOME or an exception that might kill such a thread. Can you configure your logger to increase the log level for the poller component? Actually, the NIO Poller only contains log.error messages except for two debug-level messages that indicate encountering JDK 1.5 bugs, so changing the log level probably won't help. The only other thing that looks like it might cause complications would be an OOME, which should definitely show up in the logs. > Our connector configuration: acceptorThreadCount="4" Have you observed a performance increase by setting acceptorThreadCount to 4 instead of a lower number? I'm just curious. 
-chris
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
I will try to force a GC next time I am at the console, about to restart
a Tomcat where one of the http-nio-80-ClientPoller-x threads has died
and the connection count is exploding. But I do not see this as a
solution - can you somehow deduce why the thread died from the outcome
of a GC?

And could an Exception/Error in Tomcat thread http-nio-80-ClientPoller-0
or http-nio-80-ClientPoller-1 make the thread die with no stack trace in
the Tomcat logs?

(This problem is quite critical for our production environment.)

On Mon, Jun 16, 2014 at 5:56 PM, André Warnier wrote:

> Lars Engholm Johansen wrote:
>
>> Our company runs several high-volume Tomcat 7.0.52 production servers
>> on Ubuntu 12.04. We use Tomcat WebSockets (the JSR 356 implementation)
>> heavily, with 100M text messages (100 GiB) per day.
>>
>> We monitor web server health by measuring several key parameters
>> every minute, including the Tomcat connection count using:
>>
>>     mBeanServer.getAttribute(threadPool, "connectionCount");
>>     // threadPool is an MBean of "type=ThreadPool"
>>
>> __The problem__
>>
>> Our sites can run for days without problems, but once in a while the
>> Tomcat connection count suddenly starts growing abnormally fast.
>> See this graph: http://imgur.com/s4fOUte
>> netstat shows these TCP connections to be mostly in CLOSE_WAIT state.
>
> And if at that moment you force the JVM that runs Tomcat to do a
> garbage collection, do you still have these numerous connections in
> CLOSE_WAIT state after the GC completes?
>
>> Our sites still function normally, with no CPU spikes during this
>> build-up, until around 60,000 connections, but then the server
>> refuses further connections and a manual Tomcat restart is required.
>>
>> We have no output in Tomcat or our logs at the time when this event
>> occurs.
>> The only sign is when comparing a full Java thread dump with a dump
>> from a newly launched Tomcat: one of http-nio-80-ClientPoller-0 or
>> http-nio-80-ClientPoller-1 is missing/has died.
>>
>> We have observed this problem at least since Tomcat 7.0.48 and cannot
>> find indications in the Tomcat 7.0.x change logs that it has been
>> fixed in newer releases.
>>
>> Any help or advice is appreciated.
>> Best regards,
>> Lars Engholm Johansen
>>
>> Our connector configuration:
>>
>> <Connector
>>     acceptCount="1500"
>>     acceptorThreadCount="4"
>>     asyncTimeout="10"
>>     connectionTimeout="6"
>>     connectionUploadTimeout="12"
>>     disableUploadTimeout="false"
>>     enableLookups="false"
>>     keepAliveTimeout="12"
>>     maxConnections="10"
>>     maxPostSize="300"
>>     maxThreads="300"
>>     port="80"
>>     protocol="org.apache.coyote.http11.Http11NioProtocol"
>>     socket.soKeepAlive="true"
>> />
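On the question of whether a thread can die with no stack trace in the logs: a Throwable that escapes `run()` bypasses normal logging unless an uncaught-exception handler is installed. A minimal sketch follows (the `ThreadDeathLogger` class is hypothetical, and note that Tomcat's `Poller.run()` catches `Throwable` itself, so this is a general JVM-wide diagnostic rather than a guaranteed way to observe the poller's death):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadDeathLogger {

    // Last uncaught failure seen, kept for inspection by monitoring code.
    static final AtomicReference<Throwable> LAST = new AtomicReference<Throwable>();

    /**
     * Install a JVM-wide fallback handler that logs any thread dying from
     * an uncaught Throwable. It only fires when nothing else (the thread's
     * own handler, its ThreadGroup, or a catch block) handles the error.
     */
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                LAST.set(e);
                System.err.println("[watchdog] thread '" + t.getName()
                        + "' died with uncaught " + e.getClass().getName());
                e.printStackTrace();
            }
        });
    }
}
```

Calling `ThreadDeathLogger.install()` early in startup (e.g. from a `ServletContextListener`) would at least rule out the "silent uncaught exception" theory.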
Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death
Lars Engholm Johansen wrote:

> Our company runs several high-volume Tomcat 7.0.52 production servers
> on Ubuntu 12.04. We use Tomcat WebSockets (the JSR 356 implementation)
> heavily, with 100M text messages (100 GiB) per day.
>
> We monitor web server health by measuring several key parameters every
> minute, including the Tomcat connection count using:
>
>     mBeanServer.getAttribute(threadPool, "connectionCount");
>     // threadPool is an MBean of "type=ThreadPool"
>
> __The problem__
>
> Our sites can run for days without problems, but once in a while the
> Tomcat connection count suddenly starts growing abnormally fast.
> See this graph: http://imgur.com/s4fOUte
> netstat shows these TCP connections to be mostly in CLOSE_WAIT state.

And if at that moment you force the JVM that runs Tomcat to do a garbage
collection, do you still have these numerous connections in CLOSE_WAIT
state after the GC completes?

> Our sites still function normally, with no CPU spikes during this
> build-up, until around 60,000 connections, but then the server refuses
> further connections and a manual Tomcat restart is required.
>
> We have no output in Tomcat or our logs at the time when this event
> occurs. The only sign is when comparing a full Java thread dump with a
> dump from a newly launched Tomcat: one of http-nio-80-ClientPoller-0
> or http-nio-80-ClientPoller-1 is missing/has died.
>
> We have observed this problem at least since Tomcat 7.0.48 and cannot
> find indications in the Tomcat 7.0.x change logs that it has been
> fixed in newer releases.
>
> Any help or advice is appreciated.
> Best regards,
> Lars Engholm Johansen
>
> Our connector configuration:
Connection count explosion due to thread http-nio-80-ClientPoller-x death
Our company runs several high-volume Tomcat 7.0.52 production servers on
Ubuntu 12.04. We use Tomcat WebSockets (the JSR 356 implementation)
heavily, with 100M text messages (100 GiB) per day.

We monitor web server health by measuring several key parameters every
minute, including the Tomcat connection count using:

    mBeanServer.getAttribute(threadPool, "connectionCount");
    // threadPool is an MBean of "type=ThreadPool"

__The problem__

Our sites can run for days without problems, but once in a while the
Tomcat connection count suddenly starts growing abnormally fast.
See this graph: http://imgur.com/s4fOUte
netstat shows these TCP connections to be mostly in CLOSE_WAIT state.

Our sites still function normally, with no CPU spikes during this
build-up, until around 60,000 connections, but then the server refuses
further connections and a manual Tomcat restart is required.

We have no output in Tomcat or our logs at the time when this event
occurs. The only sign is when comparing a full Java thread dump with a
dump from a newly launched Tomcat: one of http-nio-80-ClientPoller-0 or
http-nio-80-ClientPoller-1 is missing/has died.

We have observed this problem at least since Tomcat 7.0.48 and cannot
find indications in the Tomcat 7.0.x change logs that it has been fixed
in newer releases.

Any help or advice is appreciated.
Best regards,
Lars Engholm Johansen

Our connector configuration:
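The once-a-minute connectionCount polling described above can be sketched as follows. Assumptions not confirmed by the post: the `ConnectionCountProbe` class name is mine, the JMX domain is `Catalina` (it depends on the configured Engine name and may be `Tomcat` on newer versions), and the code must run inside the Tomcat JVM or against a remote JMX connector.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConnectionCountProbe {

    /**
     * Print connectionCount for every Catalina ThreadPool MBean found in
     * this JVM's platform MBeanServer. Returns the number of pools seen,
     * which is 0 when no Tomcat is running in this JVM.
     */
    public static int dump() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Wildcard pattern matches e.g. Catalina:type=ThreadPool,name="http-nio-80"
        Set<ObjectName> pools = server.queryNames(
                new ObjectName("Catalina:type=ThreadPool,*"), null);
        for (ObjectName pool : pools) {
            Object count = server.getAttribute(pool, "connectionCount");
            System.out.println(pool.getKeyProperty("name") + " -> " + count);
        }
        return pools.size();
    }
}
```

Graphing this value per connector (rather than one aggregate) might also reveal whether the explosion is confined to the connector whose poller died.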