Re: Tomcat 8 epoll spinning issue (100% CPU)

Emmanuel Lecharny Fri, 04 Oct 2019 18:26:05 -0700

On 2019/10/04 22:47:17, Christopher Schultz <ch...@christopherschultz.net> 
wrote: 
> Emmanuel,
> 
> On 10/4/19 16:38, Emmanuel Lecharny wrote:
> > Hi remy,
> >
> > On 2019/10/04 15:37:36, Rémy Maucherat <r...@apache.org> wrote:
> >> On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny
> >> <elecha...@apache.org> wrote:
> >>
> >>> Hi !
> >>>
> >>> I filed a ticket yesterday about a problem we face with many NIO
> >>> frameworks, which I think could hit Tomcat too (see
> >>> https://bz.apache.org/bugzilla/show_bug.cgi?id=63802).
> >>> Actually, I think I'm facing this problem on a project I'm
> >>> working on at the moment.
> >>>
> >>> Remy suggested we discuss it on this mailing list.
> >>>
> >>> Bottom line, what happens is that under some circumstances not
> >>> well defined, the call to select() might end up in an infinite
> >>> loop eating all the CPU (select() returns 0, so select is
> >>> immediately called again, and we loop).
> >>>
> >>> In various NIO frameworks - and being a MINA committer, I have
> >>> implemented the discussed workaround -, we are controlling this
> >>> situation by breaking this infinite loop this way:
> >>> - if the select() call returns 0
> >>> - then if we have called select() more than N times in less
> >>>   than M ms (N=10, M=100 in MINA)
> >>> - then we create a new Selector, register all the SelectionKeys
> >>>   that were registered on the broken selector, and ditch the
> >>>   old selector.
> >>>
> >>> This workaround does not cost a lot when the selector works as
> >>> designed, as a select() call should never return 0.
> >>>
> >>
> >> There's actually a very similar hack for APR that has been placed
> >> by myself a long time ago [
> >> https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> >> ], I don't even know if it's actually useful and it's certainly not
> >> testable. Overall what it does is pretty terrible :(
> >>
> >> Personally I would like to know more about this "long lived bug
> >> either in the JDK or even in Linux epoll implementation" like
> >> actual platform details and JVM versions used since I've never
> >> heard about it in the first place.
> >
> > For the record, I had a discussion yesterday with one of my close
> > friends and co-worker back in the 90's. He remembers clearly, while
> > working on the SUN TCP stack, that such a problem occurred back
> > then. Yes, 25 years ago... OK, that was just for fun, it's
> > likely to be perfectly unrelated ;-)
> >
> > At MINA, we were hit by this bug in 2009 (see
> > https://issues.apache.org/jira/browse/DIRMINA-678), and it was
> > linked to a bug reported on Jetty
> > (http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html),
> > itself related to some JDK bugs, supposedly fixed since then.
> >
> > I had a long conversation with Jean-François Arcand somewhere
> > around this date, and he suggested we adopt the same workaround he
> > applied to Grizzly. We also had a convo with Alan Bateman during a
> > Java One in SF, but nothing specific resulted from this convo,
> > except that AFAICR, he acknowledged there is an issue.
> >
> > So this problem started with JDK 6, but I can't guarantee it wasn't
> > already present in JDK 5 or 4, on linux, and not on any other OS
> > like windows or Mac OSX. It's not exactly fresh in my mind, because
> > it was already 10 years ago.
> >
> >> Also I'd like to know since NIO2 doesn't expose its poller and
> >> almost certainly doesn't have such a platform specific mysterious
> >> thing inside it [we can check I guess].
> >
> > No idea, but I think NIO.2 has just added some coating over what
> > was NIO.1 (gut feeling here...).
> >
> > In the context of NIO, do you have evidence the
> >> hack has been tested to work (besides avoiding the CPU loop) and
> >> allowed the server to continue its regular operation without any
> >> impact ?
> >
> > Absolutely. We do log in MINA when a new selector is created, and
> > we have had some issue related to a case where this piece of code
> > was called, fixed since :
> > https://issues.apache.org/jira/browse/DIRMINA-762?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
> >
> > So we definitely know that people get hit by the initial issue
> > (select returns 0), a new selector is being created, and everything
> > is fine from the user perspective (I do believe that creating the
> > new selector and registering all the SelectionKey on it is not
> > worse than having to restart the server manually...)
> >
> > In any case, Grizzly has probably the best possible approach to
> > this problem: make the workaround optional.
> >
> > For Tomcat, I'm tempted to use the Http11AprProtocol class instead
> > of the NIO one, as one can swap the protocol in the configuration,
> > but the impact is that you need OpenSSL already installed on your
> > machine. That would be an acceptable workaround in my case, but a
> > painful one. A similar approach would be pleasant to have : a
> > Http11NIONoSpinProtocol class that we can use if needed.
> 
> I'm inclined to just build this into the standard protocol class with
> some good documentation explaining why the hack is in there. You will
> never know you need it until you suddenly need it, and then it's too
> late.
> 
> Is this only a problem when select() returns 0? That is... is there
> really a reason to do the N times in M ms check? Can we simply replace
> the Selector if select() ever returns 0? Or are there legitimate
> use-cases for that return value under certain circumstances?

We give the selector some opportunity to 'fix' itself by letting it loop N
times. If for any reason a first call to select() returns 0 but the immediate
second call does not, or if it only returns 0 again after M ms, it would be
bad to ditch it. That's why we wait for N empty returns within M ms: when the
returns are spread over more than M ms, the selector has somehow gone back to
waiting for events and doing nothing in between, so we don't eat CPU.
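
A minimal sketch of that "N times in M ms" check, in Java. The names
SpinDetector and onSelect are made up for illustration and are not MINA's or
Tomcat's actual code; only N=10 and M=100 come from the MINA values quoted
above. The caller passes in the current time, which is an assumption made so
the logic can be exercised without a real Selector.

```java
/** Hypothetical helper: flags a selector that returns 0 too often, too fast. */
final class SpinDetector {
    private final int maxZeroSelects;  // N: empty returns tolerated per window
    private final long windowMillis;   // M: length of the window in ms
    private int zeroSelects;           // empty returns seen in the current window
    private long windowStart;          // time of the first empty return in the window

    SpinDetector(int maxZeroSelects, long windowMillis) {
        this.maxZeroSelects = maxZeroSelects;
        this.windowMillis = windowMillis;
    }

    /**
     * Call after every select().
     * @return true when more than N empty returns happened within M ms,
     *         i.e. when the caller should replace the selector
     */
    boolean onSelect(int selectedKeys, long nowMillis) {
        if (selectedKeys > 0) {
            zeroSelects = 0;           // the selector did useful work: forgive it
            return false;
        }
        if (zeroSelects == 0 || nowMillis - windowStart > windowMillis) {
            // First empty return, or the previous streak was slow enough
            // that the selector was really blocking: start a new window.
            windowStart = nowMillis;
            zeroSelects = 1;
            return false;
        }
        zeroSelects++;
        if (zeroSelects > maxZeroSelects) {
            zeroSelects = 0;           // start clean for the new selector
            return true;
        }
        return false;
    }
}
```

In a poller loop this would be called right after select(), passing
System.currentTimeMillis() as the second argument, and the selector would be
replaced whenever it returns true.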

> 
> Instead of implementing N / M, why not simply maintain a counter of
> "useless select()s" and then replace the Selector when the count gets
> too high? 

Because a useless select() that happens only once in a while is not a problem,
and ditching the Selector every time one occurs would not be a free operation
in this case.
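
To give an idea of why replacing the Selector is not free, here is a rough
sketch of what the replacement involves with the plain java.nio API: every
valid key has to be registered again on a new Selector. SelectorRebuilder and
rebuild() are hypothetical names, this is not the MINA code, and the sketch
assumes it runs on the thread that owns the selector.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Hypothetical helper: moves all registrations to a fresh Selector. */
final class SelectorRebuilder {
    static Selector rebuild(Selector broken) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : broken.keys()) {
            if (!key.isValid()) {
                continue;                       // cancelled or closed: nothing to move
            }
            SelectableChannel channel = key.channel();
            // Carry over the interest set and the attachment unchanged.
            channel.register(fresh, key.interestOps(), key.attachment());
        }
        broken.close();                         // invalidates all the old keys
        return fresh;
    }
}
```

Note that the SelectionKey objects change, so anything that cached the old
keys has to look them up again, for instance with channel.keyFor(newSelector).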

> Or, perhaps a tweak, something like this (pseudocode):
> 
>     int badness = 0;
> 
>     while(dontStop) {
>         if(0 == select(..)) {
>           badness++;
> 
>           if(badness > threshold) {
>               // replace selector
>           }
>         } else {
>           // do useful work
> 
>           badness = Math.max(0, badness - 1);
>         }
>     }
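
For comparison, a runnable Java version of that decaying counter could look
like the sketch below. BadnessCounter and record() are hypothetical names, and
resetting the counter once the threshold is crossed is an assumption of this
sketch, not something stated in the pseudocode. The counter is clamped at zero
with Math.max so that useful selects can forgive earlier empty ones without
the value going negative.

```java
/** Hypothetical helper: counts empty select() returns, decaying on useful ones. */
final class BadnessCounter {
    private final int threshold;
    private int badness;

    BadnessCounter(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Call after every select().
     * @return true when the selector should be replaced
     */
    boolean record(int selectedKeys) {
        if (selectedKeys == 0) {
            badness++;
            if (badness > threshold) {
                badness = 0;   // assumption: start clean after a replacement
                return true;
            }
        } else {
            // Useful work: decay the counter, but never below zero.
            badness = Math.max(0, badness - 1);
        }
        return false;
    }

    int badness() {
        return badness;
    }
}
```

Unlike the N / M check, this variant never looks at the clock, so a selector
that returns 0 once in a while over a long period is only saved by the decay
on useful selects.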


Side note: if one considers that computing a time delta (calling
System.currentTimeMillis()) is wasting CPU - I would agree up to a point -,
it's an option to do so only after the first detection of a select() returning
0.

Emmanuel

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org