On Fri, Jun 5, 2020 at 3:08 PM Martin Grigorov <mgrigo...@apache.org> wrote:
> Hi, > > On Thu, Jun 4, 2020 at 9:28 PM Rémy Maucherat <r...@apache.org> wrote: > >> On Thu, Jun 4, 2020 at 6:16 PM Mark Thomas <ma...@apache.org> wrote: >> >>> On 04/06/2020 16:33, Martin Grigorov wrote: >>> > Hi team, >>> > >>> > I am profiling our application and I've found that Tomcat uses its own >>> > Synchronized Queue/Stack implementations which synchronize on instance >>> > level, i.e. their methods are 'synchronized'. >>> >> >> Can you work off 10 (master) first ? >> > > The application is based on Spring Boot and my attempts to > "jakartaee migrate it" were not successful. > > But I've created a very simple Servlet and I will try it on Tomcat 10.x > soon: > > @WebServlet(urlPatterns = "/test") > public class PlainTextServlet extends HttpServlet { > > public static final String CONTENT = "Hello world!"; > public static final int CONTENT_LENGTH = CONTENT.length(); > > @Override > protected void doGet(final HttpServletRequest req, final > HttpServletResponse resp) throws IOException { > resp.setContentType("text/plain"); > resp.setContentLength(CONTENT_LENGTH); > resp.getWriter().println(CONTENT); > } > } > > For load testing it I use wrk: wrk -c96 -t8 -d60s https://host:port/test > The GC JVM arguments are: -Xloggc:./gc.log -XX:+PrintGCDetails -verbose:gc > JDK 1.8.0 b252 > > The GC logs are: > - Tomcat 9.0.x (uses SynchronizedQueue/Stack): > https://gist.github.com/martin-g/d2570e7a6896e4d094ce548ceea3adb6 > - Tomcat 9.0.x with my changes: > https://gist.github.com/martin-g/52c7d3a883b37e9bcd11ad6430800852 > I've uploaded them to https://gceasy.io/ and the charts are similar. > > The summary from wrk with SynchronizedQueue/Stack is: > > Running 1m test @ http://192.168.0.206:8080 > 8 threads and 96 connections > Thread Stats Avg Stdev Max +/- Stdev > Latency 1.24ms 788.15us 19.46ms 91.44% > Req/Sec 10.04k 1.64k 12.45k 79.48% > 4797193 requests in 1.00m, 600.19MB read > Requests/sec: 79834.45 > Transfer/sec: 9.99MB > > I didn't preserve the summary from the lock-free run but it was similar - > the throughput is around 80K/sec +- few thousands. > I.e. the throughput does not improve after my changes! > Only the locking profiling results become better (Async Profiler: > ./profile.sh -e lock -d50s -f lock-results.txt PID) > I rarely got any gains when making an optimization lately. But this sounds good [or not bad at least]. Rémy > > Martin > > >> >>> >>> ~8 years ago when I was profiling Tomcat and wrote those having >>> something targetted to Tomcat's specific use case made a measurable >>> improvement. >>> >> >> Yes, the SyncStack was the go to structure for all reuse. Not sure if >> it's the best. >> >> I already removed some of the usage in 10: >> - No more block poller and its pooling/recycling [I had to pull its >> removal out of 9 due to an unfortunate regression, this could be brought >> back now that it seems to work] >> - No recycling of trivial objects >> >> The main use should now be the recycled processors and the recycled >> channels (it would be bad to not reuse them since they may use direct >> buffers ...). >> >> >>> >>> I am more than prepared to believe that things have moved on since then >>> but I'd be interested in seeing the same numbers with the latest Java 8 >>> from Adopt OpenJDK rather than Java 14. >>> >> >> +1, firnally there could be some differences between JDK, like now they >> started to actually improve NIO 1 again. >> >> Rémy >> >> >>> >>> I'd also like to see overall throughput numbers and the GC frequency for >>> each test scenario. >>> >>> Mark >>> >>> >>> >>> > >>> > The javadoc for those two classes say that they are preferred because >>> > they are GC-friendlier than >>> > java.uti.concurrent.ConcurrentLinkedQueue. >>> > >>> > I use Async Profiler >>> > (https://github.com/jvm-profiling-tools/async-profiler) to profile >>> CPU, >>> > memory, >>> > locks and few lower level types (from perf). >>> > Issuing "./profiler.sh -e lock -d 60 PID" against 9.0.x shows such >>> reports: >>> > >>> > Started [lock] profiling >>> > >>> > --- Execution profile --- >>> > >>> > Total samples : 713834 >>> >>> > >>> > >>> > >>> > Frame buffer usage : 0.0288% >>> >>> > >>> > >>> > --- >>> > 1106165618177 ns (58.97%), 395609 samples >>> > >>> > [ 0] >>> > org.apache.tomcat.util.collections.SynchronizedStack >>> >>> > >>> > [ 1] >>> > org.apache.tomcat.util.collections.SynchronizedStack.push >>> > >>> > [ 2] >>> > org.apache.tomcat.util.net.NioBlockingSelector.write >>> >>> > >>> > [ 3] >>> > org.apache.tomcat.util.net.NioSelectorPool.write >>> >>> > >>> > [ 4] >>> > org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite >>> >>> > >>> > [ 5] >>> > org.apache.tomcat.util.net.SocketWrapperBase.doWrite >>> >>> > >>> > [ 6] >>> > org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking >>> >>> > >>> > [ 7] >>> > org.apache.tomcat.util.net.SocketWrapperBase.flush >>> >>> > >>> > [ 8] >>> > org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end >>> >>> > [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end >>> >>> > >>> > [10] org.apache.coyote.http11.Http11OutputBuffer.end >>> > [11] org.apache.coyote.http11.Http11Processor.finishResponse >>> > [12] org.apache.coyote.AbstractProcessor.action >>> > [13] org.apache.coyote.Response.action >>> > [14] org.apache.catalina.connector.OutputBuffer.close >>> > [15] org.apache.catalina.connector.Response.finishResponse >>> > [16] org.apache.catalina.connector.CoyoteAdapter.service >>> > [17] org.apache.coyote.http11.Http11Processor.service >>> > [18] org.apache.coyote.AbstractProcessorLight.process >>> > [19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process >>> > [20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun >>> > [21] org.apache.tomcat.util.net.SocketProcessorBase.run >>> > [22] java.util.concurrent.ThreadPoolExecutor.runWorker >>> > [23] java.util.concurrent.ThreadPoolExecutor$Worker.run >>> > [24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run >>> > [25] java.lang.Thread.run >>> > >>> > --- 769519847169 ns (41.03%), 314384 samples >>> > [ 0] org.apache.tomcat.util.collections.SynchronizedStack >>> > [ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop >>> > [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write >>> > [ 3] org.apache.tomcat.util.net.NioSelectorPool.write >>> > [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite >>> > [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite >>> > [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking >>> > [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush >>> > [ 8] >>> org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end >>> > [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end >>> > [10] org.apache.coyote.http11.Http11OutputBuffer.end >>> > [11] org.apache.coyote.http11.Http11Processor.finishResponse >>> > .... >>> > >>> > >>> > I've made the following changes >>> > - >>> https://github.com/apache/tomcat/compare/9.0.x...martin-g:feature/lock-free-queue-and-stack >>> > and ran the test suite - all green! >>> > >>> > Now the output from the locks profiling looks like: >>> > >>> > >>> > --- 72354435197 ns (47.98%), 4008 samples >>> > [ 0] java.util.concurrent.locks.ReentrantLock$NonfairSync >>> > [ 1] jdk.internal.misc.Unsafe.park >>> > [ 2] java.util.concurrent.locks.LockSupport.park >>> > [ 3] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire >>> > [ 4] >>> > >>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly >>> > [ 5] java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly >>> > [ 6] java.util.concurrent.locks.ReentrantLock.lockInterruptibly >>> > [ 7] java.util.concurrent.LinkedBlockingQueue.poll >>> > [ 8] org.apache.tomcat.util.threads.TaskQueue.poll >>> > [ 9] org.apache.tomcat.util.threads.TaskQueue.poll >>> > [10] java.util.concurrent.ThreadPoolExecutor.getTask >>> > [11] java.util.concurrent.ThreadPoolExecutor.runWorker >>> > [12] java.util.concurrent.ThreadPoolExecutor$Worker.run >>> > [13] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run >>> > [14] java.lang.Thread.run >>> > >>> > --- 58413748017 ns (38.74%), 10179 samples >>> > [ 0] java.lang.Object >>> > [ 1] >>> org.apache.catalina.core.StandardContext$ContextFilterMaps.asArray >>> > [ 2] org.apache.catalina.core.StandardContext.findFilterMaps >>> > [ 3] >>> org.apache.catalina.core.ApplicationFilterFactory.createFilterChain >>> > [ 4] org.apache.catalina.core.StandardWrapperValve.invoke >>> > [ 5] org.apache.catalina.core.StandardContextValve.invoke >>> > [ 6] org.apache.catalina.authenticator.AuthenticatorBase.invoke >>> > [ 7] org.apache.catalina.core.StandardHostValve.invoke >>> > [ 8] org.apache.catalina.valves.ErrorReportValve.invoke >>> > [ 9] org.apache.catalina.core.StandardEngineValve.invoke >>> > [10] org.apache.catalina.connector.CoyoteAdapter.service >>> > [11] org.apache.coyote.http11.Http11Processor.service >>> > [12] org.apache.coyote.AbstractProcessorLight.process >>> > [13] org.apache.coyote.AbstractProtocol$ConnectionHandler.process >>> > [14] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun >>> > [15] org.apache.tomcat.util.net.SocketProcessorBase.run >>> > [16] java.util.concurrent.ThreadPoolExecutor.runWorker >>> > [17] java.util.concurrent.ThreadPoolExecutor$Worker.run >>> > [18] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run >>> > [19] java.lang.Thread.run >>> > >>> > --- 5856040617 ns (3.88%), 11395 samples >>> > [ 0] org.apache.tomcat.util.collections.SynchronizedQueue >>> > [ 1] org.apache.tomcat.util.collections.SynchronizedQueue.offer >>> > [ 2] org.apache.tomcat.util.net.NioEndpoint$Poller.addEvent >>> > [ 3] org.apache.tomcat.util.net.NioEndpoint$Poller.add >>> > [ 4] >>> > org.apache.tomcat.util.net >>> .NioEndpoint$NioSocketWrapper.registerReadInterest >>> > [ 5] org.apache.coyote.AbstractProtocol$ConnectionHandler.process >>> > ... >>> > >>> > >>> > I monitor the GC both with -Xloggc:gc.log and with JMX and the patterns >>> > are the same, no new spikes. >>> > JDK 14 with G1GC. >>> > >>> > I load the application with wrk (https://github.com/wg/wrk/) and it >>> > makes 70/80K per second. >>> > >>> > I guess there is a reason to use these custom synchronized data >>> > structures but so far I don't see any drawbacks after moving to the >>> > lock-free impls. >>> > >>> > Regards, >>> > Martin >>> > >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org >>> For additional commands, e-mail: dev-h...@tomcat.apache.org >>> >>>