On Fri, Jun 5, 2020 at 3:08 PM Martin Grigorov <mgrigo...@apache.org> wrote:

> Hi,
>
> On Thu, Jun 4, 2020 at 9:28 PM Rémy Maucherat <r...@apache.org> wrote:
>
>> On Thu, Jun 4, 2020 at 6:16 PM Mark Thomas <ma...@apache.org> wrote:
>>
>>> On 04/06/2020 16:33, Martin Grigorov wrote:
>>> > Hi team,
>>> >
>>> > I am profiling our application and I've found that Tomcat uses its own
>>> > Synchronized Queue/Stack implementations which synchronize on instance
>>> > level, i.e. their methods are 'synchronized'.
>>>
>>
>> Can you work off 10 (master) first ?
>>
>
> The application is based on Spring Boot and my attempts to
> "jakartaee migrate it" were not successful.
>
> But I've created a very simple Servlet and I will try it on Tomcat 10.x
> soon:
>
> @WebServlet(urlPatterns = "/test")
> public class PlainTextServlet extends HttpServlet {
>
>     public static final String CONTENT = "Hello world!";
>     public static final int CONTENT_LENGTH = CONTENT.length();
>
>     @Override
>     protected void doGet(final HttpServletRequest req,
>                          final HttpServletResponse resp) throws IOException {
>         resp.setContentType("text/plain");
>         resp.setContentLength(CONTENT_LENGTH);
>         // write() rather than println() so the body matches the declared length
>         resp.getWriter().write(CONTENT);
>     }
> }
>
> For load testing it I use wrk: wrk -c96 -t8 -d60s https://host:port/test
> The GC JVM arguments are: -Xloggc:./gc.log -XX:+PrintGCDetails -verbose:gc
> JDK 1.8.0 b252
>
> The GC logs are:
> - Tomcat 9.0.x (uses SynchronizedQueue/Stack):
> https://gist.github.com/martin-g/d2570e7a6896e4d094ce548ceea3adb6
> - Tomcat 9.0.x with my changes:
> https://gist.github.com/martin-g/52c7d3a883b37e9bcd11ad6430800852
> I've uploaded them to https://gceasy.io/ and the charts are similar.
>
> The summary from wrk with SynchronizedQueue/Stack is:
>
> Running 1m test @ http://192.168.0.206:8080
>   8 threads and 96 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency     1.24ms  788.15us  19.46ms   91.44%
>     Req/Sec    10.04k     1.64k   12.45k    79.48%
>   4797193 requests in 1.00m, 600.19MB read
> Requests/sec:  79834.45
> Transfer/sec:      9.99MB
>
> I didn't preserve the summary from the lock-free run, but it was similar:
> the throughput is around 80K req/sec, plus or minus a few thousand.
> I.e. the throughput does not improve after my changes!
> Only the lock profiling results become better (Async Profiler:
> ./profiler.sh -e lock -d 50 -f lock-results.txt PID)
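[Editorial note: the kind of swap Martin's branch makes can be sketched with a lock-free LIFO built on java.util.concurrent.ConcurrentLinkedDeque. This is an illustrative sketch only; the class names and the exact structure the branch uses are assumptions, not the actual patch.]

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical sketch: a lock-free replacement for Tomcat's
// SynchronizedStack, whose push/pop are 'synchronized' methods.
// ConcurrentLinkedDeque gives the same LIFO semantics via CAS on the
// head, at the cost of allocating one node per element (the GC
// trade-off the SynchronizedStack javadoc warns about).
public class LockFreeStackSketch {
    private final ConcurrentLinkedDeque<byte[]> stack = new ConcurrentLinkedDeque<>();

    // Adds at the head; no monitor is taken, so contending threads
    // spin on a CAS instead of parking on a lock.
    public void push(byte[] buf) {
        stack.push(buf);
    }

    // Removes from the head; returns null when empty, mirroring
    // SynchronizedStack.pop()'s contract.
    public byte[] pop() {
        return stack.poll();
    }

    public static void main(String[] args) {
        LockFreeStackSketch s = new LockFreeStackSketch();
        byte[] a = new byte[16];
        s.push(a);
        System.out.println(s.pop() == a);    // the same buffer comes back
        System.out.println(s.pop() == null); // empty stack yields null
    }
}
```

Under a fixed request rate the lock time simply moves out of the profile, which is consistent with the throughput staying flat while the lock report improves.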
>

Lately I have rarely gotten any gains from making this kind of optimization. But this sounds
good (or at least not bad).

Rémy


>
> Martin
>
>
>>
>>>
>>> ~8 years ago, when I was profiling Tomcat and wrote those classes,
>>> having something targeted at Tomcat's specific use case made a
>>> measurable improvement.
>>>
>>
>> Yes, the SyncStack was the go-to structure for all reuse. Not sure if
>> it's the best.
>>
>> I already removed some of the usage in 10:
>> - No more block poller and its pooling/recycling [I had to pull its
>> removal out of 9 due to an unfortunate regression, this could be brought
>> back now that it seems to work]
>> - No recycling of trivial objects
>>
>> The main use should now be the recycled processors and the recycled
>> channels (it would be bad to not reuse them since they may use direct
>> buffers ...).
>>
>>
>>>
>>> I am more than prepared to believe that things have moved on since then
>>> but I'd be interested in seeing the same numbers with the latest Java 8
>>> from Adopt OpenJDK rather than Java 14.
>>>
>>
>> +1, finally there could be some differences between JDKs; for example,
>> they have now started to actually improve NIO 1 again.
>>
>> Rémy
>>
>>
>>>
>>> I'd also like to see overall throughput numbers and the GC frequency for
>>> each test scenario.
>>>
>>> Mark
>>>
>>>
>>>
>>> >
>>> > The javadoc for those two classes says that they are preferred because
>>> > they are GC-friendlier than
>>> > java.util.concurrent.ConcurrentLinkedQueue.
>>> >
>>> > I use Async Profiler
>>> > (https://github.com/jvm-profiling-tools/async-profiler) to profile
>>> > CPU, memory, locks, and a few lower-level event types (from perf).
>>> > Issuing "./profiler.sh -e lock -d 60 PID" against 9.0.x shows reports
>>> > such as:
>>> >
>>> > Started [lock] profiling
>>> >
>>> > --- Execution profile ---
>>> >
>>> > Total samples       : 713834
>>> > Frame buffer usage  : 0.0288%
>>> >
>>> > --- 1106165618177 ns (58.97%), 395609 samples
>>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedStack
>>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedStack.push
>>> >   [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
>>> >   [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
>>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
>>> >   [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
>>> >   [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
>>> >   [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
>>> >   [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
>>> >   [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
>>> >   [10] org.apache.coyote.http11.Http11OutputBuffer.end
>>> >   [11] org.apache.coyote.http11.Http11Processor.finishResponse
>>> >   [12] org.apache.coyote.AbstractProcessor.action
>>> >   [13] org.apache.coyote.Response.action
>>> >   [14] org.apache.catalina.connector.OutputBuffer.close
>>> >   [15] org.apache.catalina.connector.Response.finishResponse
>>> >   [16] org.apache.catalina.connector.CoyoteAdapter.service
>>> >   [17] org.apache.coyote.http11.Http11Processor.service
>>> >   [18] org.apache.coyote.AbstractProcessorLight.process
>>> >   [19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>>> >   [20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
>>> >   [21] org.apache.tomcat.util.net.SocketProcessorBase.run
>>> >   [22] java.util.concurrent.ThreadPoolExecutor.runWorker
>>> >   [23] java.util.concurrent.ThreadPoolExecutor$Worker.run
>>> >   [24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>>> >   [25] java.lang.Thread.run
>>> >
>>> > --- 769519847169 ns (41.03%), 314384 samples
>>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedStack
>>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop
>>> >   [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
>>> >   [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
>>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
>>> >   [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
>>> >   [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
>>> >   [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
>>> >   [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
>>> >   [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
>>> >   [10] org.apache.coyote.http11.Http11OutputBuffer.end
>>> >   [11] org.apache.coyote.http11.Http11Processor.finishResponse
>>> >    ....
>>> >
>>> >
>>> > I've made the following changes
>>> > -
>>> https://github.com/apache/tomcat/compare/9.0.x...martin-g:feature/lock-free-queue-and-stack
>>> > and ran the test suite - all green!
>>> >
>>> > Now the output from the locks profiling looks like:
>>> >
>>> >
>>> > --- 72354435197 ns (47.98%), 4008 samples
>>> >   [ 0] java.util.concurrent.locks.ReentrantLock$NonfairSync
>>> >   [ 1] jdk.internal.misc.Unsafe.park
>>> >   [ 2] java.util.concurrent.locks.LockSupport.park
>>> >   [ 3] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire
>>> >   [ 4] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly
>>> >   [ 5] java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly
>>> >   [ 6] java.util.concurrent.locks.ReentrantLock.lockInterruptibly
>>> >   [ 7] java.util.concurrent.LinkedBlockingQueue.poll
>>> >   [ 8] org.apache.tomcat.util.threads.TaskQueue.poll
>>> >   [ 9] org.apache.tomcat.util.threads.TaskQueue.poll
>>> >   [10] java.util.concurrent.ThreadPoolExecutor.getTask
>>> >   [11] java.util.concurrent.ThreadPoolExecutor.runWorker
>>> >   [12] java.util.concurrent.ThreadPoolExecutor$Worker.run
>>> >   [13] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>>> >   [14] java.lang.Thread.run
>>> >
>>> > --- 58413748017 ns (38.74%), 10179 samples
>>> >   [ 0] java.lang.Object
>>> >   [ 1] org.apache.catalina.core.StandardContext$ContextFilterMaps.asArray
>>> >   [ 2] org.apache.catalina.core.StandardContext.findFilterMaps
>>> >   [ 3] org.apache.catalina.core.ApplicationFilterFactory.createFilterChain
>>> >   [ 4] org.apache.catalina.core.StandardWrapperValve.invoke
>>> >   [ 5] org.apache.catalina.core.StandardContextValve.invoke
>>> >   [ 6] org.apache.catalina.authenticator.AuthenticatorBase.invoke
>>> >   [ 7] org.apache.catalina.core.StandardHostValve.invoke
>>> >   [ 8] org.apache.catalina.valves.ErrorReportValve.invoke
>>> >   [ 9] org.apache.catalina.core.StandardEngineValve.invoke
>>> >   [10] org.apache.catalina.connector.CoyoteAdapter.service
>>> >   [11] org.apache.coyote.http11.Http11Processor.service
>>> >   [12] org.apache.coyote.AbstractProcessorLight.process
>>> >   [13] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>>> >   [14] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
>>> >   [15] org.apache.tomcat.util.net.SocketProcessorBase.run
>>> >   [16] java.util.concurrent.ThreadPoolExecutor.runWorker
>>> >   [17] java.util.concurrent.ThreadPoolExecutor$Worker.run
>>> >   [18] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>>> >   [19] java.lang.Thread.run
>>> >
>>> > --- 5856040617 ns (3.88%), 11395 samples
>>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedQueue
>>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedQueue.offer
>>> >   [ 2] org.apache.tomcat.util.net.NioEndpoint$Poller.addEvent
>>> >   [ 3] org.apache.tomcat.util.net.NioEndpoint$Poller.add
>>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.registerReadInterest
>>> >   [ 5] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>>> >   ...
>>> >
>>> >
>>> > I monitor the GC both with -Xloggc:gc.log and with JMX and the patterns
>>> > are the same, no new spikes.
>>> > JDK 14 with G1GC.
>>> >
>>> > I load the application with wrk (https://github.com/wg/wrk/) and it
>>> > drives 70-80K requests per second.
>>> >
>>> > I guess there is a reason to use these custom synchronized data
>>> > structures but so far I don't see any drawbacks after moving to the
>>> > lock-free impls.
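[Editorial note: the "reason" the javadoc gives is the allocation pattern. A simplified sketch of the array-backed design behind Tomcat's SynchronizedStack (not the actual source; the real class also grows the array and caps its size) shows why it is GC-friendlier than a linked queue: reused array slots instead of a freshly allocated node per element.]

```java
// Simplified sketch of an array-backed synchronized stack in the style
// of org.apache.tomcat.util.collections.SynchronizedStack.
public class ArrayStackSketch<T> {
    private final Object[] slots = new Object[128]; // fixed here; the real impl grows
    private int index = -1;

    // One monitor per instance. Push stores into an existing slot, so a
    // steady push/pop cycle allocates nothing - unlike a linked
    // structure, which creates (and later garbage-collects) a node
    // per element.
    public synchronized boolean push(T obj) {
        if (index == slots.length - 1) {
            return false; // full; the real impl would resize up to a limit
        }
        slots[++index] = obj;
        return true;
    }

    @SuppressWarnings("unchecked")
    public synchronized T pop() {
        if (index == -1) {
            return null; // empty, matching SynchronizedStack.pop()
        }
        T result = (T) slots[index];
        slots[index--] = null; // clear the slot so the object can be collected
        return result;
    }

    public static void main(String[] args) {
        ArrayStackSketch<String> st = new ArrayStackSketch<>();
        st.push("a");
        st.push("b");
        System.out.println(st.pop()); // LIFO order: "b" first
        System.out.println(st.pop());
    }
}
```

So the trade-off is zero-allocation reuse plus monitor contention versus lock-free access plus per-element node garbage, which matches the flat GC charts and flat throughput observed here.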
>>> >
>>> > Regards,
>>> > Martin
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
>>> For additional commands, e-mail: dev-h...@tomcat.apache.org
>>>
>>>
