[ https://issues.apache.org/jira/browse/LOG4J2-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340369#comment-15340369 ]
Anthony Maire edited comment on LOG4J2-1430 at 6/20/16 8:52 PM: ---------------------------------------------------------------- I made further tests today, and while I still believe that the Conversant Disruptor is not the best alternative because of the coupling between the claiming slot order an the data publishing order, the main issue is not there. These throughput benchmarks only mesure one thing: how much is the consumer thread impacted by the wait strategy. And as soon as the strategy does not use condition variables, the more the consumer threads are paused, the better the throughput will be As I said in a previous comment, even with 24 or 36 threads, only 12 hyperthreads were busy with the JCTools queue with a LockSupport.parkNanos wait strategy. So most of the time, the queue is full. I tried to play with the implementation of the wait strategy in my JCTools blocking wrapper to use something that is more like the one used in LMAX or Conversant disruptor with some spin/yield before falling back to park. The result is not what I expected : the yield loop made the JCTool queue throughput lower !! I think it can be explained by false sharing : when the queue is nearly full, the producers will write to slots that are adjacent to the one that is read by the consumer. Every time an event is inserted in the queue, that cache line is evicted from consumer core L1 cache, causing a cache miss. If producers are paused for a longer time, then the consumer can append events without these cache misses for a longer time, and the throughput is better That's why the Conversant Disruptor was performing way better under window than under Linux : the pause time of parkNanos is 20 times longer on windows and it was enough to avoid the spin/yield loop negative impact in the wait strategy. The main question with these benchmarks is: what do we want to measure, how can we say that an implementation is more "performant" than another one ? The throughput with a no-op appender is maybe not the best metric: in real applications, there will be real filters + appenders that will not support such a throughput, so we are mesuring something that does not exist in the real world. In my opinion, there are 2 usecases where an async appender is useful : - when we need to increase the maximal throughput of the underlying appender thanks to batching (cf RandomAccessFile appender with immediateFlush = false). The queue implementation has few impact on this - when we need to keep latency low in the application thread for a realistic load (i.e we don't expect the queue to be always nearly full, if this does happen, then another part of the system need improvements) As Remko said, log4J is used in a lot of situations that we cannot control. And saying something like "you shouldn't have more active threads than cores in your app" is not a solution, if the server is used for other apps it will not be enough. Being lock-free on the producer side (as soon as the queue is not full) is a mandatory characteristic in my opinion. The intrisinc latency of the queue is often not the main characterisic : there are lot's of things done in the main thread before enqueuing, especially with parametrized messages. The call to the logger can take a few microseconds, whereas any reasonable high-performance collection will enqueue in something like 100ns under low contention (and the saturation rate of the underlying appender will often happen way before achieving high contention on the queue). was (Author: anthony maire): I made further tests today, and while I still believe that the Conversant Disruptor is not the best alternative because of the coupling between the claiming slot order an the data publishing order, the main issue is not there. These throughput benchmarks only mesure one thing: how much is the consumer thread impacted by the wait strategy. And as soon as the strategy does not use condition variables, the more the consumer threads are paused, the better the throughput will be As I said in a previous comment, even with 24 or 36 threads, only 12 hyperthreads were busy with the JCTools queue with a LockSupport.parkNanos wait strategy. So most of the time, the queue is full. I tried to play with so implementation of the wait strategy in my JCTools blocking wrapper to use something that is more like the one used in LMAX or Conversant disruptor with some spin/yield before falling back to park. The result is not what I expected : the yield loop made the JCTool queue throughput lower !! I think it can be explained by false sharing : when the queue is nearly full, the producers will write to slots that are adjacent to the one that is read by the consumer. Every time an event is inserted in the queue, that cache line is evicted from consumer core L1 cache, causing a cache miss. If producers are paused for a longer time, then the consumer can append events without these cache misses for a longer time, and the throughput is better That's why the Conversant Disruptor was performing way better under window than under Linux : the pause time of parkNanos is 20 times longer on windows and it was enough to avoid the spin/yield loop negative impact in the wait strategy. The main question with these benchmarks is: what do we want to measure, how can we say that an implementation is more "performant" than another one ? The throughput with a no-op appender is maybe not the best metric: in real applications, there will be real filters + appenders that will not support such a throughput, so we are mesuring something that does not exist in the real world. In my opinion, there are 2 usecases where an async appender is useful : - when we need to increase the maximal throughput of the underlying appender thanks to batching (cf RandomAccessFile appender with immediateFlush = false). The queue implementation has few impact on this - when we need to keep latency low in the application thread for a realistic load (i.e we don't expect the queue to be always nearly full, if this does happen, then another part of the system need improvements) As Remko said, log4J is used in a lot of situations that we cannot control. And saying something like "you shouldn't have more active threads than cores in your app" is not a solution, if the server is used for other apps it will not be enough. Being lock-free on the producer side (as soon as the queue is not full) is a mandatory characteristic. The intrisinc latency of the queue is often not the main characterisic : there are lot's of things done in the main thread before enqueuing, especially with parametrized messages. The call to the logger can take a few microseconds, whereas any reasonable high-performance collection will enqueue in something like 100ns under low contention (and the saturation rate of the underlying appender will often happen way before achieving high contention on the queue). > Add optional support for Conversant DisruptorBlockingQueue in AsyncAppender > --------------------------------------------------------------------------- > > Key: LOG4J2-1430 > URL: https://issues.apache.org/jira/browse/LOG4J2-1430 > Project: Log4j 2 > Issue Type: New Feature > Components: Appenders > Affects Versions: 2.6.1 > Reporter: Matt Sicker > Assignee: Matt Sicker > Fix For: 2.7 > > Attachments: AsyncAppenderPerf01.txt, > log4j2-1430-jctools-tmp-patch.txt > > > [Conversant Disruptor|https://github.com/conversant/disruptor] works as an > implementation of BlockingQueue that is much faster than ArrayBlockingQueue. > I did some benchmarks earlier and found it to be a bit faster: > h3. AsyncAppender/ArrayBlockingQueue > {code} > Benchmark Mode Samples > Score Error Units > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params thrpt 20 > 1101267.173 ± 17583.204 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params thrpt 20 > 1128269.255 ± 12188.910 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param thrpt 20 > 1525470.805 ± 56515.933 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params thrpt 20 > 1789434.196 ± 42733.475 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params thrpt 20 > 1803276.278 ± 34938.176 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params thrpt 20 > 1468550.776 ± 26402.286 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params thrpt 20 > 1322304.349 ± 22417.997 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params thrpt 20 > 1179756.489 ± 16502.276 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params thrpt 20 > 1324660.677 ± 18893.944 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params thrpt 20 > 1309365.962 ± 19602.489 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params thrpt 20 > 1422144.180 ± 20815.042 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple thrpt 20 > 1247862.372 ± 18300.764 ops/s > {code} > h3. AsyncAppender/DisruptorBlockingQueue > {code} > Benchmark Mode Samples > Score Error Units > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params thrpt 20 > 3704735.586 ± 59766.253 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params thrpt 20 > 3622175.410 ± 31975.353 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param thrpt 20 > 6862480.428 ± 121473.276 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params thrpt 20 > 6193288.988 ± 93545.144 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params thrpt 20 > 5715621.712 ± 131878.581 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params thrpt 20 > 5745187.005 ± 213854.016 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params thrpt 20 > 5307137.396 ± 88135.709 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params thrpt 20 > 4953015.419 ± 72100.403 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params thrpt 20 > 4833836.418 ± 52919.314 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params thrpt 20 > 4353791.507 ± 79047.812 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params thrpt 20 > 4136761.624 ± 67804.253 ops/s > o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple thrpt 20 > 6719456.722 ± 187433.301 ops/s > {code} > h3. AsyncLogger > {code} > Benchmark Mode Samples > Score Error Units > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput10Params thrpt 20 > 5075883.371 ± 180465.316 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput11Params thrpt 20 > 4867362.030 ± 193909.465 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput1Param thrpt 20 > 10294733.024 ± 226536.965 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput2Params thrpt 20 > 9021650.667 ± 351102.255 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput3Params thrpt 20 > 8079337.905 ± 115824.975 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput4Params thrpt 20 > 7347356.788 ± 66598.738 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput5Params thrpt 20 > 6930636.174 ± 150072.908 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput6Params thrpt 20 > 6309567.300 ± 293709.787 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput7Params thrpt 20 > 6051997.196 ± 268405.087 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput8Params thrpt 20 > 5273376.623 ± 99168.461 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughput9Params thrpt 20 > 5091137.594 ± 150617.444 ops/s > o.a.l.l.p.j.AsyncLoggersBenchmark.throughputSimple thrpt 20 > 11136623.731 ± 400350.272 ops/s > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: log4j-dev-unsubscr...@logging.apache.org For additional commands, e-mail: log4j-dev-h...@logging.apache.org