[
https://issues.apache.org/jira/browse/LOG4J2-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340369#comment-15340369
]
Anthony Maire edited comment on LOG4J2-1430 at 6/20/16 8:52 PM:
----------------------------------------------------------------
I made further tests today, and while I still believe that the Conversant
Disruptor is not the best alternative because of the coupling between the
slot claiming order and the data publishing order, the main issue is not there.
These throughput benchmarks only measure one thing: how much the consumer
thread is impacted by the wait strategy. And as soon as the strategy does not
use condition variables, the more the producer threads are paused, the better
the throughput will be.
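To illustrate the claim/publish coupling (a minimal sketch with hypothetical names, not Conversant's actual code): in a multi-producer ring where slots must be published in the order they were claimed, a producer that is descheduled between claiming and publishing stalls every producer that claimed after it.
{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch: multi-producer ring buffer where publication must follow claim order.
final class OrderedRing {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong claimCursor = new AtomicLong();   // next slot to claim
    private final AtomicLong publishCursor = new AtomicLong(); // next slot allowed to publish

    OrderedRing(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    void put(Object e) {                               // capacity check omitted for brevity
        long seq = claimCursor.getAndIncrement();      // claim a slot
        slots[(int) (seq & mask)] = e;                 // write the data
        while (publishCursor.get() != seq) {           // wait for all earlier claimers: if one
            Thread.yield();                            // of them is descheduled right here,
        }                                              // every later producer is stuck too
        publishCursor.lazySet(seq + 1);                // publish in claim order
    }
}
{code}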
As I said in a previous comment, even with 24 or 36 threads, only 12
hyperthreads were busy with the JCTools queue and a LockSupport.parkNanos wait
strategy. So most of the time, the queue is full. I tried to change the
implementation of the wait strategy in my JCTools blocking wrapper to
something closer to the one used in the LMAX or Conversant Disruptor, with
some spin/yield before falling back to park (see the sketch below).
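A minimal sketch of that kind of back-off (the thresholds are illustrative, not the values from my actual wrapper):
{code}
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Sketch: spin, then yield, then park, re-checking the condition each round.
final class SpinYieldParkWait {
    private static final int SPIN_TRIES = 100;
    private static final int YIELD_TRIES = 100;

    // 'condition' is whatever we wait for, e.g. "the queue has free space"
    static void awaitUntil(BooleanSupplier condition) {
        int counter = 0;
        while (!condition.getAsBoolean()) {
            if (counter < SPIN_TRIES) {
                counter++;                    // busy-spin first
            } else if (counter < SPIN_TRIES + YIELD_TRIES) {
                counter++;
                Thread.yield();               // then give up the time slice
            } else {
                LockSupport.parkNanos(1L);    // finally park; the effective pause is
            }                                 // OS-dependent (much longer on Windows)
        }
    }
}
{code}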
The result is not what I expected: the yield loop made the JCTools queue
throughput lower! I think it can be explained by false sharing: when the
queue is nearly full, the producers write to slots that are adjacent to the
one being read by the consumer. Every time an event is inserted into the
queue, that cache line is evicted from the consumer core's L1 cache, causing
a cache miss. If producers are paused for a longer time, then the consumer
can append events without these cache misses for longer, and the throughput
is better.
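The effect is easy to reproduce in isolation. A minimal demo (illustrative, not the benchmark code), assuming 8-byte longs and 64-byte cache lines, so slots 16 positions apart are guaranteed to land on different lines:
{code}
import java.util.concurrent.atomic.AtomicLongArray;

// Two threads hammer either adjacent slots (same cache line, false sharing)
// or slots a full cache line apart; the adjacent case runs noticeably slower.
public final class FalseSharingDemo {
    static final AtomicLongArray cells = new AtomicLongArray(17);
    static final long ITERATIONS = 50_000_000L;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("adjacent slots: " + run(0, 1) + " ms");
        System.out.println("padded slots  : " + run(0, 16) + " ms");
    }

    static long run(int a, int b) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) cells.incrementAndGet(a); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) cells.incrementAndGet(b); });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
{code}
The same mechanism applies to the slot array of a nearly-full queue: head and tail are only a few slots apart, so producer stores keep invalidating the line the consumer is reading.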
That's why the Conversant Disruptor was performing way better under Windows
than under Linux: the pause time of parkNanos is 20 times longer on Windows,
and that was enough to avoid the negative impact of the spin/yield loop in
the wait strategy.
The main question with these benchmarks is: what do we want to measure, and
how can we say that one implementation is more "performant" than another?
Throughput with a no-op appender is maybe not the best metric: in real
applications, there will be real filters and appenders that cannot sustain
such a throughput, so we are measuring something that does not exist in the
real world.
In my opinion, there are 2 use cases where an async appender is useful:
- when we need to increase the maximal throughput of the underlying appender
thanks to batching (cf. the RandomAccessFile appender with immediateFlush =
false; see the example configuration after this list). The queue
implementation has little impact on this
- when we need to keep latency low in the application thread under a
realistic load (i.e. we don't expect the queue to always be nearly full; if
that does happen, then another part of the system needs improvement)
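For the batching use case, a configuration would look something like this (illustrative snippet; file name and layout are placeholders):
{code}
<Configuration>
  <Appenders>
    <!-- The underlying appender buffers writes: with immediateFlush="false"
         the buffer is flushed when full or when the async queue drains,
         so many events are written per system call -->
    <RandomAccessFile name="File" fileName="logs/app.log" immediateFlush="false">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    </RandomAccessFile>
    <!-- The AsyncAppender decouples application threads from the disk write -->
    <Async name="Async">
      <AppenderRef ref="File"/>
    </Async>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Async"/>
    </Root>
  </Loggers>
</Configuration>
{code}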
As Remko said, Log4j is used in a lot of situations that we cannot control.
And saying something like "you shouldn't have more active threads than cores
in your app" is not a solution: if the server is used for other apps, it will
not be enough. Being lock-free on the producer side (as long as the queue is
not full) is a mandatory characteristic in my opinion.
The intrinsic latency of the queue is often not the main characteristic:
there are lots of things done in the application thread before enqueuing,
especially with parameterized messages. The call to the logger can take a few
microseconds, whereas any reasonable high-performance collection will enqueue
in something like 100 ns under low contention (and the underlying appender
will usually saturate long before the queue sees high contention).
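A rough way to compare those two costs side by side (an illustrative JMH sketch, not the benchmark attached to this issue):
{code}
import java.util.concurrent.ArrayBlockingQueue;
import org.apache.logging.log4j.message.ParameterizedMessage;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class EnqueueVsFormatBenchmark {
    final ArrayBlockingQueue<Object> queue = new ArrayBlockingQueue<>(262144);
    final Object event = new Object();

    @Benchmark
    public String formatMessage() {
        // roughly the up-front work of a parameterized log statement
        return new ParameterizedMessage("user {} did {} at {}", "alice", "login", 12345L)
                .getFormattedMessage();
    }

    @Benchmark
    public Object offerThenPoll() {
        // single-threaded, so this measures the low-contention enqueue cost
        queue.offer(event);
        return queue.poll();
    }
}
{code}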
> Add optional support for Conversant DisruptorBlockingQueue in AsyncAppender
> ---------------------------------------------------------------------------
>
> Key: LOG4J2-1430
> URL: https://issues.apache.org/jira/browse/LOG4J2-1430
> Project: Log4j 2
> Issue Type: New Feature
> Components: Appenders
> Affects Versions: 2.6.1
> Reporter: Matt Sicker
> Assignee: Matt Sicker
> Fix For: 2.7
>
> Attachments: AsyncAppenderPerf01.txt,
> log4j2-1430-jctools-tmp-patch.txt
>
>
> [Conversant Disruptor|https://github.com/conversant/disruptor] works as an
> implementation of BlockingQueue that is much faster than ArrayBlockingQueue.
> I did some benchmarks earlier and found it to be a bit faster:
> h3. AsyncAppender/ArrayBlockingQueue
> {code}
> Benchmark                                                     Mode  Samples        Score        Error  Units
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params  thrpt       20  1101267.173 ±  17583.204  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params  thrpt       20  1128269.255 ±  12188.910  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param    thrpt       20  1525470.805 ±  56515.933  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params   thrpt       20  1789434.196 ±  42733.475  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params   thrpt       20  1803276.278 ±  34938.176  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params   thrpt       20  1468550.776 ±  26402.286  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params   thrpt       20  1322304.349 ±  22417.997  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params   thrpt       20  1179756.489 ±  16502.276  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params   thrpt       20  1324660.677 ±  18893.944  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params   thrpt       20  1309365.962 ±  19602.489  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params   thrpt       20  1422144.180 ±  20815.042  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    thrpt       20  1247862.372 ±  18300.764  ops/s
> {code}
> h3. AsyncAppender/DisruptorBlockingQueue
> {code}
> Benchmark                                                     Mode  Samples        Score        Error  Units
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params  thrpt       20  3704735.586 ±  59766.253  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params  thrpt       20  3622175.410 ±  31975.353  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param    thrpt       20  6862480.428 ± 121473.276  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params   thrpt       20  6193288.988 ±  93545.144  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params   thrpt       20  5715621.712 ± 131878.581  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params   thrpt       20  5745187.005 ± 213854.016  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params   thrpt       20  5307137.396 ±  88135.709  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params   thrpt       20  4953015.419 ±  72100.403  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params   thrpt       20  4833836.418 ±  52919.314  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params   thrpt       20  4353791.507 ±  79047.812  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params   thrpt       20  4136761.624 ±  67804.253  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    thrpt       20  6719456.722 ± 187433.301  ops/s
> {code}
> h3. AsyncLogger
> {code}
> Benchmark                                             Mode  Samples         Score        Error  Units
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput10Params  thrpt       20   5075883.371 ± 180465.316  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput11Params  thrpt       20   4867362.030 ± 193909.465  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput1Param    thrpt       20  10294733.024 ± 226536.965  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput2Params   thrpt       20   9021650.667 ± 351102.255  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput3Params   thrpt       20   8079337.905 ± 115824.975  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput4Params   thrpt       20   7347356.788 ±  66598.738  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput5Params   thrpt       20   6930636.174 ± 150072.908  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput6Params   thrpt       20   6309567.300 ± 293709.787  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput7Params   thrpt       20   6051997.196 ± 268405.087  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput8Params   thrpt       20   5273376.623 ±  99168.461  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput9Params   thrpt       20   5091137.594 ± 150617.444  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughputSimple    thrpt       20  11136623.731 ± 400350.272  ops/s
> {code}