[ https://issues.apache.org/jira/browse/LOG4J2-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340369#comment-15340369 ]

Anthony Maire edited comment on LOG4J2-1430 at 6/20/16 8:52 PM:
----------------------------------------------------------------

I ran further tests today, and while I still believe that the Conversant 
Disruptor is not the best alternative because of the coupling between the 
slot-claiming order and the data-publishing order, the main issue is not there.

These throughput benchmarks really measure only one thing: how much the 
consumer thread is impacted by the wait strategy. As soon as the strategy does 
not use condition variables, the longer the producer threads are paused, the 
better the throughput will be.

As I said in a previous comment, even with 24 or 36 threads, only 12 
hyperthreads were busy with the JCTools queue under a LockSupport.parkNanos 
wait strategy. So most of the time, the queue is full. I tried to play with the 
implementation of the wait strategy in my JCTools blocking wrapper, using 
something closer to what the LMAX or Conversant Disruptor does: some spin/yield 
iterations before falling back to park.
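The kind of hybrid strategy described above can be sketched as follows. This is an illustrative toy, not the actual wrapper code; the class name, thresholds, and `awaitTrue` helper are all hypothetical:

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

public final class HybridWait {
    // Illustrative thresholds; real implementations tune these.
    private static final int SPIN_TRIES  = 100;
    private static final int YIELD_TRIES = 10;

    /** Busy-spin, then yield, then park until the condition holds. */
    public static void awaitTrue(BooleanSupplier condition) {
        int spins = 0;
        while (!condition.getAsBoolean()) {
            if (spins < SPIN_TRIES) {
                Thread.onSpinWait();          // cheap CPU hint (Java 9+)
            } else if (spins < SPIN_TRIES + YIELD_TRIES) {
                Thread.yield();               // give up the time slice
            } else {
                LockSupport.parkNanos(1L);    // real pause; granularity is OS-dependent
            }
            spins++;
        }
    }

    public static void main(String[] args) {
        // Condition becomes true after ~1 ms, forcing all three phases.
        long deadline = System.nanoTime() + 1_000_000L;
        awaitTrue(() -> System.nanoTime() >= deadline);
        System.out.println("condition observed");
    }
}
```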

The result is not what I expected: the yield loop made the JCTools queue 
throughput lower! I think it can be explained by false sharing: when the queue 
is nearly full, the producers write to slots that are adjacent to the one 
being read by the consumer. Every time an event is inserted into the queue, 
that cache line is evicted from the consumer core's L1 cache, causing a cache 
miss. If the producers are paused for longer, the consumer can append events 
without these cache misses for a longer stretch, and the throughput is better.
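The false-sharing effect itself is easy to reproduce in isolation. The sketch below (hypothetical class names, not a rigorous benchmark) times two threads hammering two `long` fields that likely share a 64-byte cache line versus two fields padded apart; on most x86 hardware the padded layout is markedly faster, though exact numbers vary by CPU:

```java
public final class FalseSharingSketch {
    // Both counters likely sit on one cache line: a write by one thread
    // invalidates the line in the other thread's L1 cache.
    static final class Contended {
        volatile long a;
        volatile long b;
    }
    // Seven longs of padding push 'b' onto a different cache line than 'a'.
    static final class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // padding, never read
        volatile long b;
    }

    /** Runs both writers concurrently and returns elapsed nanoseconds. */
    static long run(Runnable w1, Runnable w2) {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        try { t1.join(); t2.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int N = 5_000_000;
        Contended c = new Contended();
        long shared = run(() -> { for (int i = 0; i < N; i++) c.a++; },
                          () -> { for (int i = 0; i < N; i++) c.b++; });
        Padded p = new Padded();
        long padded = run(() -> { for (int i = 0; i < N; i++) p.a++; },
                          () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.println("same line: " + shared / 1_000_000 + " ms, padded: "
                + padded / 1_000_000 + " ms");
    }
}
```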

That's why the Conversant Disruptor performed much better under Windows than 
under Linux: the pause time of parkNanos is about 20 times longer on Windows, 
and that was enough to mask the negative impact of the spin/yield loop in the 
wait strategy.
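The OS-dependent granularity of parkNanos is easy to observe directly: ask for a 1 ns pause and measure what you actually get. A small sketch (illustrative class name; figures vary with OS and timer configuration, but on Linux the minimum pause is typically tens of microseconds and on Windows considerably longer):

```java
import java.util.concurrent.locks.LockSupport;

public final class ParkGranularity {
    public static void main(String[] args) {
        // Warm up so JIT compilation does not dominate the measurement.
        for (int i = 0; i < 1_000; i++) LockSupport.parkNanos(1L);

        long min = Long.MAX_VALUE;
        for (int i = 0; i < 1_000; i++) {
            long t0 = System.nanoTime();
            LockSupport.parkNanos(1L);   // request 1 ns ...
            min = Math.min(min, System.nanoTime() - t0); // ... see what we got
        }
        System.out.println("min observed park: " + min + " ns");
    }
}
```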

The main question with these benchmarks is: what do we want to measure, and 
how can we say that one implementation is more "performant" than another? 
Throughput with a no-op appender is maybe not the best metric: in real 
applications there will be real filters and appenders that cannot sustain such 
a throughput, so we are measuring something that does not exist in the real 
world.

In my opinion, there are two use cases where an async appender is useful:
- when we need to increase the maximal throughput of the underlying appender 
thanks to batching (cf. the RandomAccessFile appender with immediateFlush = 
false); the queue implementation has little impact on this
- when we need to keep latency low in the application thread under a realistic 
load (i.e. we don't expect the queue to always be nearly full; if that does 
happen, another part of the system needs improvement)
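For the first use case, the batching effect is purely a matter of configuration. A minimal illustrative snippet (appender names are placeholders) wiring an AsyncAppender in front of a RandomAccessFile appender with immediateFlush disabled, so the background thread flushes in batches:

```xml
<Configuration>
  <Appenders>
    <!-- immediateFlush="false" lets the background thread batch disk writes -->
    <RandomAccessFile name="File" fileName="logs/app.log" immediateFlush="false">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    </RandomAccessFile>
    <Async name="Async">
      <AppenderRef ref="File"/>
    </Async>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Async"/>
    </Root>
  </Loggers>
</Configuration>
```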

As Remko said, Log4j is used in a lot of situations that we cannot control. 
Saying something like "you shouldn't have more active threads than cores in 
your app" is not a solution: if the server is shared with other applications, 
it will not be enough. Being lock-free on the producer side (as long as the 
queue is not full) is a mandatory characteristic in my opinion.
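To make concrete what "lock-free on the producer side" means, here is a deliberately minimal toy (all names hypothetical, not a proposal for Log4j): producers claim slots with a CAS instead of taking a lock, so a preempted producer never blocks the others. Note that this toy also exhibits the claim-order/publish-order coupling criticized in the first paragraph: the single consumer must wait at a slot that has been claimed but not yet published.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

public final class LockFreeOfferSketch<E> {
    private final AtomicReferenceArray<E> slots;
    private final AtomicLong tail = new AtomicLong(); // claimed by producers
    private final AtomicLong head = new AtomicLong(); // advanced by one consumer
    private final int mask;

    public LockFreeOfferSketch(int capacityPow2) {
        slots = new AtomicReferenceArray<>(capacityPow2);
        mask = capacityPow2 - 1;
    }

    /** Lock-free multi-producer offer: returns false when the queue is full. */
    public boolean offer(E e) {
        for (;;) {
            long t = tail.get();
            if (t - head.get() >= slots.length()) return false; // full
            if (tail.compareAndSet(t, t + 1)) {   // claim slot t by CAS, no lock
                slots.set((int) (t & mask), e);   // publish the element
                return true;
            }
            // CAS lost: another producer claimed slot t; retry, never block.
        }
    }

    /** Single-consumer poll; spins if a claimed slot is not yet published. */
    public E poll() {
        long h = head.get();
        if (h >= tail.get()) return null;         // empty
        int idx = (int) (h & mask);
        E e;
        while ((e = slots.get(idx)) == null) { Thread.onSpinWait(); }
        slots.set(idx, null);                     // release the slot
        head.set(h + 1);
        return e;
    }
}
```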

The intrinsic latency of the queue is often not the main characteristic: there 
are lots of things done in the application thread before enqueuing, especially 
with parameterized messages. The call to the logger can take a few 
microseconds, whereas any reasonable high-performance collection will enqueue 
in something like 100 ns under low contention (and the underlying appender 
will usually saturate well before the queue sees high contention).
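The order of magnitude of an uncontended enqueue is easy to check even without JMH. A rough single-threaded sketch (illustrative class name; it measures offer+poll pairs on a plain ArrayBlockingQueue, so treat the result as an order of magnitude only):

```java
import java.util.concurrent.ArrayBlockingQueue;

public final class EnqueueCost {
    public static void main(String[] args) {
        ArrayBlockingQueue<String> q = new ArrayBlockingQueue<>(1 << 16);

        // Warm-up so the JIT compiles the hot path first.
        for (int i = 0; i < 100_000; i++) { q.offer("x"); q.poll(); }

        final int N = 1_000_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) { q.offer("x"); q.poll(); }
        long perOp = (System.nanoTime() - t0) / N;

        System.out.println("~" + perOp + " ns per offer+poll");
    }
}
```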



> Add optional support for Conversant DisruptorBlockingQueue in AsyncAppender
> ---------------------------------------------------------------------------
>
>                 Key: LOG4J2-1430
>                 URL: https://issues.apache.org/jira/browse/LOG4J2-1430
>             Project: Log4j 2
>          Issue Type: New Feature
>          Components: Appenders
>    Affects Versions: 2.6.1
>            Reporter: Matt Sicker
>            Assignee: Matt Sicker
>             Fix For: 2.7
>
>         Attachments: AsyncAppenderPerf01.txt, 
> log4j2-1430-jctools-tmp-patch.txt
>
>
> [Conversant Disruptor|https://github.com/conversant/disruptor] works as an 
> implementation of BlockingQueue that is much faster than ArrayBlockingQueue. 
> I did some benchmarks earlier and found it to be a bit faster:
> h3. AsyncAppender/ArrayBlockingQueue
> {code}
> Benchmark                                                     Mode  Samples        Score       Error  Units
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params  thrpt       20  1101267.173 ± 17583.204  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params  thrpt       20  1128269.255 ± 12188.910  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param    thrpt       20  1525470.805 ± 56515.933  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params   thrpt       20  1789434.196 ± 42733.475  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params   thrpt       20  1803276.278 ± 34938.176  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params   thrpt       20  1468550.776 ± 26402.286  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params   thrpt       20  1322304.349 ± 22417.997  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params   thrpt       20  1179756.489 ± 16502.276  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params   thrpt       20  1324660.677 ± 18893.944  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params   thrpt       20  1309365.962 ± 19602.489  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params   thrpt       20  1422144.180 ± 20815.042  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    thrpt       20  1247862.372 ± 18300.764  ops/s
> {code}
> h3. AsyncAppender/DisruptorBlockingQueue
> {code}
> Benchmark                                                     Mode  Samples        Score        Error  Units
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params  thrpt       20  3704735.586 ±  59766.253  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params  thrpt       20  3622175.410 ±  31975.353  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param    thrpt       20  6862480.428 ± 121473.276  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params   thrpt       20  6193288.988 ±  93545.144  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params   thrpt       20  5715621.712 ± 131878.581  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params   thrpt       20  5745187.005 ± 213854.016  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params   thrpt       20  5307137.396 ±  88135.709  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params   thrpt       20  4953015.419 ±  72100.403  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params   thrpt       20  4833836.418 ±  52919.314  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params   thrpt       20  4353791.507 ±  79047.812  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params   thrpt       20  4136761.624 ±  67804.253  ops/s
> o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    thrpt       20  6719456.722 ± 187433.301  ops/s
> {code}
> h3. AsyncLogger
> {code}
> Benchmark                                              Mode  Samples         Score        Error  Units
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput10Params  thrpt       20   5075883.371 ± 180465.316  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput11Params  thrpt       20   4867362.030 ± 193909.465  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput1Param    thrpt       20  10294733.024 ± 226536.965  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput2Params   thrpt       20   9021650.667 ± 351102.255  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput3Params   thrpt       20   8079337.905 ± 115824.975  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput4Params   thrpt       20   7347356.788 ±  66598.738  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput5Params   thrpt       20   6930636.174 ± 150072.908  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput6Params   thrpt       20   6309567.300 ± 293709.787  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput7Params   thrpt       20   6051997.196 ± 268405.087  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput8Params   thrpt       20   5273376.623 ±  99168.461  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughput9Params   thrpt       20   5091137.594 ± 150617.444  ops/s
> o.a.l.l.p.j.AsyncLoggersBenchmark.throughputSimple    thrpt       20  11136623.731 ± 400350.272  ops/s
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
