Re: Garbage-free Log4j docs preview

Remko Popma Mon, 23 May 2016 05:39:27 -0700


On Monday, May 23, 2016 at 2:47:05 AM UTC+9, Benedict Elliott Smith wrote:
>
> Your pride should be saved by your gracious recognition of the error :)
>
>
> However this is the core of my point: you’re clearly an intelligent, 
> conscientious and talented developer, and yet you did not fully understand 
> these concerns.  You are writing core infrastructure many people will 
> deploy, and a majority will understand it much much less well.  So the very 
> best thing you can do is to educate them.  Doubly so if you needed 
> educating yourself!
>
;-)


>
> I would make it very clear that users should only consider asynchronous 
> logging if they are having performance problems caused by their logging.  
> Only with sub-millisecond latency requirements should async be considered, 
> and below 100 micros you are entering busy-spinning territory.  Or if huge 
> message rates are being seen.  The best default is a blocking strategy, as 
> this has the closest performance profile to synchronous, and will prevent 
> the many users who deploy this without needing it from suffering any major 
> ill-effect.
>
I think I missed a step here. You made it plausible that under low 
workload, the Disruptor may not be better than ArrayBlockingQueue. 
Understood.

But I don't understand why users should only configure async logging in 
rare cases? Synchronous logging also involves holding a lock. Log4j 2 makes 
an effort to release this lock as soon as possible (locking only while 
copying a byte array from a thread-local buffer to the I/O device), but as 
Martin pointed out, in addition to the lock contention, other things may 
cause response time pauses when writing to the I/O device.

So which is "better": holding a lock to signal a 
java.util.concurrent.locks.Condition, versus holding a lock to write a byte 
array to the I/O device? The latter sounds like it has more risk of latency 
spikes. So I'm not sure I agree with your assessment that async logging has 
more risk of major ill effects.

Under low loads the upside of Async Loggers may not be as high as the graph 
suggests but I don't see any real downsides either. Please let me know if 
I'm missing something.
 

>
> I would also do a performance comparison with a real application.  
> Cassandra uses log4j, and the community has an easily utilised performance 
> tool.  You could try both DEBUG and INFO level, and put up some graphs.  
> This would help users make a more realistic and informed decision.
>
I agree it may be interesting for users to see a case study of some sort 
with a real application. Can you point me to the tool you are referring to?
 

>
> This may all seem a little disappointing: to *reduce* the number of 
> people who might use your amazing new technology.  But there will be 
> plenty, who both need it and do not and don't read or heed the warnings.
>
That's okay. It is garbage-free logging that is new in the 2.6 release, not 
the Async Loggers. The Disruptor-based Async Loggers have been used in the 
wild for about three years now. The response has generally been positive so 
I'm not panicking just yet. 
I do intend to follow up though and I take your point that under low load 
this technology may not give as much benefit as under high load. (Again, 
please correct me if there is a bigger problem and I'm missing any other 
downsides/risks.)
 

>
> Regarding improvements: In my opinion, a better algorithm for this 
> majority, of low-to-moderate traffic that is not extremely latency 
> sensitive, is semi-synchronous.  By which I mean for the first log call in 
> a batch to take exclusive ownership of the output and to write its message 
> synchronously; any messages that arrive while it has ownership are passed 
> to the owner to persist, i.e. they are written asynchronously.  To ensure 
> any thread’s ownership is brief, it would impose a time-limit, after which 
> a dedicated consumer could be started if the queue is still not exhausted.  
> This would have better than synchronous behaviour for low-to-moderate 
> traffic and scale gracefully to high traffic.
>
This is really interesting. It reminds me of one of the co-operative 
algorithms I read about in the Art of Multiprocessor Programming (I tried 
to look it up but could not find it last night).

It may be an idea to enhance the Disruptor for lower workloads: I guess the 
question is whether it is possible to incorporate a cheap check before 
calling waitStrategy.signalAllWhenBlocking() in MultiProducerSequence, and 
avoid calling that method if another thread is already signalling the 
consumer thread.
...and it turns out this already exists and is called 
LiteBlockingWaitStrategy. Good stuff! Perhaps all that is needed is combine 
the TimeoutBlockingWaitStrategy with this one and we can improve things 
nicely for lower workloads.

Does anyone on this list have any experience with LiteBlockingWaitStrategy? 
I see it was used in projectreactor.io, would be interested to hear how 
that worked, either on list of off-list. 

>
>
> On Sunday, 22 May 2016 16:40:21 UTC+1, Remko Popma wrote:
>>
>>
>>
>> On Sunday, May 22, 2016 at 11:08:37 PM UTC+9, Benedict Elliott Smith 
>> wrote:
>>>
>>> It's possible I'm missing some other aspect that insulates the 
>>> application thread from the disruptor, but from your message it seems to me 
>>> that you fall into the all-too-common category of people who do not fully 
>>> understand the semantics of the disruptor.  Application threads (i.e. 
>>> disruptor producers) very much *are* affected by the wait strategy.  
>>> Under normal use, *if the disruptor consumer thread is employing a 
>>> blocking strategy, there will be futex call to unblock it when work is 
>>> provided*.  i.e. whenever the queue transitions from empty to 
>>> non-empty, i.e. for most messages with low-to-medium logging rates.
>>>
>>> In this situation the enqueueing of a message is also* not lock free*, 
>>> as while futexes can in theory be lock-free, the JVM uses a lock to 
>>> synchronise access to the semaphore that controls its runnable state, so 
>>> all threads attempting to wake it up concurrently will compete for this 
>>> lock.
>>>
>>
>> Well... Looks like you were right and I was wrong. 
>>
>> Interesting. Apart from the dent in my pride here, this is actually good 
>> news. It means we can do better!
>>
>> So in the worst-case scenario Async Loggers/Disruptor with the blocking 
>> wait strategy give performance equivalent to AsyncAppender with a 
>> BlockingArrayQueue. We can expect Async Loggers with Disruptor to 
>> outperform AsyncAppender with BlockingArrayQueue when workloads are high or 
>> bursts of work keep the background thread busy for some time. At least this 
>> is my understanding now. Or did I miss something and are there cases where 
>> Async Loggers with Disruptor will perform worse than AsyncAppender with 
>> BlockingArrayQueue?
>>
>> In terms of how to improve, short-term we should probably look at using 
>> the PhasedBackoffWaitStrategy as the default, although there will always be 
>> workloads that "miss the window". The trade-off is increased CPU. We need 
>> to tread carefully here because users reporting high CPU is why we changed 
>> from the sleeping wait strategy to blocking. (The default in Log4j 2.6 will 
>> be timeout blocking wait to deal with a deadlock issue in Solaris.)
>>
>> Long-term we should probably explore other data structures.
>>
>> Exciting times. Thanks for correcting me!
>>
>>
>>>
>>> On 22 May 2016 at 15:07, Benedict Elliott Smith <bene...@apache.org> 
>>> wrote:
>>>
>>>> It's possible I'm missing some other aspect that insulates the 
>>>> application thread from the disruptor, but from your message it seems to 
>>>> me 
>>>> that you fall into the all-too-common category of people who do not fully 
>>>> understand the semantics of the disruptor.  Application threads (i.e. 
>>>> disruptor producers) very much *are* affected by the wait strategy.  
>>>> Under normal use, *if the disruptor consumer thread is employing a 
>>>> blocking strategy, there will be futex call to unblock it when work is 
>>>> provided*.  i.e. whenever the queue transitions from empty to 
>>>> non-empty, i.e. for most messages with low-to-medium logging rates.
>>>>
>>>> In this situation the enqueueing of a message is also* not lock free*, 
>>>> as while futexes can in theory be lock-free, the JVM uses a lock to 
>>>> synchronise access to the semaphore that controls its runnable state, so 
>>>> all threads attempting to wake it up concurrently will compete for this 
>>>> lock.
>>>>
>>>>
>>>>
>>>> On 22 May 2016 at 14:42, Remko Popma <remko...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sunday, May 22, 2016 at 6:34:13 PM UTC+9, Benedict Elliott Smith 
>>>>> wrote:
>>>>>>
>>>>>> Hi Remko,
>>>>>>
>>>>>> I realise that I was looking at the old log4j code without realising 
>>>>>> it, in which there is always a futex involved.
>>>>>>
>>>>> Which class is this? Async Loggers in Log4j 2 have always been 
>>>>> lock-free. What may be confusing is that Log4j 2 also has an Async
>>>>> *Appender*, which does not use the Disruptor but uses a JDK 
>>>>> BlockingArrayQueue. This appender is very similar to how Logback and 
>>>>> Log4j 
>>>>> 1 do asynchronous logging. Async*Appender* (in Log4j 2, Log4j 1 and 
>>>>> Logback, they all use a blocking queue) is sensitive to lock contention 
>>>>> and 
>>>>> does not scale with more threads.
>>>>>  
>>>>>
>>>>>>
>>>>>> However the behaviour of the disruptor that backs the new code will 
>>>>>> depend on the wait strategy employed (and I'll note your default is a 
>>>>>> blocking strategy, for which my prior statement holds*).
>>>>>>
>>>>> Perhaps the key misunderstanding is here. I think I did not explain it 
>>>>> well in my previous message. Let me try again.
>>>>>
>>>>> Async Loggers use the Disruptor and are lock-free. This means that 
>>>>> multiple application threads can log concurrently and are not blocked by 
>>>>> each other. This is true for very low workloads, medium workloads and 
>>>>> high 
>>>>> workloads. 
>>>>>
>>>>> Only if an application logs messages at a very high rate for a long 
>>>>> period of time, or uses a slow appender, the Disruptor ringbuffer can 
>>>>> fill 
>>>>> up and application threads will have to wait until a slot in the 
>>>>> ringbuffer 
>>>>> becomes available. For the 2.6 release I have added text to the 
>>>>> Trade-offs section 
>>>>> <http://home.apache.org/~rpopma/log4j/2.6/manual/async.html#Trade-offs> 
>>>>> to explain this last point, based on your feedback.
>>>>>
>>>>> Note that the Disruptor wait strategy is irrelevant to all this. 
>>>>> Application threads are *not *impacted by the wait strategy.
>>>>>
>>>>>
>>>>>> Here we obviously get into the territory of people needing to 
>>>>>> understand the disruptor as well as your logging framework, but since 
>>>>>> you 
>>>>>> have a whole section labelled "Trade-offs" in which you draw attention 
>>>>>> to 
>>>>>> the improved throughput and latency profile under "Benefits," under 
>>>>>> "Drawbacks" it might be nice to mention these possible confounders.  Or 
>>>>>> at 
>>>>>> least raise a bat signal to go and understand the tradeoffs for the 
>>>>>> disruptor (which is also regrettably poorly understood).  
>>>>>>
>>>>>> It might also be helpful to explicitly call out the configuration 
>>>>>> used for your benchmarks, 
>>>>>>
>>>>> Tests on the 2.6 performance page 
>>>>> <http://home.apache.org/~rpopma/log4j/2.6/performance.html> are 
>>>>> mostly done with JMH benchmarks where parameters are in the javadoc.
>>>>> Tests on the Async Loggers page are partly with the new ResponseTimeTest 
>>>>>
>>>>> <https://github.com/apache/logging-log4j2/blob/6ddc0250cb6ebbf5f05d3026a976bc8ca0a8f162/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/ResponseTimeTest.java>class
>>>>>  
>>>>> (params in the doc 
>>>>> <http://home.apache.org/~rpopma/log4j/2.6/manual/async.html#Latency>) 
>>>>> and partly still shows results from the older PerfTestDriver 
>>>>> <https://github.com/apache/logging-log4j2/blob/6ddc0250cb6ebbf5f05d3026a976bc8ca0a8f162/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/PerfTestDriver.java>
>>>>>  
>>>>> class. This last one has parameters in the code. You'll need to spend 
>>>>> some 
>>>>> time to get familiar with it if you want to run it.
>>>>>
>>>>> and perhaps to run a comparison against some real software - github 
>>>>>> does not lack for applications using log4j!  Unfortunately I bet this 
>>>>>> would 
>>>>>> result in a much less sexy graph.
>>>>>>
>>>>> Log4j 2 has had lock-free Async Loggers from its 2.0-beta5 release 
>>>>> <http://logging.apache.org/log4j/2.x/changes-report.html#a2.0-beta5>, 
>>>>> three years ago. If there are any serious performance drawbacks like you 
>>>>> seem to think there are I hope I would have heard of them. Fortunately 
>>>>> the 
>>>>> opposite is true. People report very positive experiences 
>>>>> <http://tech.finn.no/2014/07/01/log4j2-in-production-making-it-fly/>. 
>>>>> People who did their own measurements reported it makes their logging as 
>>>>> fast as if it was switched off 
>>>>> <http://blogs.mulesoft.com/dev/mule-dev/mule-3-6-asynchronous-logging/>
>>>>> .
>>>>>
>>>>>
>>>>>> Finally, I really think people should (in your wiki page) explicitly 
>>>>>> be *discouraged* from using this part of your framework *unless they 
>>>>>> know they need it*.  It's a fantastic piece of work for people who 
>>>>>> *do* need it.  
>>>>>>
>>>>> Again, I think you are laboring under a misapprehension. I cannot 
>>>>> imagine a scenario where lock-free logging would be a bad thing. If you 
>>>>> think there is, please show evidence.
>>>>>  
>>>>>
>>>>>> But 99.99% of people would be better off with a logger that just 
>>>>>> avoids blocking threads when one is already in the process of logging. 
>>>>>>
>>>>> Avoiding blocking is exactly what the Async Loggers are for. (So I 
>>>>> really think there is some misunderstanding somewhere.)
>>>>> Based on your feedback, I've updated 
>>>>> <http://home.apache.org/~rpopma/log4j/2.6/manual/async.html> the side 
>>>>> nav menu and page title for the 2.6 release to clarify that Async Loggers 
>>>>> are lock-free. I hope this will help avoid misunderstandings.
>>>>>  
>>>>>
>>>>>>
>>>>>> (*It looks like you silently ignore misconfiguration, and default to 
>>>>>> timeout wait strategy also, which is probably going to surprise some 
>>>>>> people)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19 May 2016 at 00:34, Remko Popma <remko...@gmail.com> wrote:
>>>>>>
>>>>>>> Benedict, thanks for your feedback!
>>>>>>>
>>>>>>> With regards to filtering, global filters are already considered 
>>>>>>> before the event is enqueued. Filters configured on the Logger and 
>>>>>>> AppenderRef are applied prior to enqueueing with mixed Sync/Async 
>>>>>>> <http://logging.apache.org/log4j/2.x/manual/async.html#MixedSync-Async> 
>>>>>>> Loggers but when all loggers are async 
>>>>>>> <http://logging.apache.org/log4j/2.x/manual/async.html#AllAsync> 
>>>>>>> these filters are applied by the background thread. We can probably 
>>>>>>> improve 
>>>>>>> this, thanks for the suggestion!
>>>>>>>
>>>>>>> Your suggestion to run performance tests under lower loads is 
>>>>>>> reasonable and we will look at this for a future release.
>>>>>>> I did experiment with this a little a while back and did see larger 
>>>>>>> response time pauses. Perhaps others with more experience can chime in.
>>>>>>>
>>>>>>> My understanding of how the Disruptor works is that enqueueing the 
>>>>>>> event is lock-free, so applications threads doing the logging should 
>>>>>>> not 
>>>>>>> experience any context switches or suffer latency from Futex calls 
>>>>>>> caused 
>>>>>>> by logging (regardless of the workload). The background thread is 
>>>>>>> another 
>>>>>>> story. Under moderate to low workloads the background thread may 
>>>>>>> (depending 
>>>>>>> on the wait policy) fall asleep and experience delays before waking up 
>>>>>>> when 
>>>>>>> work arrives. However, if there are enough cores to serve all threads I 
>>>>>>> don't see how such background thread delays can impact (cause response 
>>>>>>> time 
>>>>>>> pauses for) the application that is doing the logging. Please correct 
>>>>>>> me if 
>>>>>>> my understanding is incorrect.
>>>>>>>
>>>>>>> My thinking is that using async loggers is good for reactive 
>>>>>>> applications that need to respond quickly to external events. It is 
>>>>>>> especially useful if the application needs to deal with occasional 
>>>>>>> bursts 
>>>>>>> of work (and accompanied bursts of logging). This is where async 
>>>>>>> loggers 
>>>>>>> can deliver value even if the normal workload is low.
>>>>>>>
>>>>>>> Remko
>>>>>>>
>>>>>>> On Wednesday, May 18, 2016 at 2:56:18 AM UTC+9, Benedict Elliott 
>>>>>>> Smith wrote:
>>>>>>>>
>>>>>>>> Regrettably it seems impossible (or at least very annoying) to send 
>>>>>>>> to both lists simultaneously...
>>>>>>>>
>>>>>>>> On 17 May 2016 at 18:55, Benedict Elliott Smith <bene...@apache.org
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Could I suggest that you run tests for latency and throughput 
>>>>>>>>> effects of using this in systems with only moderate-to-low numbers of 
>>>>>>>>> logging calls?  (i.e. the vast majority of systems!)
>>>>>>>>>
>>>>>>>>
>>>>>>>>> It's very likely that in such systems messages will not reach a 
>>>>>>>>> velocity to keep the Dispatcher running, and so logging calls may 
>>>>>>>>> often (or 
>>>>>>>>> always) involve a Futex call - which is not a trivial cost.  There 
>>>>>>>>> will 
>>>>>>>>> almost certainly be systems out there with anti-ideal characteristics 
>>>>>>>>> - 
>>>>>>>>> logging just often enough that these costs materially impact 
>>>>>>>>> throughput, 
>>>>>>>>> but not often enough that they suddenly disappear.  
>>>>>>>>>
>>>>>>>>> Even though a majority of systems *do not need this*, because it 
>>>>>>>>> "async" and the new hotness, and there are no advertised downsides, 
>>>>>>>>> everyone will try to use it.  It's up to those who know better to 
>>>>>>>>> make sure 
>>>>>>>>> these people are informed it isn't a free lunch.  Making sure all of 
>>>>>>>>> the 
>>>>>>>>> caveats are reported on the advertising page would be a great start 
>>>>>>>>> IMO.
>>>>>>>>>
>>>>>>>>> Might I also separately suggest you consider filtering events 
>>>>>>>>> prior to placing them on the queue for processing by the dispatcher?  
>>>>>>>>> I've 
>>>>>>>>> only briefly glanced at the code, but it seems not to be the case 
>>>>>>>>> currently.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 17 May 2016 at 18:33, Martin Thompson <mjp...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Remko,
>>>>>>>>>>>
>>>>>>>>>>> I'd just like to say that it is a great service to the community 
>>>>>>>>>>> as a whole that someone is seriously looking at improving logging.
>>>>>>>>>>>
>>>>>>>>>>> If you keep it up you'll be putting folks like me out of a job :)
>>>>>>>>>>>
>>>>>>>>>>> Martin...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 17 May 2016 at 18:13, Remko Popma <remko...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> First, my thanks to the many people who gave helpful advice and 
>>>>>>>>>>>> feedback on how to measure Log4j response time on this list some 
>>>>>>>>>>>> time ago.
>>>>>>>>>>>>
>>>>>>>>>>>> We're about to start the Log4j 2.6 release.
>>>>>>>>>>>> If anyone is interested, a preview of the garbage-free logging 
>>>>>>>>>>>> manual page is here: 
>>>>>>>>>>>> http://home.apache.org/~rpopma/log4j/2.6/manual/garbagefree.html
>>>>>>>>>>>> and a preview of the updated performance page is here: 
>>>>>>>>>>>> http://home.apache.org/~rpopma/log4j/2.6/performance.html
>>>>>>>>>>>>
>>>>>>>>>>>> Feedback welcome!
>>>>>>>>>>>>
>>>>>>>>>>>> Remko
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>> Google Groups "mechanical-sympathy" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>> it, send an email to 
>>>>>>>>>>>> mechanical-sympathy+unsubscr...@googlegroups.com.
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>> Google Groups "mechanical-sympathy" group.
>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>> it, send an email to 
>>>>>>>>>>> mechanical-sympathy+unsubscr...@googlegroups.com.
>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "mechanical-sympathy" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: log4j-dev-unsubscr...@logging.apache.org
For additional commands, e-mail: log4j-dev-h...@logging.apache.org

Re: Garbage-free Log4j docs preview

Reply via email to