Thanks, Remko. The default '%d' uses FixedDateFormat with FixedFormat.DEFAULT. The FastDateFormat alternative does not support microseconds, so it doesn't suffer from the same problem. I think I can substantially reduce the frequency we re-format dates by checking FixedFormat.secondFractionDigits to determine if we need to compare microseconds.
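Roughly, the idea could look like the sketch below (hypothetical class and field names for illustration; this is not Log4j's actual FixedDateFormat code): the formatted string is cached, and sub-millisecond precision is only compared when the pattern can actually print it.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: cache the formatted timestamp and only re-format when
// the printable portion of the instant changes. With three fraction digits
// (milliseconds), microsecond changes never invalidate the cache.
final class CachedTimestampFormatter {
    private final DateTimeFormatter formatter;
    private final int secondFractionDigits; // e.g. 3 for millis, 6 for micros
    private long cachedEpochMillis = Long.MIN_VALUE;
    private long cachedMicrosOfMilli = -1;
    private String cachedText;

    CachedTimestampFormatter(String pattern, int secondFractionDigits) {
        this.formatter = DateTimeFormatter.ofPattern(pattern)
                .withZone(ZoneId.systemDefault());
        this.secondFractionDigits = secondFractionDigits;
    }

    String format(Instant instant) {
        long millis = instant.toEpochMilli();
        // Only compare sub-millisecond precision when the pattern can show it.
        long microsOfMilli = secondFractionDigits > 3
                ? (instant.getNano() / 1000) % 1000
                : 0;
        if (millis != cachedEpochMillis || microsOfMilli != cachedMicrosOfMilli) {
            cachedEpochMillis = millis;
            cachedMicrosOfMilli = microsOfMilli;
            cachedText = formatter.format(instant);
        }
        return cachedText;
    }
}
```

With a millisecond-precision pattern, two instants in the same millisecond hit the cache even when their microseconds differ, which is the re-format reduction described above.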
On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote: > I remember looking at PatternLayout performance, I reported my findings here, > hopefully they’re still useful: > https://issues.apache.org/jira/browse/LOG4J2-930 > > If %d is used in the pattern, does the FixedDateFormat get used? > > > > > > On Aug 28, 2021, at 4:33, Ralph Goers <ralph.go...@dslextreme.com> wrote: > > > > All of that agrees with my observations as well. > > > > Ralph > > > >> On Aug 27, 2021, at 12:23 PM, Carter Kozak <cko...@ckozak.net> wrote: > >> > >> I've identified a few things that seem impactful, but bear in mind that I > >> haven't begun to drill down into them yet. I plan to file individual > >> tickets and investigate in greater depth later on: > >> > >> 1. Formatting the date is expensive. The process of actually formatting a > >> value is reasonable, however using a precise clock appears to cause cache > >> misses even when the pattern results in the same value > >> (microseconds/nanoseconds aren't included in the result). Using the > >> SystemMillisClock improves our throughput. I imagine some part of that > >> improvement is the result of currentTimeMillis(): long rather than > >> Clock.instant(). > >> > >> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance, > >> otherwise we see a lot of time spent in the StringBuilderEncoder, but I > >> haven't had time to investigate that yet. Without direct encoders, we are > >> allocating a great deal more byte arrays, but they're used across a small > >> space that C2 may be able to inline out. > >> > >> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern > >> '%m{nolookups}') saves us some time scanning through produced messages -- > >> I've had this disabled at work for a while. > >> > >> 4. 
[in progress] If I implement the full layout directly in Java (write > >> directly to a StringBuilder based on the event in a custom Layout instead > >> of using PatternLayout) combined with [1] and [2] above, performance is > >> much better than Logback, however using PatternLayout with the steps above > >> we fall just a bit short. I still need to read through the PatternLayout > >> and individual pattern converters to understand why, but there's > >> definitely some headroom we'll be able to reclaim hiding somewhere. > >> > >> -ck > >> > >>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote: > >>> My pleasure, I enjoy digging into this sort of performance comparison; the > >>> toughest piece right now is balancing investigation with active projects > >>> at > >>> work. I completely agree about the importance of getting this resolved > >>> quickly and I intend to continue investigating. > >>> > >>> Re Loom: > >>> Currently we have a few places which reuse shared objects, potentially > >>> large > >>> ones (thinking StringBuilder instances on > >>> ReusableLogEvent/StringLayout/etc) > >>> which are currently bounded by the number of application threads you have > >>> (which is relatively small compared to the Loom claims, which are > >>> admittedly > >>> large). More importantly, today we tend to reuse almost every thread in a > >>> ThreadPoolExecutor of some kind, where the model in Loom involves creating > >>> a new virtual thread for each task, meaning using a thread-local as a > >>> cache > >>> will have a much lower hit rate as reuse will only occur within a single > >>> task, > >>> and only if ThreadLocals are enabled. I hope that made sense. > >>> > >>> -ck > >>> > >>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote: > >>>> Carter, thanks so much for your great effort and attention on this > >>>> issue. 
*I > >>>> personally find this the most important thing the project needs to > >>>> address with urgency right now.* Ceki is warming up for a new Logback > >>>> (and > >>>> SLF4J?) release. I am pretty sure he will do his part in PR, particularly > >>>> in the context of benchmarking his new work with competitors. Once these > >>>> results are out – independent of whether they make sense or not – > >>>> "internet" will take this as the ultimate truth and it will require a > >>>> tremendous effort to change people's minds. In conclusion, please > >>>> prioritize > >>>> this appropriately and keep us posted. Don't hesitate to let us/me know > >>>> if > >>>> there is anything (drinks, coffee, lunch?) we can help with. > >>>> > >>>> Regarding your remarks about Loom, I don't think I follow you there. Our > >>>> thread-locals are stateless, hence should be perfectly fine while running > >>>> in > >>>> virtual threads too. (I am totally against blindly using TLs, but that is > >>>> another discussion. I would prefer a recycler concept similar to > >>>> the > >>>> one we have in JsonTemplateLayout. I think there was already a mailing > >>>> list > >>>> thread about this.) > >>>> > >>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote: > >>>> > >>>>> Yes, I'm still looking into this. > >>>>> > >>>>> I agree that the zero-garbage code is difficult to follow, and Loom > >>>>> virtual threads will have less reuse so our thread-locals will create > >>>>> more > >>>>> overhead than they're worth in many cases. Fingers crossed for Valhalla > >>>>> to > >>>>> land before Loom, but I haven't been following updates from either > >>>>> project > >>>>> very closely lately, and Loom seems farther along. > >>>>> > >>>>> I think the issue with multiple buffers these days is that we have > >>>>> Optane > >>>>> and NVMe devices which can be _nearly_ as fast as main memory, so > >>>>> copying > >>>>> contents can be problematic. 
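The recycler concept mentioned above (similar in spirit to the one in JsonTemplateLayout) could roughly take the shape of a queue-backed pool instead of a ThreadLocal, so reuse survives short-lived virtual threads. A hedged sketch, not the actual JsonTemplateLayout API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical queue-backed recycler: unlike a ThreadLocal cache, pooled
// instances can be reused across short-lived (e.g. virtual) threads.
final class Recycler<T> {
    private final ArrayBlockingQueue<T> pool;
    private final Supplier<T> factory;
    private final Consumer<T> cleaner;

    Recycler(int capacity, Supplier<T> factory, Consumer<T> cleaner) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
        this.cleaner = cleaner;
    }

    T acquire() {
        T instance = pool.poll();
        // Fall back to allocation when the pool is empty, so correctness
        // never depends on reuse actually happening.
        return instance != null ? instance : factory.get();
    }

    void release(T instance) {
        cleaner.accept(instance);
        pool.offer(instance); // silently dropped if the pool is full
    }
}
```

A layout would acquire a StringBuilder, format into it, and release it; the hit rate then depends on pool capacity rather than on thread identity.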
In fully async mode (with some caveats > >>>>> around > >>>>> queue-full-policy configuration points and nested async appenders) we > >>>>> don't > >>>>> need locking because the appender interactions are single-threaded. > >>>>> > >>>>> For what it's worth, in my testing I didn't see much difference between > >>>>> the 8 kB buffer and 256 kB buffer on a system with a relatively fast > >>>>> nvme > >>>>> drive, however I did discover and resolve LOG4J2-3150 (incorrect default > >>>>> buffer size in the RandomAccessFileAppender). > >>>>> > >>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote: > >>>>>> While that is true it does mean that we are locking higher up in the > >>>>> call stack than we need to be. > >>>>>> > >>>>>> Ralph > >>>>>> > >>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote: > >>>>>>> > >>>>>>> I’m fairly certain the JIT will optimize out the locking operations on > >>>>> the nested synchronized calls after a while. I’m not sure how soon into > >>>>> the > >>>>> performance tests that would happen. > >>>>>>> > >>>>>>> Tim > >>>>>>> > >>>>>>> > >>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers > >>>>>>>> <ralph.go...@dslextreme.com> > >>>>> wrote: > >>>>>>>> > >>>>>>>> So are you continuing to look at this? Frankly, the work on zero-gc > >>>>> stuff made this > >>>>>>>> very complicated. I am not sure I can even follow the logic anymore. > >>>>> I also noticed > >>>>>>>> that many of the methods dealing with byte buffers are synchronized. > >>>>> I am not sure > >>>>>>>> why as only the method that moves data really needs to be. > >>>>>>>> > >>>>>>>> Using multiple buffers would make sense if you are filling one from > >>>>> the various patterns > >>>>>>>> while it is unlocked and then only blocking when writing to the > >>>>> output stream buffer. 
> >>>>>>>> > >>>>>>>> For what it's worth, Ceki updated his performance page again yesterday > >>>>> to change > >>>>>>>> the wording but I am pretty sure he didn’t change the log4j2 > >>>>> configuration to use a > >>>>>>>> 256K buffer to match Logback. And the page still makes it look like > >>>>> he is testing > >>>>>>>> the performance of asynchronous processing when it is really testing > >>>>> the performance > >>>>>>>> of the file appenders with a blocking queue in front of them. > >>>>>>>> > >>>>>>>> Ralph > >>>>>>>> > >>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote: > >>>>>>>>> > >>>>>>>>> I also tried that with similar results, which leads me to believe > >>>>> we're spending > >>>>>>>>> too much time copying between buffers. > >>>>>>>>> > >>>>>>>>> We've proven that log4j can get LogEvents to the appender very > >>>>> quickly and > >>>>>>>>> efficiently. > >>>>>>>>> > >>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. > >>>>> AbstractStringBuilder used to store a char array in Java 8, however with > >>>>> JEP 254 compact strings it was updated (similarly to String) to use a > >>>>> byte > >>>>> array + a byte representing the encoding. > >>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination > >>>>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps: > >>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have > >>>>> been a direct copy from char[] to char[] in Java 8, but requires > >>>>> conversion > >>>>> from byte[]+coder now). > >>>>>>>>> 2. Encode data from the CharBuffer to the StringBuilderEncoder > >>>>> ByteBuffer. > >>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer to the > >>>>> ByteBufferDestination ByteBuffer (which is also always a heap buffer as > >>>>> far > >>>>> as I can tell). > >>>>>>>>> 4. Write the ByteBufferDestination to the > >>>>> FileOutputStream/RandomAccessFile by extracting the heap byte-array. 
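The four copy steps above can be sketched roughly as follows. This is a simplified, single-threaded illustration of the data movement, not Log4j's actual StringBuilderEncoder code:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CopyPipelineSketch {
    static byte[] encode(StringBuilder text) {
        // 1. Copy the StringBuilder value into a CharBuffer (on Java 9+ this
        //    converts from the compact byte[]+coder representation).
        CharBuffer chars = CharBuffer.allocate(text.length());
        for (int i = 0; i < text.length(); i++) {
            chars.put(text.charAt(i));
        }
        chars.flip();

        // 2. Encode the CharBuffer into an intermediate heap ByteBuffer.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer encoded = ByteBuffer.allocate(text.length() * 4);
        encoder.encode(chars, encoded, true);
        encoder.flush(encoded);
        encoded.flip();

        // 3. Copy the intermediate buffer into the destination buffer
        //    (standing in for the ByteBufferDestination's heap buffer).
        ByteBuffer destination = ByteBuffer.allocate(encoded.remaining());
        destination.put(encoded);
        destination.flip();

        // 4. Extract the heap byte array to hand to the output stream.
        byte[] out = new byte[destination.remaining()];
        destination.get(out);
        return out;
    }
}
```

Each step touches every byte, which is why eliminating even one of the intermediate copies matters at these throughputs.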
> >>>>>>>>> > >>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have > >>>>> optimized out a bit of the copying because the types exactly matched. > >>>>>>>>> It's unclear why we need to copy through two separate ByteBuffers > >>>>> from StringBuilderEncoder into ByteBufferDestination. > >>>>>>>>> I've generally had better I/O performance using direct byte-buffers > >>>>> than heap buffers, although that experience largely comes from work on > >>>>> webservers; I haven't tested whether it holds true for file I/O. If we > >>>>> operate > >>>>> directly upon a FileChannel, we can write direct buffers directly to > >>>>> files. > >>>>> The DirectByteBuffer memory is guaranteed to be contiguous, so it should > >>>>> have better performance around copying and manipulation as well. > >>>>>>>>> > >>>>>>>>> -ck > >>>>>>>>> > >>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote: > >>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null > >>>>> to avoid > >>>>>>>>>> the hardware's influence in the test results? > >>>>>>>>>> > >>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers < > >>>>> ralph.go...@dslextreme.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Did you add the no-op appender to Ceki’s project? Or are you using > >>>>> our > >>>>>>>>>>> NullAppender? I have > >>>>>>>>>>> my doubts about using that NullAppender as it does nothing - not > >>>>> even > >>>>>>>>>>> render the Layout, so > >>>>>>>>>>> it may get completely optimized away. > >>>>>>>>>>> > >>>>>>>>>>> I’d still like to know what kind of disks Remko did his original > >>>>> testing on. > >>>>>>>>>>> The page says he got > >>>>>>>>>>> 18.495 million messages per second. Each message would be about > >>>>>>>>>>> 130 > >>>>>>>>>>> characters long > >>>>>>>>>>> from what I can tell. That means the destination has to be able to > >>>>> support > >>>>>>>>>>> at least 2.4 GB > >>>>>>>>>>> per second. 
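Writing through a FileChannel with a direct buffer, as suggested above, might look like the following hedged sketch (class name, path, and buffer size are illustrative, not an existing Log4j manager):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: buffer encoded bytes in a direct ByteBuffer and hand
// it straight to the FileChannel, avoiding the extract-heap-array step.
final class DirectBufferWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer; // direct: contiguous native memory

    DirectBufferWriter(Path path, int bufferSize) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.buffer = ByteBuffer.allocateDirect(bufferSize);
    }

    void write(byte[] data) throws IOException {
        if (data.length > buffer.remaining()) {
            flush();
        }
        buffer.put(data); // assumes data fits in an empty buffer
    }

    void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer); // no intermediate heap array copy
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        flush();
        channel.close();
    }
}
```

Whether this beats a buffered FileOutputStream for sequential file appends would need benchmarking; the claim above about webservers does not automatically transfer.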
> >>>>>>>>>>> > >>>>>>>>>>> In Ceki’s test the message is only 23 characters long (vs 100 in > >>>>> Remko’s). > >>>>>>>>>>> That means the > >>>>>>>>>>> message size is only going to be about 60 characters long. For me > >>>>> the > >>>>>>>>>>> async test is writing > >>>>>>>>>>> at about 1700 ops/ms, which equates to 102 MB/s, yet I am still > >>>>> seeing the > >>>>>>>>>>> queue backed up. > >>>>>>>>>>> So how we are writing must be terribly inefficient. Logback is > >>>>> getting > >>>>>>>>>>> about 2000 ops/ms, > >>>>>>>>>>> which is still only 120 MB/s. > >>>>>>>>>>> > >>>>>>>>>>> Your test below would be generating about 197 MB/s. > >>>>>>>>>>> > >>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has > >>>>> Logback > >>>>>>>>>>> configured to > >>>>>>>>>>> use a buffer size of 256K while Log4j2 uses the default of 8K. > >>>>> Setting > >>>>>>>>>>> Log4j2 to 256K > >>>>>>>>>>> considerably improved the performance. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Ralph > >>>>>>>>>>> > >>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> > >>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> If we would move the appender performance aside, am I right to > >>>>> conclude > >>>>>>>>>>>>> that the entire complex async. framework built atop Disruptor > >>>>> with all > >>>>>>>>>>> its > >>>>>>>>>>>>> whistles and bells is yielding a diminishing performance gain > >>>>> compared > >>>>>>>>>>> to a > >>>>>>>>>>>>> simple off-the-shelf blocking queue? > >>>>>>>>>>>> > >>>>>>>>>>>> I haven't seen any data that would give me such an impression -- > >>>>> the > >>>>>>>>>>> file appender > >>>>>>>>>>>> itself is slower than the thread producing events, which appears > >>>>> to > >>>>>>>>>>> limit our throughput. > >>>>>>>>>>>> Even then, log4j2 does much better in my testing on a Linux > >>>>> workstation > >>>>>>>>>>> with 20+ producer > >>>>>>>>>>>> threads compared to Logback. 
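Matching Logback's 256K buffer on the Log4j2 side of the benchmark would look something like this illustrative log4j2.xml fragment (appender name, file path, and pattern are placeholders; `bufferedIO`/`bufferSize` are the FileAppender attributes, with `bufferSize` defaulting to 8192):

```xml
<Appenders>
  <!-- 256 KiB buffer to match the Logback benchmark configuration -->
  <File name="file" fileName="logs/app.log"
        bufferedIO="true" bufferSize="262144">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
  </File>
</Appenders>
```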
> >>>>>>>>>>>> Any time the queue is constantly full, performance will suffer > >>>>> (and this > >>>>>>>>>>> applies to standard > >>>>>>>>>>>> blocking queues as well as disruptor, however disruptor will fare > >>>>> worse > >>>>>>>>>>> when the queue > >>>>>>>>>>>> is full). > >>>>>>>>>>>> The results using a single thread are roughly reproducible on the > >>>>>>>>>>> non-async test, so I think > >>>>>>>>>>>> the problem is somewhere between the layout, and > >>>>>>>>>>>> appender/manager. > >>>>>>>>>>>> > >>>>>>>>>>>> Results running locally with no-op appender implementations: > >>>>>>>>>>>> > >>>>>>>>>>>> 1 thread: > >>>>>>>>>>>> > >>>>>>>>>>>> Benchmark Mode Cnt > >>>>> Score > >>>>>>>>>>> Error Units > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > >>>>> 3285.743 ± > >>>>>>>>>>> 427.963 ops/ms > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > >>>>> 2383.532 ± > >>>>>>>>>>> 136.034 ops/ms > >>>>>>>>>>>> > >>>>>>>>>>>> 20 threads: > >>>>>>>>>>>> > >>>>>>>>>>>> Benchmark Mode Cnt > >>>>> Score > >>>>>>>>>>> Error Units > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > >>>>> 1177.746 ± > >>>>>>>>>>> 919.560 ops/ms > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > >>>>> 602.614 ± > >>>>>>>>>>> 47.485 ops/ms > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> -ck > >>>>> > >>>> > >>> > > > > > -ck