I've identified a few things that seem impactful, but bear in mind that I 
haven't begun to drill down into them yet. I plan to file individual tickets 
and investigate in greater depth later on:

1. Formatting the date is expensive. The cost of actually formatting a value 
is reasonable; however, using a precise clock appears to cause cache misses 
even when the pattern produces the same output (microseconds/nanoseconds 
aren't included in the result). Using the SystemMillisClock improves our 
throughput. I imagine part of that improvement comes from calling 
currentTimeMillis(), which returns a primitive long, rather than allocating an 
Instant via Clock.instant().

2. Setting 'log4j2.enable.direct.encoders' to false improves performance; 
otherwise we see a lot of time spent in the StringBuilderEncoder, which I 
haven't had time to investigate yet. Without direct encoders we allocate many 
more byte arrays, but they're confined to a small enough scope that C2 may be 
able to optimize them away.

3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern 
'%m{nolookups}') saves us some time scanning through produced messages -- I've 
had this disabled at work for a while.

4. [in progress] If I implement the full layout directly in Java (writing 
directly to a StringBuilder based on the event in a custom Layout instead of 
using PatternLayout), combined with [1] and [2] above, performance is much 
better than Logback's; however, using PatternLayout with the steps above we 
fall just a bit short. I still need to read through PatternLayout and the 
individual pattern converters to understand why, but there's definitely some 
headroom hiding somewhere that we'll be able to reclaim. A rough sketch of 
what I mean is below.
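
To make the above concrete, here is roughly the shape of what I'm 
experimenting with. Treat it as an illustrative sketch rather than the exact 
benchmark code: the property names are the ones mentioned above (double-check 
spellings against the property reference for your Log4j version), and the 
layout class is a simplified hand-rolled example along the lines of [4].

    // JVM flags corresponding to [1]-[3] above:
    //   -Dlog4j2.clock=SystemMillisClock
    //   -Dlog4j2.enable.direct.encoders=false
    //   -Dlog4j2.formatMsgNoLookups=true

    import java.nio.charset.StandardCharsets;
    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.layout.AbstractStringLayout;
    import org.apache.logging.log4j.message.Message;
    import org.apache.logging.log4j.util.StringBuilderFormattable;

    // Simplified hand-rolled layout in the spirit of [4]: appends directly to
    // a StringBuilder instead of delegating to PatternLayout's converter
    // chain. A real version would need @Plugin wiring and the garbage-free
    // encode path; this only shows the formatting side.
    public final class HandRolledLayout extends AbstractStringLayout {

        public HandRolledLayout() {
            super(StandardCharsets.UTF_8);
        }

        @Override
        public String toSerializable(LogEvent event) {
            StringBuilder sb = getStringBuilder(); // reused, thread-local buffer
            sb.append(event.getTimeMillis())
              .append(' ')
              .append(event.getLevel())
              .append(' ')
              .append(event.getLoggerName())
              .append(" - ");
            Message message = event.getMessage();
            if (message instanceof StringBuilderFormattable) {
                // Format in place without creating an intermediate String.
                ((StringBuilderFormattable) message).formatTo(sb);
            } else {
                sb.append(message.getFormattedMessage());
            }
            return sb.append('\n').toString();
        }
    }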

-ck

On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
> My pleasure; I enjoy digging into this sort of performance comparison. The
> toughest piece right now is balancing investigation with active projects at
> work. I completely agree about the importance of getting this resolved
> quickly, and I intend to continue investigating.
> 
> Re Loom:
> Currently we have a few places that reuse shared objects, potentially large
> ones (think StringBuilder instances on ReusableLogEvent/StringLayout/etc.),
> and the number of those objects is bounded by the number of application
> threads, which is relatively small compared to the admittedly large thread
> counts Loom advertises. More importantly, today we tend to reuse almost every
> thread in a ThreadPoolExecutor of some kind, whereas the Loom model involves
> creating a new virtual thread for each task. That means a thread-local used
> as a cache will have a much lower hit rate, since reuse only occurs within a
> single task, and only if thread-locals are enabled at all. I hope that made
> sense. A rough illustration is below.
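>
> To illustrate (a rough sketch, assuming a recent JDK where the Loom APIs are
> available; exact factory names have shifted between preview builds):
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>
>     public final class ThreadLocalReuseSketch {
>
>         // Cached buffer keyed by thread: effective when threads are pooled
>         // and reused across many tasks.
>         private static final ThreadLocal<StringBuilder> BUFFER =
>                 ThreadLocal.withInitial(() -> new StringBuilder(512));
>
>         public static void main(String[] args) {
>             // Today: a small fixed pool of platform threads runs many tasks
>             // each, so each thread's cached StringBuilder is reused over and
>             // over.
>             try (ExecutorService pooled = Executors.newFixedThreadPool(8)) {
>                 for (int i = 0; i < 10_000; i++) {
>                     pooled.execute(() -> BUFFER.get().setLength(0));
>                 }
>             }
>
>             // Loom model: a brand-new virtual thread per task, so the
>             // thread-local is populated once, used once, and discarded along
>             // with the thread.
>             try (ExecutorService perTask = Executors.newVirtualThreadPerTaskExecutor()) {
>                 for (int i = 0; i < 10_000; i++) {
>                     perTask.execute(() -> BUFFER.get().setLength(0));
>                 }
>             }
>         }
>     }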
> 
> -ck
> 
> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> > Carter, thanks so much for your great effort and attention on this issue. *I
> > personally find this the most important thing the project needs to address
> > with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?)
> > release. I am pretty sure he will do his part in PR, particularly in the
> > context of benchmarking his new work against competitors. Once these
> > results are out, independent of whether they make sense or not, the
> > "internet" will take them as the ultimate truth, and it will require a
> > tremendous effort to change people's minds. In conclusion, please prioritize
> > this appropriately and keep us posted. Don't hesitate to let us/me know if
> > there is anything (drinks, coffee, lunch?) we can help with.
> > 
> > Regarding your remarks about Loom, I don't think I follow you there. Our
> > thread-locals are stateless and hence should be perfectly fine while running
> > in virtual threads too. (I am totally against blindly using TLs, but that is
> > another discussion. I would prefer a recycler concept similar to the one we
> > have in JsonTemplateLayout. I think there was already a mailing list thread
> > about this.)
> > 
> > On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote:
> > 
> > > Yes, I'm still looking into this.
> > >
> > > I agree that the zero-garbage code is difficult to follow, and Loom
> > > virtual threads will have less reuse, so our thread-locals will create more
> > > overhead than they're worth in many cases. Fingers crossed for Valhalla to
> > > land before Loom, but I haven't been following updates from either project
> > > very closely lately, and Loom seems farther along.
> > >
> > > I think the issue with multiple buffers these days is that we have Optane
> > > and NVMe devices which can be _nearly_ as fast as main memory, so copying
> > > contents can be problematic. In fully async mode (with some caveats around
> > > queue-full-policy configuration points and nested async appenders) we don't
> > > need locking, because the appender interactions are single-threaded.
> > >
> > > For what it's worth, in my testing I didn't see much difference between
> > > the 8 kB buffer and the 256 kB buffer on a system with a relatively fast
> > > NVMe drive; however, I did discover and resolve LOG4J2-3150 (an incorrect
> > > default buffer size in the RandomAccessFileAppender).
> > >
> > > On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > > While that is true, it does mean that we are locking higher up in the
> > > > call stack than we need to be.
> > > >
> > > > Ralph
> > > >
> > > > > On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote:
> > > > >
> > > > > I’m fairly certain the JIT will optimize out the locking operations on
> > > > > the nested synchronized calls after a while. I’m not sure how soon into
> > > > > the performance tests that would happen.
> > > > >
> > > > > Tim
> > > > >
> > > > >
> > > > >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > >>
> > > > >> So are you continuing to look at this? Frankly, the work on the zero-GC
> > > > >> stuff made this very complicated. I am not sure I can even follow the
> > > > >> logic anymore. I also noticed that many of the methods dealing with byte
> > > > >> buffers are synchronized. I am not sure why, as only the method that
> > > > >> moves data really needs to be.
> > > > >>
> > > > >> Using multiple buffers would make sense if you are filling one from the
> > > > >> various patterns while it is unlocked and then only blocking when writing
> > > > >> to the output stream buffer.
> > > > >>
> > > > >> For what it’s worth, Ceki updated his performance page again yesterday
> > > > >> to change the wording, but I am pretty sure he didn’t change the Log4j2
> > > > >> configuration to use a 256K buffer to match Logback. And the page still
> > > > >> makes it look like he is testing the performance of asynchronous
> > > > >> processing when it is really testing the performance of the file
> > > > >> appenders with a blocking queue in front of them.
> > > > >>
> > > > >> Ralph
> > > > >>
> > > > >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >>>
> > > > >>> I also tried that with similar results, which leads me to believe
> > > > >>> we're spending too much time copying between buffers.
> > > > >>>
> > > > >>> We've proven that Log4j can get LogEvents to the appender very quickly
> > > > >>> and efficiently.
> > > > >>>
> > > > >>> Once we hit PatternLayout we write data to a StringBuilder.
> > > > >>> AbstractStringBuilder used to store a char array in Java 8; with JEP 254
> > > > >>> (compact strings) it was updated, similarly to String, to use a byte
> > > > >>> array plus a byte representing the encoding.
> > > > >>> We encode the StringBuilder value into a ByteBufferDestination
> > > > >>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > > > >>> 1. Copy the StringBuilder value into a CharBuffer (this would have been
> > > > >>> a direct char[]-to-char[] copy in Java 8, but now requires conversion
> > > > >>> from byte[] + coder).
> > > > >>> 2. Encode data from the CharBuffer to the StringBuilderEncoder ByteBuffer.
> > > > >>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer to the
> > > > >>> ByteBufferDestination ByteBuffer (which is also always a heap buffer, as
> > > > >>> far as I can tell).
> > > > >>> 4. Write the ByteBufferDestination to the
> > > > >>> FileOutputStream/RandomAccessFile by extracting the heap byte array.
> > > > >>>
> > > > >>> We're copying bytes around a lot, and I think older JVMs would have
> > > > >>> optimized out a bit of the copying because the types matched exactly.
> > > > >>> It's unclear why we need to copy through two separate ByteBuffers from
> > > > >>> StringBuilderEncoder into ByteBufferDestination.
> > > > >>> I've generally had better I/O performance using direct byte buffers than
> > > > >>> heap buffers, although that experience largely comes from work on web
> > > > >>> servers; I haven't tested whether it holds true for file I/O. If we
> > > > >>> operate directly upon a FileChannel, we can write direct buffers straight
> > > > >>> to files. The DirectByteBuffer memory is guaranteed to be contiguous, so
> > > > >>> it should have better performance for copying and manipulation as well.
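> > > > >>>
> > > > >>> Roughly what I have in mind -- an untested sketch, not wired into the
> > > > >>> appender/manager lifecycle:
> > > > >>>
> > > > >>>     import java.nio.ByteBuffer;
> > > > >>>     import java.nio.channels.FileChannel;
> > > > >>>     import java.nio.charset.StandardCharsets;
> > > > >>>     import java.nio.file.Path;
> > > > >>>     import java.nio.file.StandardOpenOption;
> > > > >>>
> > > > >>>     public final class DirectBufferWriteSketch {
> > > > >>>         public static void main(String[] args) throws Exception {
> > > > >>>             try (FileChannel channel = FileChannel.open(
> > > > >>>                     Path.of("app.log"),
> > > > >>>                     StandardOpenOption.CREATE,
> > > > >>>                     StandardOpenOption.WRITE,
> > > > >>>                     StandardOpenOption.APPEND)) {
> > > > >>>                 // Encode the layout output straight into a direct buffer...
> > > > >>>                 ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
> > > > >>>                 buffer.put("hello world\n".getBytes(StandardCharsets.UTF_8));
> > > > >>>                 // ...then hand it to the channel without extracting a heap byte[].
> > > > >>>                 buffer.flip();
> > > > >>>                 while (buffer.hasRemaining()) {
> > > > >>>                     channel.write(buffer);
> > > > >>>                 }
> > > > >>>                 buffer.clear();
> > > > >>>             }
> > > > >>>         }
> > > > >>>     }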
> > > > >>>
> > > > >>> -ck
> > > > >>>
> > > > >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > > > >>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null to
> > > > >>>> avoid the hardware's influence on the test results?
> > > > >>>>
> > > > >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > >>>>
> > > > >>>>> Did you add the no-op appender to Ceki’s project? Or are you using our
> > > > >>>>> NullAppender? I have my doubts about using that NullAppender, as it does
> > > > >>>>> nothing - not even render the Layout - so it may get completely
> > > > >>>>> optimized away.
> > > > >>>>>
> > > > >>>>> I’d still like to know what kind of disks Remko did his original
> > > > >>>>> testing on. The page says he got 18.495 million messages per second.
> > > > >>>>> Each message would be about 130 characters long from what I can tell.
> > > > >>>>> That means the destination has to be able to support at least about
> > > > >>>>> 2.4 GB per second.
> > > > >>>>>
> > > > >>>>> In Ceki’s test the message is only 23 characters long (vs 100 in
> > > > >>>>> Remko’s). That means the formatted message is only going to be about
> > > > >>>>> 60 characters long. For me the async test is writing at about 1700
> > > > >>>>> ops/ms, which equates to 102 MB/s, yet I am still seeing the queue
> > > > >>>>> backed up. So how we are writing must be terribly inefficient. Logback
> > > > >>>>> is getting about 2000 ops/ms, which is still only 120 MB/s.
> > > > >>>>>
> > > > >>>>> Your test below would be generating about 197 MB/s.
> > > > >>>>>
> > > > >>>>> One thing I did notice that made a big difference is that Ceki has
> > > > >>>>> Logback configured to use a buffer size of 256K while Log4j2 uses the
> > > > >>>>> default of 8K. Setting Log4j2 to 256K considerably improved the
> > > > >>>>> performance.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Ralph
> > > > >>>>>
> > > > >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >>>>>>
> > > > >>>>>>> If we set the appender performance aside, am I right to conclude
> > > > >>>>>>> that the entire complex async framework built atop Disruptor, with
> > > > >>>>>>> all its bells and whistles, is yielding a diminishing performance
> > > > >>>>>>> gain compared to a simple off-the-shelf blocking queue?
> > > > >>>>>>
> > > > >>>>>> I haven't seen any data that would give me such an impression -- the
> > > > >>>>>> file appender itself is slower than the thread producing events, which
> > > > >>>>>> appears to limit our throughput.
> > > > >>>>>> Even then, Log4j2 does much better than Logback in my testing on a
> > > > >>>>>> Linux workstation with 20+ producer threads.
> > > > >>>>>> Any time the queue is constantly full, performance will suffer (and
> > > > >>>>>> this applies to standard blocking queues as well as the Disruptor,
> > > > >>>>>> though the Disruptor will fare worse when the queue is full).
> > > > >>>>>> The results using a single thread are roughly reproducible on the
> > > > >>>>>> non-async test, so I think the problem is somewhere between the layout
> > > > >>>>>> and the appender/manager.
> > > > >>>>>>
> > > > >>>>>> Results running locally with no-op appender implementations:
> > > > >>>>>>
> > > > >>>>>> 1 thread:
> > > > >>>>>>
> > > > >>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > > > >>>>>>
> > > > >>>>>> 20 threads:
> > > > >>>>>>
> > > > >>>>>> Benchmark                                        Mode  Cnt     Score    Error   Units
> > > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > >
> > >
> > > -ck
> > >
> > 
> 
