I've identified a few things that seem impactful, but bear in mind that I haven't begun to drill down into them yet. I plan to file individual tickets and investigate in greater depth later on:
1. Formatting the date is expensive. The process of actually formatting a value is
   reasonable; however, using a precise clock appears to cause cache misses even when
   the pattern produces the same value (microseconds/nanoseconds aren't included in the
   result). Using the SystemMillisClock improves our throughput. I imagine some part of
   that improvement comes from calling currentTimeMillis(): long rather than
   Clock.instant().
2. Setting 'log4j2.enable.direct.encoders' to false improves performance; otherwise we
   see a lot of time spent in the StringBuilderEncoder, but I haven't had time to
   investigate that yet. Without direct encoders we allocate a great deal more byte
   arrays, but they're used within a small enough scope that C2 may be able to optimize
   the allocations away.
3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern '%m{nolookups}')
   saves us some time scanning through produced messages -- I've had lookups disabled at
   work for a while.
4. [in progress] If I implement the full layout directly in Java (write directly to a
   StringBuilder based on the event in a custom Layout instead of using PatternLayout),
   combined with [1] and [2] above, performance is much better than Logback; using
   PatternLayout with the steps above, however, we fall just a bit short. I still need to
   read through PatternLayout and the individual pattern converters to understand why,
   but there's definitely some headroom we'll be able to reclaim hiding somewhere. Rough
   sketches of [1]-[3] and the hand-rolled layout idea follow below.

-ck
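Sketching [1]-[3] as a log4j2.component.properties file -- treat this as an
illustration rather than the exact configuration I ran, and note that the clock
property name here is from memory, so it's worth verifying against the docs:

    # Items 1-3 from the list above (property names worth double-checking).
    log4j2.clock=SystemMillisClock
    log4j2.enable.direct.encoders=false
    log4j2.formatMsgNoLookups=true

And a minimal sketch of the kind of hand-rolled layout I mean in [4]. The class name and
output format are made up for illustration; it would be wired up programmatically for
benchmarking, and a real version would add the @Plugin annotation plus a factory method
so it could be referenced from configuration:

    import java.nio.charset.StandardCharsets;

    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.layout.AbstractStringLayout;
    import org.apache.logging.log4j.message.Message;
    import org.apache.logging.log4j.util.StringBuilderFormattable;

    // Hypothetical layout that writes "<millis> <level> <logger> - <message>" straight
    // into a StringBuilder, bypassing PatternLayout's converter chain entirely.
    final class HandRolledLayout extends AbstractStringLayout {

        HandRolledLayout() {
            super(StandardCharsets.UTF_8);
        }

        @Override
        public String toSerializable(final LogEvent event) {
            final StringBuilder sb = getStringBuilder(); // reusable thread-local buffer
            sb.append(event.getTimeMillis()).append(' ')
              .append(event.getLevel().name()).append(' ')
              .append(event.getLoggerName()).append(" - ");
            final Message message = event.getMessage();
            if (message instanceof StringBuilderFormattable) {
                // Reusable messages can format themselves without allocating a String.
                ((StringBuilderFormattable) message).formatTo(sb);
            } else {
                sb.append(message.getFormattedMessage());
            }
            return sb.append(System.lineSeparator()).toString();
        }
    }
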
On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
> My pleasure, I enjoy digging into this sort of performance comparison; the
> toughest piece right now is balancing investigation with active projects at
> work. I completely agree about the importance of getting this resolved
> quickly and I intend to continue investigating.
>
> Re loom:
> Currently we have a few places which reuse shared objects, potentially large
> ones (thinking StringBuilder instances on ReusableLogEvent/StringLayout/etc)
> which are currently bounded to the number of application threads you have
> (which is relatively small compared to the loom claims, which are admittedly
> large). More importantly, today we tend to reuse almost every thread in a
> ThreadPoolExecutor of some kind, whereas the model in loom involves creating
> a new virtual thread for each task, meaning a thread-local used as a cache
> will have a much lower hit rate: reuse will only occur within a single task,
> and only if ThreadLocals are enabled. I hope that made sense.
>
> -ck
>
> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> > Carter, thanks so much for your great effort and attention on this issue. *I
> > personally find this the single most important thing the project needs to
> > address with urgency right now.* Ceki is warming up for a new Logback (and
> > SLF4J?) release. I am pretty sure he will do his part in PR, particularly
> > in the context of benchmarking his new work with competitors. Once these
> > results are out – independent of whether they make sense or not –
> > "internet" will take this as the ultimate truth and it will require a
> > tremendous effort to change people's minds. In conclusion, please prioritize
> > this appropriately and keep us posted. Don't hesitate to let us/me know if
> > there is anything (drinks, coffee, lunch?) we can help with.
> >
> > Regarding your remarks about Loom, I don't think I follow you there. Our
> > thread-locals are stateless hence should be perfectly fine while running in
> > virtual threads too. (I am totally against blindly using TLs, but that is
> > another discussion. I would rather prefer a recycler concept similar to the
> > one we have in JsonTemplateLayout. I think there was already a mailing list
> > thread about this.)
> >
> > On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote:
> > >
> > > Yes, I'm still looking into this.
> > >
> > > I agree that the zero-garbage code is difficult to follow, and Loom
> > > virtual threads will have less reuse, so our thread-locals will create
> > > more overhead than they're worth in many cases. Fingers crossed for
> > > Valhalla to land before Loom, but I haven't been following updates from
> > > either project very closely lately, and Loom seems farther along.
> > >
> > > I think the issue with multiple buffers these days is that we have optane
> > > and nvme devices which can be _nearly_ as fast as main memory, so copying
> > > contents can be problematic. In fully async mode (with some caveats around
> > > queue-full-policy configuration points and nested async appenders) we
> > > don't need locking because the appender interactions are single-threaded.
> > >
> > > For what it's worth, in my testing I didn't see much difference between
> > > the 8 kB buffer and 256 kB buffer on a system with a relatively fast nvme
> > > drive; however, I did discover and resolve LOG4J2-3150 (incorrect default
> > > buffer size in the RandomAccessFileAppender).
> > >
> > > On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > > While that is true it does mean that we are locking higher up in the
> > > > call stack than we need to be.
> > > >
> > > > Ralph
> > > >
> > > > > On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote:
> > > > >
> > > > > I'm fairly certain the JIT will optimize out the locking operations
> > > > > on the nested synchronized calls after a while. I'm not sure how soon
> > > > > into the performance tests that would happen.
> > > > >
> > > > > Tim
> > > > >
> > > > >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > >>
> > > > >> So are you continuing to look at this? Frankly, the work on zero-gc
> > > > >> stuff made this very complicated. I am not sure I can even follow the
> > > > >> logic anymore. I also noticed that many of the methods dealing with
> > > > >> byte buffers are synchronized. I am not sure why, as only the method
> > > > >> that moves data really needs to be.
> > > > >>
> > > > >> Using multiple buffers would make sense if you are filling one from
> > > > >> the various patterns while it is unlocked and then only blocking when
> > > > >> writing to the output stream buffer.
> > > > >>
> > > > >> For what it's worth, Ceki updated his performance page again yesterday
> > > > >> to change the wording, but I am pretty sure he didn't change the
> > > > >> log4j2 configuration to use a 256K buffer to match Logback. And the
> > > > >> page still makes it look like he is testing the performance of
> > > > >> asynchronous processing when it is really testing the performance of
> > > > >> the file appenders with a blocking queue in front of them.
> > > > >>
> > > > >> Ralph
> > > > >>
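As an aside on the buffer-size point above: on the Log4j2 side that change is a single
appender attribute. A sketch in the properties configuration format, with placeholder
names and paths rather than the benchmark project's actual configuration:

    appender.file.type = RandomAccessFile
    appender.file.name = FILE
    appender.file.fileName = logs/app.log
    appender.file.immediateFlush = false
    # Default is 8 KiB; this matches the 256 KiB buffer Logback was given.
    appender.file.bufferSize = 262144
    appender.file.layout.type = PatternLayout
    appender.file.layout.pattern = %d %p %c{1.} [%t] %m%n
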
> > > > >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >>>
> > > > >>> I also tried that with similar results, which leads me to believe
> > > > >>> we're spending too much time copying between buffers.
> > > > >>>
> > > > >>> We've proven that log4j can get LogEvents to the appender very
> > > > >>> quickly and efficiently.
> > > > >>>
> > > > >>> Once we hit PatternLayout we write data to a StringBuilder.
> > > > >>> AbstractStringBuilder used to store a char array in Java 8, but with
> > > > >>> JEP 254 (compact strings) it was updated, similarly to String, to use
> > > > >>> a byte array plus a byte representing the encoding.
> > > > >>> We encode the StringBuilder value into a ByteBufferDestination
> > > > >>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > > > >>> 1. copy the StringBuilder value into a CharBuffer (this would have
> > > > >>>    been a direct char[]-to-char[] copy in Java 8, but requires
> > > > >>>    conversion from byte[]+coder now)
> > > > >>> 2. encode data from the CharBuffer into the StringBuilderEncoder
> > > > >>>    ByteBuffer
> > > > >>> 3. copy the contents of the StringBuilderEncoder ByteBuffer into the
> > > > >>>    ByteBufferDestination ByteBuffer (which is also always a heap
> > > > >>>    buffer as far as I can tell)
> > > > >>> 4. write the ByteBufferDestination to the
> > > > >>>    FileOutputStream/RandomAccessFile by extracting the heap byte
> > > > >>>    array.
> > > > >>>
> > > > >>> We're copying bytes around a lot, and I think older JVMs would have
> > > > >>> optimized out a bit of the copying because the types exactly matched.
> > > > >>> It's unclear why we need to copy through two separate ByteBuffers
> > > > >>> from StringBuilderEncoder into ByteBufferDestination.
> > > > >>> I've generally had better I/O performance using direct byte buffers
> > > > >>> than heap buffers, although that experience largely comes from work
> > > > >>> on webservers; I haven't tested whether it holds true for file I/O.
> > > > >>> If we operate directly upon a FileChannel, we can write direct
> > > > >>> buffers directly to files. The DirectByteBuffer memory is guaranteed
> > > > >>> to be contiguous, so it should have better performance around copying
> > > > >>> and manipulation as well.
> > > > >>>
> > > > >>> -ck
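A rough illustration of that FileChannel idea: encode the layout's StringBuilder through
a CharsetEncoder into a single reusable direct ByteBuffer and hand it straight to a
FileChannel, skipping the intermediate heap-array copies described above. This is a
sketch with made-up names, not the actual StringBuilderEncoder/ByteBufferDestination
code path:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch: StringBuilder -> CharsetEncoder -> direct ByteBuffer -> FileChannel.
    final class DirectBufferWriter implements AutoCloseable {

        private final FileChannel channel;
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

        DirectBufferWriter(Path file) throws IOException {
            this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void write(StringBuilder text) throws IOException {
            // CharBuffer.wrap(CharSequence) is a view; no char[] copy of the builder.
            CharBuffer chars = CharBuffer.wrap(text);
            encoder.reset();
            while (true) {
                // Encode as much as fits, drain to the channel, repeat until done.
                boolean done = encoder.encode(chars, buffer, true).isUnderflow();
                drain();
                if (done) {
                    break;
                }
            }
            encoder.flush(buffer); // a no-op for UTF-8, but required in general
            drain();
        }

        private void drain() throws IOException {
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            buffer.clear();
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

Whether this actually beats the existing heap-buffer path for file I/O is exactly the
part I still need to measure.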
> > > > >>>
> > > > >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > > > >>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null
> > > > >>>> to avoid hardware's influence in the test results?
> > > > >>>>
> > > > >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > >>>>
> > > > >>>>> Did you add the no-op appender to Ceki's project? Or are you using
> > > > >>>>> our NullAppender? I have my doubts about using that NullAppender as
> > > > >>>>> it does nothing - not even render the Layout, so it may get
> > > > >>>>> completely optimized away.
> > > > >>>>>
> > > > >>>>> I'd still like to know what kind of disks Remko did his original
> > > > >>>>> testing on. The page says he got 18.495 million messages per
> > > > >>>>> second. Each message would be about 130 characters long from what
> > > > >>>>> I can tell. That means the destination has to be able to support
> > > > >>>>> at least about 2.4 GB per second.
> > > > >>>>>
> > > > >>>>> In Ceki's test the message is only 23 characters long (vs 100 in
> > > > >>>>> Remko's). That means the message size is only going to be about
> > > > >>>>> 60 characters long. For me the async test is writing at about
> > > > >>>>> 1700 ops/ms, which equates to 102 MB/s, yet I am still seeing the
> > > > >>>>> queue backed up. So how we are writing must be terribly
> > > > >>>>> inefficient. Logback is getting about 2000 ops/ms, which is still
> > > > >>>>> only 120 MB/s.
> > > > >>>>>
> > > > >>>>> Your test below would be generating about 197 MB/s.
> > > > >>>>>
> > > > >>>>> One thing I did notice that made a big difference is that Ceki has
> > > > >>>>> Logback configured to use a buffer size of 256K while Log4j2 uses
> > > > >>>>> the default of 8K. Setting Log4j2 to 256K considerably improved the
> > > > >>>>> performance.
> > > > >>>>>
> > > > >>>>> Ralph
> > > > >>>>>
> > > > >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >>>>>>
> > > > >>>>>>> If we would move the appender performance aside, am I right to
> > > > >>>>>>> conclude that the entire complex async. framework built atop
> > > > >>>>>>> Disruptor with all its whistles and bells is yielding a
> > > > >>>>>>> diminishing performance gain compared to a simple off-the-shelf
> > > > >>>>>>> blocking queue?
> > > > >>>>>>
> > > > >>>>>> I haven't seen any data that would give me such an impression --
> > > > >>>>>> the file appender itself is slower than the thread producing
> > > > >>>>>> events, which appears to limit our throughput. Even then, log4j2
> > > > >>>>>> does much better in my testing on a Linux workstation with 20+
> > > > >>>>>> producer threads compared to Logback.
> > > > >>>>>> Any time the queue is constantly full, performance will suffer
> > > > >>>>>> (and this applies to standard blocking queues as well as
> > > > >>>>>> disruptor; however, disruptor will fare worse when the queue is
> > > > >>>>>> full).
> > > > >>>>>> The results using a single thread are roughly reproducible on the
> > > > >>>>>> non-async test, so I think the problem is somewhere between the
> > > > >>>>>> layout and the appender/manager.
> > > > >>>>>>
> > > > >>>>>> Results running locally with no-op appender implementations:
> > > > >>>>>>
> > > > >>>>>> 1 thread:
> > > > >>>>>>
> > > > >>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
> > > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > > > >>>>>>
> > > > >>>>>> 20 threads:
> > > > >>>>>>
> > > > >>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
> > > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > >
> > > -ck
> >