Carter, thanks so much for your great effort and attention on this issue. *I personally consider this the most important thing the project needs to address with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?) release. I am pretty sure he will do his part in PR (public relations), particularly by benchmarking his new work against competitors. Once these results are out – independent of whether they make sense or not – the internet will take them as the ultimate truth, and it will require a tremendous effort to change people's minds. In conclusion, please prioritize this appropriately and keep us posted. Don't hesitate to let us/me know if there is anything (drinks, coffee, lunch?) we can help with.
Regarding your remarks about Loom, I don't think I follow you there. Our thread-locals are stateless hence should be perfectly fine while running in virtual threads too. (I am totally against blindly using TLs, but that is another discussion. I would rather prefer a recycler concept similar to the one we have in JsonTemplateLayout. I think there was already a mailing list thread about this.) On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote: > Yes, I'm still looking into this. > > I agree that the zero-garbage code is difficult to follow, and Loom > virtual threads will have less reuse so our thread-locals will create more > overhead than they're worth in many cases. Fingers crossed for Valhalla to > land before Loom, but I haven't been following updates from either project > very closely lately, and Loom seems farther along. > > I think the issue with multiple buffers these days is that we have optane > and nvme devices which can be _nearly_ as fast as main memory, so copying > contents can be problematic. In fully async mode (with some caveats around > queue-full-policy configuration points and nested async appenders) we don't > need locking because the appender interactions are single-threaded. > > For what it's worth, in my testing I didn't see much difference between > the 8 kB buffer and 256 kB buffer on a system with a relatively fast nvme > drive, however I did discover and resolve LOG4J2-3150 (incorrect default > buffer size in the RandomAccessFileAppender). > > On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote: > > While that is true it does mean that we are locking higher up in the > call stack than we need to be. > > > > Ralph > > > > > On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote: > > > > > > I’m fairly certain the JIT will optimize out the locking operations on > the nested synchronized calls after a while. I’m not sure how soon into the > performance tests that would happen. 
> > > > > > Tim > > > > > > > > >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com> > wrote: > > >> > > >> So are you continuing to look at this? Frankly, the work on zero-gc > stuff made this > > >> very complicated. I am not sure I can even follow the logic anymore. > I also noticed > > >> that many of the methods dealing with byte buffers are synchronized. > I am not sure > > >> why as only the method that moves data really needs to be. > > >> > > >> Using multiple buffers would make sense if you are filling one from > the various patterns > > >> while it is unlocked and then only blocking when writing to the > output stream buffer. > > >> > > >> For what its worth Ceki updated his performance page again yesterday > to change > > >> the wording but I am pretty sure he didn’t change the log4j2 > configuration to use a > > >> 256K buffer to match Logback. And the page still makes it look like > he is testing > > >> the performance of asynchronous processing when it is really testing > the performance > > >> of the file appenders with a blocking queue in front of them. > > >> > > >> Ralph > > >> > > >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote: > > >>> > > >>> I also tried that with similar results, which leads me to believe > we're spending > > >>> too much time copying between buffers. > > >>> > > >>> We've proven that log4j can get LogEvents to the appender very > quickly and > > >>> efficiently. > > >>> > > >>> Once we hit PatternLayout we write data to a StringBuilder. > AbstractStringBuilder used to store a char array in java 8, however with > jep 254 compact strings it was updated (similarly to String) to use a byte > array + byte representing the encoding. > > >>> We encode the StringBuilder value into a ByteBufferDestination > (Heap)ByteBuffer in StringBuilderEncoder using the following steps: > > >>> 1. 
copy the StringBuilder value into a CharBuffer (this would have > been a direct copy from char[] to char[] in java8, but requires conversion > from byte[]+coder now) > > >>> 2. encode data from the CharBuffer to the StringBuilderEncoder > ByteBuffer > > >>> 3. copy contents of the StringBuilderEncoder ByteBuffer to the > ByteBufferDestination ByteBuffer (which is also always a heap buffer as far > as I can tell) > > >>> 4. Write the ByteBufferDestination to the > FileOutputStream/RandomAccessFile by extracting the heap byte-array. > > >>> > > >>> We're copying bytes around a lot, and I think older JVMs would have > optimized out a bit of the copying because the types exactly matched. > > >>> It's unclear why we need to copy through two separate ByteBuffers > from StringBuilderEncoder into ByteBufferDestination > > >>> I've generally had better I/O performance using direct byte-buffers > than heap buffers, although that experience largely comes from work on > webservers, I haven't tested if it holds true for file I/O. If we operate > directly upon a FileChannel, we can write direct buffers directly to files. > The DirectByteBuffer memory is guaranteed to be contiguous, so it should > have better performance around copying and manipulation as well. > > >>> > > >>> -ck > > >>> > > >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote: > > >>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null > to avoid > > >>>> hardware's influence in the test results? > > >>>> > > >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers < > ralph.go...@dslextreme.com> > > >>>> wrote: > > >>>> > > >>>>> Did you add the no-op appender to Ceki’s project? Or are you using > our > > >>>>> NullAppender? I have > > >>>>> my doubts about using that NullAppender as it does nothing - not > even > > >>>>> render the Layout, so > > >>>>> it may get completely optimized away. > > >>>>> > > >>>>> I’d still like to know what kind of disks Remko did his original > testing. 
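[To make Carter's FileChannel/direct-buffer idea above concrete, here is a minimal sketch, not Log4j internals: all class and method names below are hypothetical. It encodes the StringBuilder's chars straight into a direct ByteBuffer via a CharsetEncoder and writes that buffer to a FileChannel, collapsing the copy chain (CharBuffer, encoder ByteBuffer, ByteBufferDestination ByteBuffer, heap array) into a single encode-and-write step:]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: encode a StringBuilder directly into a direct
 *  ByteBuffer and write it to a FileChannel, avoiding the intermediate
 *  heap-buffer copies discussed in this thread. */
final class DirectFileWriter implements AutoCloseable {
    private final FileChannel channel;
    // A direct buffer is contiguous native memory, so the OS can write
    // from it without first copying to a temporary native buffer.
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    DirectFileWriter(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING);
    }

    void write(StringBuilder text) throws IOException {
        // Single encode step: chars -> bytes, directly into the buffer.
        CharBuffer chars = CharBuffer.wrap(text);
        encoder.reset();
        while (chars.hasRemaining()) {
            encoder.encode(chars, buffer, true);
            flushBuffer(); // drain whenever the buffer fills (OVERFLOW)
        }
        flushBuffer(); // no-op if the last iteration already drained it
    }

    private void flushBuffer() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override public void close() throws IOException { channel.close(); }
}
```

[A real implementation would also handle encoder flush state for stateful charsets; UTF-8 does not need it, so the sketch omits it.]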
> > >>>>> The page says he got > > >>>>> 18.495 million messages per second. Each message would be about 130 > > >>>>> characters long > > >>>>> from I can tell. That means the destination has to be able to > support > > >>>>> about at least 2.4 GB > > >>>>> per second. > > >>>>> > > >>>>> In Ceki’s test the message is only 23 characters long (vs 100 in > Remko’s). > > >>>>> That means the > > >>>>> message size is only going to be about 60 characters long. For me > the > > >>>>> async test is writing > > >>>>> at about 1700 ops/ms which equates to 102 MBs, yet I am still > seeing the > > >>>>> queue backed up. > > >>>>> So how we are writing must be terribly inefficient. Logback is > getting > > >>>>> about 2000 ops/ms, > > >>>>> which is still only 120 MBs. > > >>>>> > > >>>>> Your test below would be generating about 197 MBs. > > >>>>> > > >>>>> One thing I did notice that made a big difference is that Ceki has > Logback > > >>>>> configured to > > >>>>> use a buffer size of 256K while Log4j2 uses the default of 8K. > Setting > > >>>>> Log4j2 to 256K > > >>>>> considerably improved the performance. > > >>>>> > > >>>>> > > >>>>> Ralph > > >>>>> > > >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> > wrote: > > >>>>>> > > >>>>>>> If we would move the appender performance aside, am I right to > conclude > > >>>>>>> that the entire complex async. framework built atop Disruptor > with all > > >>>>> its > > >>>>>>> whistles and bells is yielding a diminishing performance gain > compared > > >>>>> to a > > >>>>>>> simple off the shelf blocking queue? > > >>>>>> > > >>>>>> I haven't seen any data that would give me such an impression -- > the > > >>>>> file appender > > >>>>>> itself is slower than the thread producing events, which appears > to > > >>>>> limit our throughput. > > >>>>>> Even then, log4j2 does much better in my testing on a linux > workstation > > >>>>> with 20+ producer > > >>>>>> threads compared to logback. 
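[The buffer-size mismatch Ralph describes above is a one-attribute configuration change. As an illustrative fragment (not the benchmark's actual config file), a Log4j 2 file appender with its buffer raised from the 8 KiB default to 256 KiB to match the Logback setup could look like:]

```xml
<!-- Illustrative fragment: raising the file appender buffer from the
     8 KiB default to 256 KiB (262144 bytes) to match Logback's setup. -->
<Appenders>
  <File name="file" fileName="app.log" bufferedIO="true" bufferSize="262144">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
  </File>
</Appenders>
```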
> > >>>>>> Any time the queue is constantly full, performance will suffer > (and this > > >>>>> applies to standard > > >>>>>> blocking queues as well as disruptor, however disruptor will fare > worse > > >>>>> when the queue > > >>>>>> is full). > > >>>>>> The results using a single thread are roughly reproducible on the > > >>>>> non-async test, so I think > > >>>>>> the problem is somewhere between the layout, and appender/manager. > > >>>>>> > > >>>>>> Results running locally with no-op appender implementations: > > >>>>>> > > >>>>>> 1 thread: > > >>>>>> > > >>>>>> Benchmark Mode Cnt > Score > > >>>>> Error Units > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > 3285.743 ± > > >>>>> 427.963 ops/ms > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > 2383.532 ± > > >>>>> 136.034 ops/ms > > >>>>>> > > >>>>>> 20 threads: > > >>>>>> > > >>>>>> Benchmark Mode Cnt > Score > > >>>>> Error Units > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > 1177.746 ± > > >>>>> 919.560 ops/ms > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > 602.614 ± > > >>>>> 47.485 ops/ms > > >>>>> > > >>>>> > > >>>>> > > >>>> > > >> > > >> > > > > > > > > > > > -ck >
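[To make the recycler concept from the top of this message concrete: instead of pinning a buffer to each (possibly virtual) thread via a ThreadLocal, a JsonTemplateLayout-style recycler hands instances out of a bounded pool and takes them back, so millions of short-lived virtual threads do not each grow their own buffer. A minimal sketch, with hypothetical names, not the actual JsonTemplateLayout API:]

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch of a queue-backed recycler: acquire() reuses a
 *  pooled instance when one is available, otherwise creates a new one;
 *  release() resets the instance and returns it to the bounded pool. */
final class Recycler<V> {
    private final Supplier<V> supplier;       // creates fresh instances
    private final Consumer<V> cleaner;        // resets state before reuse
    private final ArrayBlockingQueue<V> pool; // bounded: capacity tracks
                                              // concurrency, not threads

    Recycler(Supplier<V> supplier, Consumer<V> cleaner, int capacity) {
        this.supplier = supplier;
        this.cleaner = cleaner;
        this.pool = new ArrayBlockingQueue<>(capacity);
    }

    V acquire() {
        V value = pool.poll();
        return value != null ? value : supplier.get();
    }

    void release(V value) {
        cleaner.accept(value);
        pool.offer(value); // silently dropped if the pool is full
    }
}
```

[Unlike a ThreadLocal, the pool size here is independent of the thread count, which is the property that matters once Loom makes threads cheap and plentiful.]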