Yes, I'm still looking into this. I agree that the zero-garbage code is difficult to follow, and Loom virtual threads will see less reuse, so our thread-locals will create more overhead than they're worth in many cases. Fingers crossed for Valhalla to land before Loom, but I haven't been following updates from either project very closely lately, and Loom seems farther along.
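To make the thread-local concern concrete, here is a minimal sketch (not log4j-core code) of the pattern that stops paying off under Loom: a per-thread cached buffer is only a win when platform threads are pooled and reused, whereas one virtual thread per task means each task allocates its own copy and then throws it away.

final class ThreadLocalBufferSketch {
    // Classic zero-garbage trick: reuse one StringBuilder per thread.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(2048));

    static StringBuilder acquire() {
        StringBuilder sb = BUFFER.get(); // pooled platform thread: reused for the thread's lifetime
        sb.setLength(0);                 // short-lived virtual thread: allocated once, used once, garbage
        return sb;
    }
}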
I think the issue with multiple buffers these days is that we have Optane and NVMe devices which can be _nearly_ as fast as main memory, so copying contents can be problematic. In fully async mode (with some caveats around queue-full-policy configuration points and nested async appenders) we don't need locking, because the appender interactions are single-threaded. For what it's worth, in my testing I didn't see much difference between the 8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe drive; however, I did discover and resolve LOG4J2-3150 (incorrect default buffer size in the RandomAccessFileAppender).

On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> While that is true, it does mean that we are locking higher up in the call stack than we need to be.
>
> Ralph
>
> > On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
> >
> > I'm fairly certain the JIT will optimize out the locking operations on the nested synchronized calls after a while. I'm not sure how soon into the performance tests that would happen.
> >
> > Tim
> >
> >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> wrote:
> >>
> >> So are you continuing to look at this? Frankly, the work on the zero-GC stuff made this very complicated. I am not sure I can even follow the logic anymore. I also noticed that many of the methods dealing with byte buffers are synchronized. I am not sure why, as only the method that moves data really needs to be.
> >>
> >> Using multiple buffers would make sense if you are filling one from the various patterns while it is unlocked and then only blocking when writing to the output stream buffer.
> >>
> >> For what it's worth, Ceki updated his performance page again yesterday to change the wording, but I am pretty sure he didn't change the Log4j2 configuration to use a 256K buffer to match Logback. And the page still makes it look like he is testing the performance of asynchronous processing when it is really testing the performance of the file appenders with a blocking queue in front of them.
> >>
> >> Ralph
> >>
> >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
> >>>
> >>> I also tried that with similar results, which leads me to believe we're spending too much time copying between buffers.
> >>>
> >>> We've proven that log4j can get LogEvents to the appender very quickly and efficiently.
> >>>
> >>> Once we hit PatternLayout we write data to a StringBuilder. AbstractStringBuilder used to store a char array in Java 8; however, with JEP 254 (compact strings) it was updated, similarly to String, to use a byte array plus a byte representing the encoding. We encode the StringBuilder value into a ByteBufferDestination (Heap)ByteBuffer in StringBuilderEncoder using the following steps (sketched below):
> >>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a direct copy from char[] to char[] in Java 8, but requires conversion from byte[]+coder now).
> >>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
> >>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the ByteBufferDestination ByteBuffer (which is also always a heap buffer as far as I can tell).
> >>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile by extracting the heap byte array.
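Concretely, the copy chain above looks roughly like the following (a simplified, hypothetical sketch to illustrate the copies, not the actual StringBuilderEncoder/ByteBufferDestination code; the buffer sizes and single-buffer assumption are mine):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CopyChainSketch {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private final CharBuffer chars = CharBuffer.allocate(8 * 1024);            // step 1 target
    private final ByteBuffer encoderBytes = ByteBuffer.allocate(8 * 1024);     // step 2 target
    private final ByteBuffer destinationBytes = ByteBuffer.allocate(8 * 1024); // step 3 target

    /** Assumes the formatted event fits into a single 8 kB buffer. */
    void encode(StringBuilder text, FileOutputStream out) throws IOException {
        // Step 1: copy the StringBuilder value (byte[] + coder since JEP 254) into a CharBuffer.
        chars.clear();
        text.getChars(0, text.length(), chars.array(), 0);
        chars.limit(text.length());

        // Step 2: encode the chars into the encoder's own heap ByteBuffer.
        encoderBytes.clear();
        encoder.reset();
        encoder.encode(chars, encoderBytes, true);
        encoderBytes.flip();

        // Step 3: copy the encoded bytes into the destination's heap ByteBuffer.
        destinationBytes.clear();
        destinationBytes.put(encoderBytes);
        destinationBytes.flip();

        // Step 4: extract the heap array and hand it to the output stream.
        out.write(destinationBytes.array(), destinationBytes.arrayOffset(), destinationBytes.remaining());
    }
}

That is at least two extra copies of every log line beyond the encode itself, which is what makes the question below about collapsing the two ByteBuffers worth asking.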
> >>>
> >>> We're copying bytes around a lot, and I think older JVMs would have optimized out a bit of the copying because the types matched exactly. It's unclear why we need to copy through two separate ByteBuffers to get from StringBuilderEncoder into ByteBufferDestination.
> >>> I've generally had better I/O performance using direct byte buffers than heap buffers, although that experience largely comes from work on web servers; I haven't tested whether it holds true for file I/O. If we operate directly upon a FileChannel, we can write direct buffers straight to files (see the sketch further down). The DirectByteBuffer memory is guaranteed to be contiguous, so it should have better performance around copying and manipulation as well.
> >>>
> >>> -ck
> >>>
> >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> >>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null to avoid the hardware's influence on the test results?
> >>>>
> >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]> wrote:
> >>>>
> >>>>> Did you add the no-op appender to Ceki's project? Or are you using our NullAppender? I have my doubts about using that NullAppender as it does nothing - not even render the Layout, so it may get completely optimized away.
> >>>>>
> >>>>> I'd still like to know what kind of disks Remko used for his original testing. The page says he got 18.495 million messages per second. Each message would be about 130 characters long from what I can tell. That means the destination has to be able to support at least about 2.4 GB per second.
> >>>>>
> >>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's). That means the rendered message is only going to be about 60 characters long. For me the async test is writing at about 1700 ops/ms, which equates to roughly 102 MB/s, yet I am still seeing the queue backed up. So the way we are writing must be terribly inefficient. Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
> >>>>>
> >>>>> Your test below would be generating about 197 MB/s.
> >>>>>
> >>>>> One thing I did notice that made a big difference is that Ceki has Logback configured to use a buffer size of 256K while Log4j2 uses the default of 8K. Setting Log4j2 to 256K considerably improved the performance.
> >>>>>
> >>>>> Ralph
> >>>>>
> >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
> >>>>>>
> >>>>>>> If we would move the appender performance aside, am I right to conclude that the entire complex async framework built atop Disruptor, with all its whistles and bells, is yielding a diminishing performance gain compared to a simple off-the-shelf blocking queue?
> >>>>>>
> >>>>>> I haven't seen any data that would give me such an impression -- the file appender itself is slower than the thread producing events, which appears to limit our throughput. Even then, log4j2 does much better in my testing on a Linux workstation with 20+ producer threads compared to logback.
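Coming back to the direct-buffer point above: a minimal, hypothetical sketch of what encoding straight into a direct ByteBuffer and writing it through a FileChannel could look like (illustration only, not log4j-core code; it assumes the single-threaded access we get in fully async mode):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectBufferAppendSketch implements AutoCloseable {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Direct buffer: contiguous native memory the channel can write without an extra heap copy.
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
    private final FileChannel channel;

    DirectBufferAppendSketch(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Encodes one formatted event into the direct buffer; drains to the channel only when full. */
    void append(CharSequence formattedEvent) throws IOException {
        CharBuffer chars = CharBuffer.wrap(formattedEvent);
        encoder.reset();
        CoderResult result;
        do {
            result = encoder.encode(chars, buffer, true);
            if (result.isOverflow()) {
                drain(); // buffer is full: hand it to the FileChannel and keep encoding
            }
        } while (!result.isUnderflow());
        while (encoder.flush(buffer).isOverflow()) {
            drain();
        }
    }

    // Writes the buffered bytes straight from native memory; no heap byte[] is extracted.
    private void drain() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        drain();
        channel.close();
    }
}

Whether this actually beats a buffered FileOutputStream for sequential file writes is exactly the open question above - I haven't measured it for file I/O.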
> >>>>>> Any time the queue is constantly full, performance will suffer (and this applies to standard blocking queues as well as Disruptor; however, Disruptor will fare worse when the queue is full). The results using a single thread are roughly reproducible on the non-async test, so I think the problem is somewhere between the layout and the appender/manager.
> >>>>>>
> >>>>>> Results running locally with no-op appender implementations:
> >>>>>>
> >>>>>> 1 thread:
> >>>>>>
> >>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
> >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> >>>>>>
> >>>>>> 20 threads:
> >>>>>>
> >>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
> >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms

-ck
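P.S. For anyone trying to reproduce the numbers above: they come from a JMH throughput benchmark, whose rough shape is something like the following (a hypothetical sketch - the real AsyncWithFileAppenderBenchmark in Ceki's repository differs in its setup and configuration details):

import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS) // reported as ops/ms, matching the tables above
public class AsyncFileThroughputSketch {

    // Assumes a log4j2 configuration with an async appender wrapping a file appender.
    private final Logger logger = LogManager.getLogger(AsyncFileThroughputSketch.class);

    @Benchmark
    public void log4j2AsyncFile() {
        // A short constant message, comparable to the ~23-character message discussed above.
        logger.info("This is a debug message");
    }
}

The 1-thread and 20-thread tables correspond to running the same benchmark with different JMH worker-thread counts (e.g. -t 1 vs -t 20).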
