Yes, I'm still looking into this.

I agree that the zero-garbage code is difficult to follow, and Loom virtual
threads will see less reuse, so our thread-locals will create more overhead
than they're worth in many cases. Fingers crossed for Valhalla to land before
Loom, but I haven't been following updates from either project very closely
lately, and Loom seems farther along.
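
As a concrete illustration of the thread-local concern (a minimal sketch, not
the actual Log4j code): a per-thread cached StringBuilder like the one below
only pays for itself when the same thread formats many events, which pooled
platform threads do and short-lived virtual threads would not.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadLocalReuseSketch {
        // One reusable buffer per thread; cheap when threads are pooled and reused.
        private static final ThreadLocal<StringBuilder> BUFFER =
                ThreadLocal.withInitial(() -> new StringBuilder(512));

        static String format(String message) {
            StringBuilder sb = BUFFER.get();
            sb.setLength(0); // reset instead of reallocating
            return sb.append('[').append(Thread.currentThread().getName())
                     .append("] ").append(message).toString();
        }

        public static void main(String[] args) {
            // On a fixed pool each worker initializes its buffer once and reuses it;
            // a one-shot (e.g. virtual) thread would initialize it, use it once,
            // and discard it along with the thread.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            for (int i = 0; i < 4; i++) {
                int n = i;
                pool.execute(() -> System.out.println(format("event " + n)));
            }
            pool.shutdown();
        }
    }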

I think the issue with multiple buffers these days is that we have Optane and
NVMe devices which can be _nearly_ as fast as main memory, so copying contents
around can be problematic. In fully async mode (with some caveats around
queue-full-policy configuration and nested async appenders) we don't need
locking because the appender interactions are single-threaded.
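
To spell out what I mean by single-threaded, here is a minimal sketch of the
shape, not the actual AsyncLogger/Disruptor code: producers enqueue, one
consumer thread drains the queue, and only that thread ever touches the
appender, so the appender itself needs no locks.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SingleConsumerSketch {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        // Deliberately not thread-safe: only the consumer thread uses it.
        private final StringBuilder buffer = new StringBuilder();

        void start() {
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        String event = queue.take(); // single consumer
                        buffer.setLength(0);
                        buffer.append(event).append('\n');
                        // write the buffer to the file here -- still single-threaded
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "async-appender");
            consumer.setDaemon(true);
            consumer.start();
        }

        void log(String event) throws InterruptedException {
            // Many producer threads; blocking here is the naive queue-full policy.
            queue.put(event);
        }
    }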

For what it's worth, in my testing I didn't see much difference between the
8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe
drive. However, I did discover and resolve LOG4J2-3150 (an incorrect default
buffer size in the RandomAccessFileAppender).
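
If anyone wants to poke at the buffer-size question in isolation, something
like the stand-alone comparison below (plain JDK, not the appender code, and
not a rigorous benchmark) shows how much the stream buffer size matters, or
doesn't, on a given drive.

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class BufferSizeSketch {
        public static void main(String[] args) throws IOException {
            byte[] line = "2021-08-26 12:34:56,789 INFO  [main] some.Logger - hello world\n"
                    .getBytes(StandardCharsets.UTF_8);
            for (int bufferSize : new int[] {8 * 1024, 256 * 1024}) {
                Path file = Files.createTempFile("buffer-test", ".log");
                long start = System.nanoTime();
                try (OutputStream out = new BufferedOutputStream(
                        new FileOutputStream(file.toFile()), bufferSize)) {
                    for (int i = 0; i < 5_000_000; i++) {
                        out.write(line); // same payload, only the buffer size changes
                    }
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("bufferSize=%d -> %d ms%n", bufferSize, elapsedMs);
                Files.delete(file);
            }
        }
    }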

On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> While that is true, it does mean that we are locking higher up in the call
> stack than we need to be.
> 
> Ralph
> 
> > On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
> > 
> > I’m fairly certain the JIT will optimize out the locking operations on the 
> > nested synchronized calls after a while. I’m not sure how soon into the 
> > performance tests that would happen. 
> > 
> > Tim
> > 
> > 
> >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> 
> >> wrote:
> >> 
> >> So are you continuing to look at this? Frankly, the work on the zero-gc
> >> stuff made this very complicated. I am not sure I can even follow the
> >> logic anymore. I also noticed that many of the methods dealing with byte
> >> buffers are synchronized. I am not sure why, as only the method that
> >> moves data really needs to be.
> >> 
> >> Using multiple buffers would make sense if you are filling one from the
> >> various patterns while it is unlocked and then only blocking when writing
> >> to the output stream buffer.
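> >> 
> >> Something along these lines is what I mean (just a sketch of the idea,
> >> not code from our appenders): the formatting and encoding happen with no
> >> lock held, and only the hand-off into the shared output buffer is
> >> synchronized.
> >> 
> >>     import java.io.IOException;
> >>     import java.io.OutputStream;
> >>     import java.nio.ByteBuffer;
> >>     import java.nio.charset.StandardCharsets;
> >> 
> >>     class DoubleBufferSketch {
> >>         private final ByteBuffer outputBuffer = ByteBuffer.allocate(256 * 1024);
> >>         private final OutputStream out;
> >> 
> >>         DoubleBufferSketch(OutputStream out) { this.out = out; }
> >> 
> >>         void append(String formattedEvent) throws IOException {
> >>             // Layout/encoding work happens outside the lock...
> >>             byte[] bytes = formattedEvent.getBytes(StandardCharsets.UTF_8);
> >>             synchronized (outputBuffer) { // ...only the data move is locked
> >>                 if (outputBuffer.remaining() < bytes.length) {
> >>                     flushLocked();
> >>                 }
> >>                 outputBuffer.put(bytes);
> >>             }
> >>         }
> >> 
> >>         private void flushLocked() throws IOException {
> >>             outputBuffer.flip();
> >>             out.write(outputBuffer.array(), 0, outputBuffer.limit());
> >>             outputBuffer.clear();
> >>         }
> >>     }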
> >> 
> >> For what it's worth, Ceki updated his performance page again yesterday
> >> to change the wording, but I am pretty sure he didn’t change the log4j2
> >> configuration to use a 256K buffer to match Logback. And the page still
> >> makes it look like he is testing the performance of asynchronous
> >> processing when it is really testing the performance of the file
> >> appenders with a blocking queue in front of them.
> >> 
> >> Ralph
> >> 
> >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
> >>> 
> >>> I also tried that with similar results, which leads me to believe we're 
> >>> spending
> >>> too much time copying between buffers.
> >>> 
> >>> We've proven that log4j can get LogEvents to the appender very quickly and
> >>> efficiently.
> >>> 
> >>> Once we hit PatternLayout we write data to a StringBuilder.
> >>> AbstractStringBuilder used to store a char array in Java 8; with JEP 254
> >>> (compact strings) it was updated, similarly to String, to use a byte
> >>> array plus a byte representing the encoding.
> >>> We encode the StringBuilder value into a ByteBufferDestination
> >>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> >>> 1. Copy the StringBuilder value into a CharBuffer (this would have been
> >>> a direct char[]-to-char[] copy in Java 8, but requires conversion from
> >>> byte[]+coder now).
> >>> 2. Encode data from the CharBuffer into the StringBuilderEncoder
> >>> ByteBuffer.
> >>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the
> >>> ByteBufferDestination ByteBuffer (which is also always a heap buffer as
> >>> far as I can tell).
> >>> 4. Write the ByteBufferDestination to the
> >>> FileOutputStream/RandomAccessFile by extracting the heap byte array.
> >>> 
> >>> We're copying bytes around a lot, and I think older JVMs would have
> >>> optimized out a bit of the copying because the types matched exactly.
> >>> It's unclear why we need to copy through two separate ByteBuffers to get
> >>> from the StringBuilderEncoder into the ByteBufferDestination.
> >>> I've generally had better I/O performance using direct byte buffers than
> >>> heap buffers, although that experience largely comes from work on web
> >>> servers, and I haven't tested whether it holds true for file I/O. If we
> >>> operate directly upon a FileChannel, we can write direct buffers straight
> >>> to files. DirectByteBuffer memory is guaranteed to be contiguous, so it
> >>> should have better performance around copying and manipulation as well.
> >>> 
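> >>> In plain JDK terms the chain looks roughly like the sketch below (a
> >>> simplified, hand-written illustration, not the actual Log4j classes);
> >>> the second method shows the direct-buffer/FileChannel alternative I'm
> >>> describing:
> >>> 
> >>>     import java.nio.ByteBuffer;
> >>>     import java.nio.CharBuffer;
> >>>     import java.nio.channels.FileChannel;
> >>>     import java.nio.charset.CharsetEncoder;
> >>>     import java.nio.charset.StandardCharsets;
> >>> 
> >>>     class EncodePipelineSketch {
> >>>         private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
> >>>         private final CharBuffer charBuffer = CharBuffer.allocate(4096);
> >>>         private final ByteBuffer encoderBuffer = ByteBuffer.allocate(8192);
> >>>         private final ByteBuffer destinationBuffer = ByteBuffer.allocate(8192);
> >>> 
> >>>         void write(StringBuilder text, FileChannel channel) throws Exception {
> >>>             // 1. copy the StringBuilder (byte[] + coder since JEP 254) into a CharBuffer
> >>>             charBuffer.clear();
> >>>             charBuffer.append(text);
> >>>             charBuffer.flip();
> >>>             // 2. encode the chars into the encoder-owned ByteBuffer
> >>>             encoderBuffer.clear();
> >>>             encoder.reset();
> >>>             encoder.encode(charBuffer, encoderBuffer, true);
> >>>             encoderBuffer.flip();
> >>>             // 3. copy into the destination's heap ByteBuffer
> >>>             destinationBuffer.clear();
> >>>             destinationBuffer.put(encoderBuffer);
> >>>             destinationBuffer.flip();
> >>>             // 4. write the destination buffer out (the real code hands the
> >>>             //    backing byte[] to the FileOutputStream/RandomAccessFile)
> >>>             channel.write(destinationBuffer);
> >>>             destinationBuffer.clear();
> >>>         }
> >>> 
> >>>         // Alternative: encode straight into a direct buffer and write it to a
> >>>         // FileChannel, skipping the intermediate heap copies.
> >>>         void writeDirect(StringBuilder text, FileChannel channel) throws Exception {
> >>>             ByteBuffer direct = ByteBuffer.allocateDirect(8192);
> >>>             encoder.reset();
> >>>             encoder.encode(CharBuffer.wrap(text), direct, true);
> >>>             direct.flip();
> >>>             channel.write(direct);
> >>>         }
> >>>     }
> >>> 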
> >>> -ck
> >>> 
> >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> >>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null to
> >>>> avoid the hardware's influence on the test results?
> >>>> 
> >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]>
> >>>> wrote:
> >>>> 
> >>>>> Did you add the no-op appender to Ceki’s project? Or are you using
> >>>>> our NullAppender? I have my doubts about using that NullAppender as it
> >>>>> does nothing - not even render the Layout, so it may get completely
> >>>>> optimized away.
> >>>>> 
> >>>>> I’d still like to know what kind of disks Remko did his original
> >>>>> testing on. The page says he got 18.495 million messages per second.
> >>>>> Each message would be about 130 characters long from what I can tell.
> >>>>> That means the destination has to be able to support at least 2.4 GB
> >>>>> per second.
> >>>>> 
> >>>>> In Ceki’s test the message is only 23 characters long (vs 100 in
> >>>>> Remko’s). That means the formatted message is only going to be about
> >>>>> 60 characters long. For me the async test is writing at about 1700
> >>>>> ops/ms, which equates to 102 MB/s, yet I am still seeing the queue
> >>>>> backed up. So how we are writing must be terribly inefficient. Logback
> >>>>> is getting about 2000 ops/ms, which is still only 120 MB/s.
> >>>>> 
> >>>>> Your test below would be generating about 197 MB/s.
> >>>>> 
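> >>>>> (Spelling out the arithmetic: 18.495 million msg/s x ~130 bytes is
> >>>>> roughly 2.4 GB/s; 1700 ops/ms x ~60 bytes is about 102 MB/s; 2000
> >>>>> ops/ms x ~60 bytes is about 120 MB/s; and 3286 ops/ms x ~60 bytes is
> >>>>> about 197 MB/s.)
> >>>>> 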
> >>>>> One thing I did notice that made a big difference is that Ceki has
> >>>>> Logback configured to use a buffer size of 256K while Log4j2 uses the
> >>>>> default of 8K. Setting Log4j2 to 256K considerably improved the
> >>>>> performance.
> >>>>> 
> >>>>> 
> >>>>> Ralph
> >>>>> 
> >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
> >>>>>> 
> >>>>>>> If we would move the appender performance aside, am I right to
> >>>>>>> conclude that the entire complex async. framework built atop
> >>>>>>> Disruptor with all its whistles and bells is yielding a diminishing
> >>>>>>> performance gain compared to a simple off-the-shelf blocking queue?
> >>>>>> 
> >>>>>> I haven't seen any data that would give me such an impression -- the
> >>>>>> file appender itself is slower than the thread producing events, which
> >>>>>> appears to limit our throughput. Even then, log4j2 does much better
> >>>>>> than logback in my testing on a Linux workstation with 20+ producer
> >>>>>> threads.
> >>>>>> Any time the queue is constantly full, performance will suffer (and
> >>>>>> this applies to standard blocking queues as well as the Disruptor;
> >>>>>> however, the Disruptor will fare worse when the queue is full).
> >>>>>> The results using a single thread are roughly reproducible on the
> >>>>>> non-async test, so I think the problem is somewhere between the layout
> >>>>>> and the appender/manager.
> >>>>>> 
> >>>>>> Results running locally with no-op appender implementations:
> >>>>>> 
> >>>>>> 1 thread:
> >>>>>> 
> >>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> >>>>>> 
> >>>>>> 20 threads:
> >>>>>> 
> >>>>>> Benchmark                                        Mode  Cnt     Score    Error   Units
> >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> >>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >> 
> >> 
> > 
> 
> 
> 

-ck
