My pleasure; I enjoy digging into this sort of performance comparison. The
toughest piece right now is balancing the investigation with active projects at
work. I completely agree about the importance of getting this resolved
quickly, and I intend to continue investigating.

Re Loom:
We currently have a few places that reuse shared objects, potentially large
ones (think StringBuilder instances on ReusableLogEvent/StringLayout/etc.),
and the number of those objects is bounded by the number of application
threads (which is relatively small compared to the thread counts Loom
advertises, which are admittedly large). More importantly, today we tend to
reuse almost every thread through a ThreadPoolExecutor of some kind, whereas
the Loom model creates a new virtual thread for each task. That means a
thread-local used as a cache will have a much lower hit rate, since reuse only
occurs within a single task, and only if thread-locals are enabled. I hope
that makes sense; a rough sketch of the pattern I mean is below.
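
To make that concrete, here is a minimal sketch of the caching pattern I have
in mind (hypothetical names, not our actual classes): a thread-local
StringBuilder only pays off when the same platform thread formats many events,
which a pooled executor gives you but a virtual-thread-per-task model does not.

    // Hypothetical illustration, not the real log4j2 code: a thread-local
    // StringBuilder cache only helps when the same thread formats many events.
    final class CachedFormatter {
        private static final ThreadLocal<StringBuilder> BUFFER =
                ThreadLocal.withInitial(() -> new StringBuilder(512));

        static String format(String message) {
            StringBuilder sb = BUFFER.get();
            sb.setLength(0);            // reuse this thread's builder across events
            return sb.append(message).toString();
        }
    }

    // With a pooled executor, a handful of platform threads each reuse one
    // builder indefinitely. With Loom's thread-per-task model, every task runs
    // on a fresh virtual thread, so BUFFER.get() allocates a new builder almost
    // every time (a near-zero hit rate), and only does so at all if
    // thread-locals are enabled for those threads.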

-ck

On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> Carter, thanks so much for your great effort and attention on this issue. *I
> personally find this the most important thing the project needs to address
> with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?)
> release. I am pretty sure he will do his part in PR, particularly in the
> context of benchmarking his new work against competitors. Once those results
> are out, independent of whether they make sense or not, the "internet" will
> take them as the ultimate truth and it will require a tremendous effort to
> change people's minds. In conclusion, please prioritize this appropriately
> and keep us posted. Don't hesitate to let us/me know if there is anything
> (drinks, coffee, lunch?) we can help with.
> 
> Regarding your remarks about Loom, I don't think I follow you there. Our
> thread-locals are stateless, hence they should be perfectly fine when running
> in virtual threads too. (I am totally against blindly using TLs, but that is
> another discussion. I would prefer a recycler concept similar to the one we
> have in JsonTemplateLayout. I think there was already a mailing list thread
> about this.)
> 
> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote:
> 
> > Yes, I'm still looking into this.
> >
> > I agree that the zero-garbage code is difficult to follow, and Loom
> > virtual threads will have less reuse so our thread-locals will create more
> > overhead than they're worth in many cases. Fingers crossed for Valhalla to
> > land before Loom, but I haven't been following updates from either project
> > very closely lately, and Loom seems farther along.
> >
> > I think the issue with multiple buffers these days is that we have Optane
> > and NVMe devices which can be _nearly_ as fast as main memory, so copying
> > contents can be problematic. In fully async mode (with some caveats around
> > queue-full-policy configuration points and nested async appenders) we don't
> > need locking because the appender interactions are single-threaded.
> >
> > For what it's worth, in my testing I didn't see much difference between
> > the 8 kB buffer and the 256 kB buffer on a system with a relatively fast
> > NVMe drive; however, I did discover and resolve LOG4J2-3150 (an incorrect
> > default buffer size in the RandomAccessFileAppender).
> >
> > On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > While that is true, it does mean that we are locking higher up in the
> > > call stack than we need to be.
> > >
> > > Ralph
> > >
> > > > On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote:
> > > >
> > > > I’m fairly certain the JIT will optimize out the locking operations on
> > > > the nested synchronized calls after a while. I’m not sure how soon into
> > > > the performance tests that would happen.
> > > >
> > > > Tim
> > > >
> > > >
> > > >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > >>
> > > >> So are you continuing to look at this? Frankly, the work on the zero-gc
> > > >> stuff made this very complicated. I am not sure I can even follow the
> > > >> logic anymore. I also noticed that many of the methods dealing with byte
> > > >> buffers are synchronized. I am not sure why, as only the method that
> > > >> moves data really needs to be.
> > > >>
> > > >> Using multiple buffers would make sense if you are filling one from the
> > > >> various patterns while it is unlocked and then only blocking when
> > > >> writing to the output stream buffer.
> > > >>
> > > >> For what it’s worth, Ceki updated his performance page again yesterday
> > > >> to change the wording, but I am pretty sure he didn’t change the log4j2
> > > >> configuration to use a 256K buffer to match Logback. And the page still
> > > >> makes it look like he is testing the performance of asynchronous
> > > >> processing when it is really testing the performance of the file
> > > >> appenders with a blocking queue in front of them.
> > > >>
> > > >> Ralph
> > > >>
> > > >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > >>>
> > > >>> I also tried that with similar results, which leads me to believe we're
> > > >>> spending too much time copying between buffers.
> > > >>>
> > > >>> We've proven that log4j can get LogEvents to the appender very quickly
> > > >>> and efficiently.
> > > >>>
> > > >>> Once we hit PatternLayout we write data to a StringBuilder.
> > > >>> AbstractStringBuilder used to store a char array in Java 8; with JEP 254
> > > >>> (compact strings) it was updated, like String, to use a byte array plus
> > > >>> a byte representing the encoding.
> > > >>> We encode the StringBuilder value into a ByteBufferDestination
> > > >>> (heap) ByteBuffer in StringBuilderEncoder using the following steps
> > > >>> (a rough sketch of this chain follows the list):
> > > >>> 1. Copy the StringBuilder value into a CharBuffer (this would have been
> > > >>> a direct copy from char[] to char[] in Java 8, but now requires
> > > >>> conversion from byte[]+coder).
> > > >>> 2. Encode data from the CharBuffer into the StringBuilderEncoder
> > > >>> ByteBuffer.
> > > >>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the
> > > >>> ByteBufferDestination ByteBuffer (which is also always a heap buffer as
> > > >>> far as I can tell).
> > > >>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile
> > > >>> by extracting the heap byte array.
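
In simplified Java, the copy chain in those four steps looks roughly like this
(names and buffer sizes are illustrative, not the actual StringBuilderEncoder
or ByteBufferDestination code):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    // Illustrative sketch of steps 1-4 above, not the real log4j2 implementation.
    final class CopyChainSketch {
        static byte[] encode(StringBuilder text) {
            // 1. StringBuilder -> char[] (byte[]+coder conversion since JEP 254)
            CharBuffer chars = CharBuffer.allocate(text.length());
            text.getChars(0, text.length(), chars.array(), 0);

            // 2. chars -> the encoder's ByteBuffer
            CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
            ByteBuffer encoded = ByteBuffer.allocate(8 * 1024);
            encoder.encode(chars, encoded, true);
            encoded.flip();

            // 3. copy into the destination's heap ByteBuffer
            ByteBuffer destination = ByteBuffer.allocate(8 * 1024);
            destination.put(encoded);
            destination.flip();

            // 4. extract the heap bytes for FileOutputStream/RandomAccessFile.write
            byte[] out = new byte[destination.remaining()];
            destination.get(out);
            return out;
        }
    }
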
> > > >>>
> > > >>> We're copying bytes around a lot, and I think older JVMs would have
> > > >>> optimized out a bit of the copying because the types exactly matched.
> > > >>> It's unclear why we need to copy through two separate ByteBuffers on the
> > > >>> way from StringBuilderEncoder into ByteBufferDestination.
> > > >>> I've generally had better I/O performance using direct byte buffers than
> > > >>> heap buffers, although that experience largely comes from work on web
> > > >>> servers; I haven't tested whether it holds true for file I/O. If we
> > > >>> operate directly upon a FileChannel, we can write direct buffers straight
> > > >>> to files. The DirectByteBuffer memory is guaranteed to be contiguous, so
> > > >>> it should have better performance around copying and manipulation as
> > > >>> well. A rough sketch of the idea is below.
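
A minimal sketch of that approach, assuming UTF-8 output and a hypothetical
channel-based writer (overflow and error handling elided; this is not a patch
against our appenders):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch only: encode straight into a direct buffer and hand it to a
    // FileChannel, skipping the heap byte-array extraction that
    // FileOutputStream.write(byte[], int, int) requires.
    final class DirectChannelWriterSketch implements AutoCloseable {
        private final FileChannel channel;
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

        DirectChannelWriterSketch(Path path) throws IOException {
            this.channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void write(StringBuilder event) throws IOException {
            encoder.reset();
            // chars -> direct buffer; no intermediate heap ByteBuffer copy
            encoder.encode(CharBuffer.wrap(event), buffer, true);
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);   // direct buffers go to the channel as-is
            }
            buffer.clear();
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }
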
> > > >>>
> > > >>> -ck
> > > >>>
> > > >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > > >>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null to
> > > >>>> avoid the hardware's influence on the test results?
> > > >>>>
> > > >>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > >>>>
> > > >>>>> Did you add the no-op appender to Ceki’s project? Or are you using our
> > > >>>>> NullAppender? I have my doubts about using that NullAppender as it does
> > > >>>>> nothing - not even render the Layout - so it may get completely
> > > >>>>> optimized away.
> > > >>>>>
> > > >>>>> I’d still like to know what kind of disks Remko did his original
> > > >>>>> testing on. The page says he got 18.495 million messages per second.
> > > >>>>> Each message would be about 130 characters long from what I can tell.
> > > >>>>> That means the destination has to be able to support at least about
> > > >>>>> 2.4 GB per second.
> > > >>>>>
> > > >>>>> In Ceki’s test the message is only 23 characters long (vs 100 in
> > > >>>>> Remko’s). That means each log line is only going to be about 60
> > > >>>>> characters long. For me the async test is writing at about 1700
> > > >>>>> ops/ms, which at roughly 60 bytes per event equates to about 102 MB/s,
> > > >>>>> yet I am still seeing the queue backed up. So how we are writing must
> > > >>>>> be terribly inefficient. Logback is getting about 2000 ops/ms, which
> > > >>>>> is still only about 120 MB/s.
> > > >>>>>
> > > >>>>> Your test below would be generating about 197 MB/s.
> > > >>>>>
> > > >>>>> One thing I did notice that made a big difference is that Ceki has
> > > >>>>> Logback configured to use a buffer size of 256K while Log4j2 uses the
> > > >>>>> default of 8K. Setting Log4j2 to 256K considerably improved the
> > > >>>>> performance.
> > > >>>>>
> > > >>>>>
> > > >>>>> Ralph
> > > >>>>>
> > > >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > > >>>>>>
> > > >>>>>>> If we set the appender performance aside, am I right to conclude
> > > >>>>>>> that the entire complex async framework built atop Disruptor, with
> > > >>>>>>> all its bells and whistles, is yielding a diminishing performance
> > > >>>>>>> gain compared to a simple off-the-shelf blocking queue?
> > > >>>>>>
> > > >>>>>> I haven't seen any data that would give me such an impression -- the
> > > >>>>>> file appender itself is slower than the thread producing events, which
> > > >>>>>> appears to limit our throughput. Even then, log4j2 does much better
> > > >>>>>> than Logback in my testing on a Linux workstation with 20+ producer
> > > >>>>>> threads.
> > > >>>>>> Any time the queue is constantly full, performance will suffer (and
> > > >>>>>> this applies to standard blocking queues as well as the Disruptor;
> > > >>>>>> however, the Disruptor will fare worse when the queue is full).
> > > >>>>>> The results using a single thread are roughly reproducible on the
> > > >>>>>> non-async test, so I think the problem is somewhere between the layout
> > > >>>>>> and the appender/manager.
> > > >>>>>>
> > > >>>>>> Results running locally with no-op appender implementations:
> > > >>>>>>
> > > >>>>>> 1 thread:
> > > >>>>>>
> > > >>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > > >>>>>>
> > > >>>>>> 20 threads:
> > > >>>>>>
> > > >>>>>> Benchmark                                        Mode  Cnt     Score    Error   Units
> > > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> >
> > -ck
> >
> 
