Carter, thanks so much for your great effort and attention on this issue. *I
personally find this the single most important thing the project needs to
address with urgency right now.* Ceki is warming up for a new Logback (and
SLF4J?) release. I am pretty sure he will do his share of PR, particularly
by benchmarking his new work against competitors. Once these results are out
– regardless of whether they make sense or not – the internet will take them
as the ultimate truth, and it will require a tremendous effort to change
people's minds. In conclusion, please prioritize this appropriately and keep
us posted. Don't hesitate to let us/me know if there is anything (drinks,
coffee, lunch?) we can help with.

Regarding your remarks about Loom, I don't think I follow you there. Our
thread-locals are stateless, hence they should be perfectly fine when running
in virtual threads too. (I am totally against blindly using TLs, but that is
another discussion. I would prefer a recycler concept similar to the one we
have in JsonTemplateLayout – see the sketch below. I think there was already
a mailing list thread about this.)
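
What I have in mind is roughly the following shape (a simplified sketch, not
the actual JsonTemplateLayout interface; the queue-backed variant is just one
possible strategy and all names are made up):

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    interface Recycler<V> {
        V acquire();            // obtain a reusable instance
        void release(V value);  // hand it back once the event is serialized
    }

    // A queue-backed pool behaves the same for platform and virtual threads,
    // since nothing is pinned to a carrier thread the way a ThreadLocal is.
    final class QueueingRecycler<V> implements Recycler<V> {
        private final Supplier<V> supplier;
        private final Consumer<V> cleaner;
        private final Queue<V> pool = new ArrayBlockingQueue<>(64);

        QueueingRecycler(Supplier<V> supplier, Consumer<V> cleaner) {
            this.supplier = supplier;
            this.cleaner = cleaner;
        }

        @Override
        public V acquire() {
            V value = pool.poll();
            return value != null ? value : supplier.get();
        }

        @Override
        public void release(V value) {
            cleaner.accept(value);
            pool.offer(value); // if the pool is full, simply drop the instance
        }
    }

A layout would then use something like
new QueueingRecycler<>(StringBuilder::new, sb -> sb.setLength(0)) instead of
a ThreadLocal<StringBuilder>.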

On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote:

> Yes, I'm still looking into this.
>
> I agree that the zero-garbage code is difficult to follow, and Loom
> virtual threads will have less reuse so our thread-locals will create more
> overhead than they're worth in many cases. Fingers crossed for Valhalla to
> land before Loom, but I haven't been following updates from either project
> very closely lately, and Loom seems farther along.
>
> I think the issue with multiple buffers these days is that we have Optane
> and NVMe devices, which can be _nearly_ as fast as main memory, so copying
> contents can be problematic. In fully async mode (with some caveats around
> queue-full-policy configuration points and nested async appenders) we don't
> need locking because the appender interactions are single-threaded.
>
> For what it's worth, in my testing I didn't see much difference between
> the 8 kB buffer and the 256 kB buffer on a system with a relatively fast
> NVMe drive; however, I did discover and resolve LOG4J2-3150 (an incorrect
> default buffer size in the RandomAccessFileAppender).
>
> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > While that is true, it does mean that we are locking higher up in the
> > > call stack than we need to be.
> >
> > Ralph
> >
> > > On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote:
> > >
> > > I’m fairly certain the JIT will optimize out the locking operations on
> the nested synchronized calls after a while. I’m not sure how soon into the
> performance tests that would happen.
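> > >
> > > For example, once the JIT inlines a shape like this (a rough
> > > illustration, not actual log4j code), re-acquiring a monitor the thread
> > > already holds is cheap and can be coarsened into the outer region:
> > >
> > >   class Destination {
> > >       private final java.nio.ByteBuffer buffer = java.nio.ByteBuffer.allocate(8192);
> > >
> > >       synchronized void append(byte[] data) {
> > >           write(data);  // inlined by the JIT...
> > >       }
> > >
> > >       private synchronized void write(byte[] data) {
> > >           // ...so the nested lock on the monitor we already hold can be
> > >           // elided or folded into the outer synchronized region.
> > >           buffer.put(data);
> > >       }
> > >   }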
> > >
> > > Tim
> > >
> > >
> > >> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com>
> wrote:
> > >>
> > >> So are you continuing to look at this? Frankly, the work on the zero-GC
> > >> stuff made this very complicated. I am not sure I can even follow the
> > >> logic anymore. I also noticed that many of the methods dealing with byte
> > >> buffers are synchronized. I am not sure why, as only the method that
> > >> actually moves data really needs to be.
> > >>
> > >> Using multiple buffers would make sense if you are filling one from the
> > >> various patterns while it is unlocked, and then only blocking when
> > >> writing to the output stream buffer.
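> > >>
> > >> Something along these lines (just a sketch with made-up names, and it
> > >> assumes a single encoded event fits in the shared buffer):
> > >>
> > >>   final class DoubleBufferedWriter {
> > >>       private final java.io.OutputStream out;
> > >>       private final java.nio.ByteBuffer outputBuffer =
> > >>               java.nio.ByteBuffer.allocate(256 * 1024);
> > >>
> > >>       DoubleBufferedWriter(java.io.OutputStream out) {
> > >>           this.out = out;
> > >>       }
> > >>
> > >>       void append(byte[] encodedEvent) throws java.io.IOException {
> > >>           // The layout/pattern work that produced encodedEvent happened
> > >>           // without holding any lock; only the hand-off blocks.
> > >>           synchronized (this) {
> > >>               if (outputBuffer.remaining() < encodedEvent.length) {
> > >>                   flushLocked();
> > >>               }
> > >>               outputBuffer.put(encodedEvent);
> > >>           }
> > >>       }
> > >>
> > >>       private void flushLocked() throws java.io.IOException {
> > >>           out.write(outputBuffer.array(), 0, outputBuffer.position());
> > >>           outputBuffer.clear();
> > >>       }
> > >>   }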
> > >>
> > >> For what it's worth, Ceki updated his performance page again yesterday
> > >> to change the wording, but I am pretty sure he didn't change the log4j2
> > >> configuration to use a 256K buffer to match Logback. And the page still
> > >> makes it look like he is testing the performance of asynchronous
> > >> processing when it is really testing the performance of the file
> > >> appenders with a blocking queue in front of them.
> > >>
> > >> Ralph
> > >>
> > >>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote:
> > >>>
> > >>> I also tried that with similar results, which leads me to believe
> we're spending
> > >>> too much time copying between buffers.
> > >>>
> > >>> We've proven that log4j can get LogEvents to the appender very
> quickly and
> > >>> efficiently.
> > >>>
> > >>> Once we hit PatternLayout we write data to a StringBuilder.
> > >>> AbstractStringBuilder used to store a char array in Java 8; however,
> > >>> with JEP 254 (compact strings) it was updated (similarly to String) to
> > >>> use a byte array plus a byte representing the encoding.
> > >>> We encode the StringBuilder value into a ByteBufferDestination
> > >>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > >>> 1. Copy the StringBuilder value into a CharBuffer (this would have been
> > >>>    a direct copy from char[] to char[] in Java 8, but requires a
> > >>>    conversion from byte[]+coder now).
> > >>> 2. Encode data from the CharBuffer into the StringBuilderEncoder
> > >>>    ByteBuffer.
> > >>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the
> > >>>    ByteBufferDestination ByteBuffer (which is also always a heap buffer
> > >>>    as far as I can tell).
> > >>> 4. Write the ByteBufferDestination to the FileOutputStream /
> > >>>    RandomAccessFile by extracting the heap byte array.
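> > >>>
> > >>> In plain JDK terms the chain is roughly the following (simplified
> > >>> sketch using java.nio types; stringBuilder and fileOutputStream stand
> > >>> in for the layout's builder and the appender's stream, and the buffer
> > >>> sizes are made up):
> > >>>
> > >>>   CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
> > >>>   CharBuffer chars = CharBuffer.allocate(2048);
> > >>>   ByteBuffer encoderBuffer = ByteBuffer.allocate(8192);     // StringBuilderEncoder's buffer
> > >>>   ByteBuffer destinationBuffer = ByteBuffer.allocate(8192); // ByteBufferDestination's buffer
> > >>>
> > >>>   // 1. copy the StringBuilder contents into a CharBuffer
> > >>>   chars.clear();
> > >>>   chars.append(stringBuilder);
> > >>>   chars.flip();
> > >>>   // 2. encode the chars into the encoder-owned ByteBuffer
> > >>>   encoderBuffer.clear();
> > >>>   encoder.reset().encode(chars, encoderBuffer, true);
> > >>>   encoderBuffer.flip();
> > >>>   // 3. copy those bytes into the destination's ByteBuffer
> > >>>   destinationBuffer.put(encoderBuffer);
> > >>>   // 4. extract the heap array and hand it to the output stream
> > >>>   fileOutputStream.write(destinationBuffer.array(), 0, destinationBuffer.position());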
> > >>>
> > >>> We're copying bytes around a lot, and I think older JVMs would have
> > >>> optimized out a bit of the copying because the types exactly matched.
> > >>> It's unclear why we need to copy through two separate ByteBuffers to
> > >>> get from StringBuilderEncoder into ByteBufferDestination.
> > >>> I've generally had better I/O performance using direct byte buffers
> > >>> than heap buffers, although that experience largely comes from work on
> > >>> web servers; I haven't tested whether it holds true for file I/O. If we
> > >>> operate directly upon a FileChannel, we can write direct buffers
> > >>> straight to files. The DirectByteBuffer memory is guaranteed to be
> > >>> contiguous, so it should have better performance around copying and
> > >>> manipulation as well.
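> > >>>
> > >>> For example, something like this (sketch only, file name and buffer
> > >>> size made up):
> > >>>
> > >>>   FileChannel channel = FileChannel.open(Paths.get("app.log"),
> > >>>           StandardOpenOption.CREATE, StandardOpenOption.WRITE,
> > >>>           StandardOpenOption.APPEND);
> > >>>   ByteBuffer direct = ByteBuffer.allocateDirect(256 * 1024);
> > >>>   // ... encode straight into the direct buffer, then:
> > >>>   direct.flip();
> > >>>   channel.write(direct);  // no intermediate heap array extraction
> > >>>   direct.clear();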
> > >>>
> > >>> -ck
> > >>>
> > >>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > >>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null
> > >>>> to avoid the hardware's influence on the test results?
> > >>>>
> > >>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <
> ralph.go...@dslextreme.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Did you add the no-op appender to Ceki’s project? Or are you using
> > >>>>> our NullAppender? I have my doubts about using that NullAppender, as
> > >>>>> it does nothing - not even render the Layout - so it may get
> > >>>>> completely optimized away.
> > >>>>>
> > >>>>> I’d still like to know what kind of disks Remko did his original
> > >>>>> testing on. The page says he got 18.495 million messages per second.
> > >>>>> Each message would be about 130 characters long from what I can tell.
> > >>>>> That means the destination has to be able to sustain at least about
> > >>>>> 2.4 GB per second.
> > >>>>>
> > >>>>> In Ceki’s test the message is only 23 characters long (vs 100 in
> > >>>>> Remko’s). That means each formatted message is only going to be about
> > >>>>> 60 characters long. For me the async test is writing at about 1700
> > >>>>> ops/ms, which equates to 102 MB/s, yet I am still seeing the queue
> > >>>>> backed up. So how we are writing must be terribly inefficient.
> > >>>>> Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
> > >>>>>
> > >>>>> Your test below would be generating about 197 MB/s.
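> > >>>>> (For reference, the arithmetic: 1700 ops/ms x ~60 bytes per message
> > >>>>> is roughly 102,000 bytes/ms, i.e. ~102 MB/s; likewise 2000 ops/ms is
> > >>>>> ~120 MB/s, and the 3285 ops/ms below works out to ~197 MB/s.)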
> > >>>>>
> > >>>>> One thing I did notice that made a big difference is that Ceki has
> Logback
> > >>>>> configured to
> > >>>>> use a buffer size of 256K while Log4j2 uses the default of 8K.
> Setting
> > >>>>> Log4j2 to 256K
> > >>>>> considerably improved the performance.
> > >>>>>
> > >>>>>
> > >>>>> Ralph
> > >>>>>
> > >>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net>
> wrote:
> > >>>>>>
> > >>>>>>> If we set the appender performance aside, am I right to conclude
> > >>>>>>> that the entire complex async framework built atop the Disruptor,
> > >>>>>>> with all its bells and whistles, is yielding a diminishing
> > >>>>>>> performance gain compared to a simple off-the-shelf blocking queue?
> > >>>>>>
> > >>>>>> I haven't seen any data that would give me such an impression --
> > >>>>>> the file appender itself is slower than the thread producing events,
> > >>>>>> which appears to limit our throughput.
> > >>>>>> Even then, log4j2 does much better than logback in my testing on a
> > >>>>>> Linux workstation with 20+ producer threads.
> > >>>>>> Any time the queue is constantly full, performance will suffer (and
> > >>>>>> this applies to standard blocking queues as well as the disruptor;
> > >>>>>> however, the disruptor will fare worse when the queue is full).
> > >>>>>> The results using a single thread are roughly reproducible on the
> > >>>>>> non-async test, so I think the problem is somewhere between the
> > >>>>>> layout and the appender/manager.
> > >>>>>>
> > >>>>>> Results running locally with no-op appender implementations:
> > >>>>>>
> > >>>>>> 1 thread:
> > >>>>>>
> > >>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > >>>>>>
> > >>>>>> 20 threads:
> > >>>>>>
> > >>>>>> Benchmark                                        Mode  Cnt     Score    Error   Units
> > >>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > >>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>
> > >>
> > >
> >
> >
> >
>
> -ck
>
