I've pushed some changes I've been using to validate performance here: 
https://github.com/carterkozak/logback-perf/pull/new/ckozak/sandbox

Using the Linux perf-norm profiler, the direct (garbage-free) encoders on my 
machine use about 5000 instructions per operation, while the byte-array method 
uses about 3500 instructions per operation.

The allocations appear to be small and short-lived enough to fit nicely into 
the TLAB. If I disable TLAB with '-XX:-UseTLAB', the garbage-free 
implementation isn't impacted quite as heavily as the byte-array encoder.
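
For anyone who wants to reproduce these numbers: the perfnorm profiler can be
enabled from a JMH runner roughly as below (the include pattern is
illustrative, and the '-XX:-UseTLAB' line is only for the TLAB experiment;
drop it for the baseline run). The same thing works from the command line via
'-prof perfnorm'.

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class PerfNormRunner {
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include("FileAppenderBenchmark") // illustrative benchmark pattern
                    .addProfiler("perfnorm")          // perf counters normalized per operation
                    .jvmArgsAppend("-XX:-UseTLAB")    // only for the TLAB comparison above
                    .forks(1)
                    .build();
            new Runner(options).run();
        }
    }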

Interestingly, performance suffers dramatically when I introduce Unicode 
(non-ASCII) characters into the log line. In that case the garbage-free 
implementation appears to fare better than the byte-array implementation by 
about 5%. I need to dig in a bit more here and re-validate the benchmark 
results.
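
One way to exercise the non-ASCII path is to parameterize the benchmark
message; the class below only illustrates the approach and is not the
benchmark in the branch:

    import java.nio.charset.StandardCharsets;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class MessageEncodingBenchmark {
        // The second value forces the multi-byte UTF-8 encoding path.
        @Param({"This is a debug message", "This is a debug message \u00e9\u00fc\u4e2d"})
        public String message;

        @Benchmark
        public int encodeUtf8() {
            return message.getBytes(StandardCharsets.UTF_8).length;
        }
    }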

On Sat, Aug 28, 2021, at 17:58, Remko Popma wrote:
> On Sat, Aug 28, 2021 at 3:16 PM Ron Grabowski
> <[email protected]> wrote:
> 
> >  Follow-up to "Formatting the date is expensive. The process of actually
> > formatting a value is reasonable". Is this still an issue from LOG4J2-930:
> > %m %ex%n:    1,755,573 msg/sec
> > %d %m %ex%n: 1,194,184 msg/sec
> >
> 
> No, I believe that formatting the date is no longer the bottleneck.
> The analysis done in LOG4J2-930 led to
> https://issues.apache.org/jira/browse/LOG4J2-1097 which resulted in the
> FixedDateFormat.
> This format gives a good trade-off between speed and flexibility.
> The drawback is that it only works for some fixed formats, but those are
> the most widely used formats.
> I don't think that focusing on date formatting and pre-generating formatted
> timestamps will be fruitful (but I could be wrong).
> 
> Avoiding the PreciseTime Instant and using System.currentTimeMillis when it
> is known that none of the downstream formatters require sub-millisecond
> precision may be a good optimization.
> 
> The message formatting (PatternLayout) as a whole is expensive somehow;
> there may be more to optimize there.
> 
> 
> > If so, isn't date rendering essentially a sequence we can generate ahead
> > of time similar to how a local ID generator asks for an allocation of keys
> > then uses that to quickly assign IDs to new objects? When its time to
> > render %d we can just grab it via an index:
> >
> > 1) At startup, calculate the next 32k formatted dates. If
> > Clock.currentTimeMillis() were configured down to the second, 32,000 seconds
> > would pre-allocate %d for the next 8 hours.
> > 2) Apply math to Clock.currentTimeMillis() to get an index into the buffer.
> > Seconds precision:
> > [10] = "2021-08-28 09:44:31,000"
> > [11] = "2021-08-28 09:44:32,000"
> > [12] = "2021-08-28 09:44:33,000"
> > [13] = "2021-08-28 09:44:34,000"
> > [14] = "2021-08-28 09:44:35,000"
> > [15] = "2021-08-28 09:44:36,000"
> > ...
> > [31999] = "..."
> > 50ms precision:
> > [10] = "2021-08-28 09:44:31,050"
> > [11] = "2021-08-28 09:44:31,075"
> > [12] = "2021-08-28 09:44:31,100"
> > [13] = "2021-08-28 09:44:31,150"
> > [14] = "2021-08-28 09:44:31,175"
> > [15] = "2021-08-28 09:44:31,200"
> > ...
> > [31999] = "..."
> >
> > 3) Rendering %d{SEQ(DEFAULT,32000)} is just an index lookup into the
> > sequence of 32,000 pre-calculated %d{DEFAULT} values, without the cost of
> > formatting. I made up the "SEQ" notation; there's likely a better way to
> > express the feature. Everything can read from the buffer without locking.
> > 4) Have a background thread casually keep the sequence filled in a ring, so
> > dates in the past are replaced with future dates and the structure consumes
> > a consistent amount of memory.
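
To make the proposal concrete, here is a rough sketch of the ring lookup
described in items 1-4 (names, sizes, formats, and the fallback are
illustrative; nothing like this exists in Log4j today):

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;

    // Pre-formats a window of %d values so the hot path is arithmetic plus an
    // array read. A background thread would keep the ring topped up.
    final class PreformattedDateRing {
        private static final DateTimeFormatter FORMAT = DateTimeFormatter
                .ofPattern("yyyy-MM-dd HH:mm:ss,SSS").withZone(ZoneId.systemDefault());

        private final long precisionMillis; // e.g. 1000 for one-second precision
        private final String[] ring;        // e.g. 32_000 slots
        private final long baseMillis;      // timestamp of slot 0, aligned to the precision

        PreformattedDateRing(int slots, long precisionMillis, long nowMillis) {
            this.precisionMillis = precisionMillis;
            this.ring = new String[slots];
            this.baseMillis = (nowMillis / precisionMillis) * precisionMillis;
            for (int i = 0; i < slots; i++) {
                ring[i] = FORMAT.format(Instant.ofEpochMilli(baseMillis + i * precisionMillis));
            }
        }

        /** The %d lookup: no formatting on the hot path while inside the window. */
        String format(long currentTimeMillis) {
            long slot = (currentTimeMillis - baseMillis) / precisionMillis;
            if (slot < 0 || slot >= ring.length) {
                // Outside the pre-generated window; fall back to formatting directly.
                return FORMAT.format(Instant.ofEpochMilli(currentTimeMillis));
            }
            return ring[(int) slot];
        }
    }
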
> >     On Friday, August 27, 2021, 10:07:59 PM EDT, Carter Kozak <
> > [email protected]> wrote:
> >
> >  Thanks, Remko. The default '%d' uses FixedDateFormat with
> > FixedFormat.DEFAULT. The FastDateFormat alternative does not support
> > microseconds, so it doesn't suffer from the same problem. I think I can
> > substantially reduce the frequency we re-format dates by checking
> > FixedFormat.secondFractionDigits to determine if we need to compare
> > microseconds.
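
The shape of that optimization, sketched here for illustration rather than as
the actual FixedDateFormat internals: when the configured format shows nothing
below milliseconds, the cached text only needs to be rebuilt when the
millisecond value itself changes, no matter how often the micro/nanosecond
part of the instant moves.

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;

    // Illustrative caching wrapper, not the real FixedDateFormat code.
    final class MillisCachingFormatter {
        private static final DateTimeFormatter FORMAT = DateTimeFormatter
                .ofPattern("yyyy-MM-dd HH:mm:ss,SSS").withZone(ZoneId.systemDefault());

        private long cachedMillis = Long.MIN_VALUE;
        private String cachedText;

        String format(Instant instant) {
            long millis = instant.toEpochMilli();
            if (millis != cachedMillis) { // only re-format when the visible part changed
                cachedMillis = millis;
                cachedText = FORMAT.format(instant);
            }
            return cachedText;
        }
    }
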
> >
> > On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote:
> > > I remember looking at PatternLayout performance, I reported my findings
> > here, hopefully they’re still useful:
> > https://issues.apache.org/jira/browse/LOG4J2-930
> > >
> > > If %d is used in the pattern, does the FixedDateFormat get used?
> > >
> > >
> > >
> > >
> > > > On Aug 28, 2021, at 4:33, Ralph Goers <[email protected]>
> > wrote:
> > > >
> > > > All of that agrees with my observations as well.
> > > >
> > > > Ralph
> > > >
> > > >> On Aug 27, 2021, at 12:23 PM, Carter Kozak <[email protected]> wrote:
> > > >>
> > > >> I've identified a few things that seem impactful, but bear in mind
> > that I haven't begun to drill down into them yet. I plan to file individual
> > tickets and investigate in greater depth later on:
> > > >>
> > > >> 1. Formatting the date is expensive. The process of actually
> > formatting a value is reasonable; however, using a precise clock appears to
> > cause cache misses even when the pattern results in the same value
> > (microseconds/nanoseconds aren't included in the result). Using the
> > SystemMillisClock improves our throughput. I imagine some part of that
> > improvement comes from currentTimeMillis() returning a primitive long rather
> > than Clock.instant() allocating an Instant.
> > > >>
> > > >> 2. Setting 'log4j2.enable.direct.encoders' to false improves
> > performance; otherwise we see a lot of time spent in the
> > StringBuilderEncoder, but I haven't had time to investigate that yet.
> > Without direct encoders we allocate a great deal more byte arrays, but
> > they're used within a narrow enough scope that C2 may be able to optimize
> > them away.
> > > >>
> > > >> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern
> > '%m{nolookup}') saves us some time scanning through produced messages --
> > I've had lookups disabled at work for a while.
> > > >>
> > > >> 4. [in progress] If I implement the full layout directly in Java
> > (writing directly to a StringBuilder based on the event in a custom Layout
> > instead of using PatternLayout), combined with [1] and [2] above,
> > performance is much better than Logback; however, using PatternLayout with
> > the steps above we fall just a bit short. I still need to read through
> > PatternLayout and the individual pattern converters to understand why, but
> > there's definitely some headroom hiding somewhere that we'll be able to
> > reclaim.
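
To make item 4 concrete, the hand-rolled experiment is shaped roughly like the
sketch below (illustrative names, not the code in my branch). The commented -D
flags correspond to items 1-3; please double-check the exact property names
against the Log4j version in use.

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;
    import org.apache.logging.log4j.core.LogEvent;

    // Illustrative system properties for items 1-3 (verify names for your version):
    //   -Dlog4j.Clock=SystemMillisClock
    //   -Dlog4j2.enable.direct.encoders=false
    //   -Dlog4j2.formatMsgNoLookups=true   (or use %m{nolookup} in the pattern)
    final class HandRolledLineFormatter {
        private static final DateTimeFormatter DATE = DateTimeFormatter
                .ofPattern("yyyy-MM-dd HH:mm:ss,SSS").withZone(ZoneId.systemDefault());

        /** Appends one log line to the buffer without going through PatternLayout converters. */
        static void formatTo(LogEvent event, StringBuilder buffer) {
            DATE.formatTo(Instant.ofEpochMilli(event.getTimeMillis()), buffer);
            buffer.append(' ')
                  .append(event.getLevel())
                  .append(' ')
                  .append(event.getLoggerName())
                  .append(" - ")
                  .append(event.getMessage().getFormattedMessage())
                  .append(System.lineSeparator());
        }
    }
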
> > > >>
> > > >> -ck
> > > >>
> > > >>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
> > > >>> My pleasure, I enjoy digging into this sort of performance
> > comparison, the
> > > >>> toughest piece right now is balancing investigation with active
> > projects at
> > > >>> work. I completely agree about the importance of getting this
> > resolved
> > > >>> quickly and I intend to continue investigating.
> > > >>>
> > > >>> Re loom:
> > > >>> Currently we have a few places which reuse shared objects,
> > potentially large
> > > >>> ones (thinking StringBuilder instances on
> > ReusableLogEvent/StringLayout/etc)
> > > >>> which are currently bounded to the number of application threads you
> > have
> > > >>> (which is relatively small compared to the loom claims which are
> > admittedly
> > > >>> large). More importantly, today we tend to reuse almost every thread
> > in a
> > > >>> ThreadPoolExecutor of some kind where the model in loom involves
> > creating
> > > >>> a new virtual thread for each task, meaning using a thread-local as
> > a cache
> > > >>> will have a much lower hit rate as reuse will only occur within a
> > single task,
> > > >>> and only if ThreadLocals are enabled. I hope that made sense.
> > > >>>
> > > >>> -ck
> > > >>>
> > > >>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> > > >>>> Carter, thanks so much for your great effort and attention on this
> > issue. *I
> > > >>>> personally find this the uttermost important thing the project
> > needs to
> > > >>>> address with urgency right now.* Ceki is warming up for a new
> > Logback (and
> > > >>>> SLF4J?) release. I am pretty sure he will do his part in PR,
> > particularly
> > > >>>> in the context of benchmarking his new work with competitors. Once
> > these
> > > >>>> results are out – independent of whether they make sense or not –
> > > >>>> "internet" will take this as the ultimate truth and it will require
> > a
> > > >>>> tremendous effort to change people's mind. In conclusion, please
> > prioritize
> > > >>>> this appropriately and keep us posted. Don't hesitate to let us/me
> > know if
> > > >>>> there is anything (drinks, coffee, lunch?) we can help with.
> > > >>>>
> > > >>>> Regarding your remarks about Loom, I don't think I follow you
> > there. Our
> > > >>>> thread-locals are stateless hence should be perfectly fine while
> > running in
> > > >>>> virtual threads too. (I am totally against blindly using TLs, but
> > that is
> > > >>>> another discussion. I would rather prefer a recycler concept
> > similar to the
> > > >>>> one we have in JsonTemplateLayout. I think there was already a
> > mailing list
> > > >>>> thread about this.)
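
For anyone who hasn't seen it, the recycler idea is roughly the following --
this is a minimal sketch, not JsonTemplateLayout's actual implementation. The
point is that the footprint is bounded by the pool capacity rather than by the
number of threads, so it behaves the same for platform and virtual threads.

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    // Usage: new Recycler<>(64, StringBuilder::new, sb -> sb.setLength(0))
    final class Recycler<T> {
        private final Queue<T> pool;
        private final Supplier<T> factory;
        private final Consumer<T> cleaner;

        Recycler(int capacity, Supplier<T> factory, Consumer<T> cleaner) {
            this.pool = new ArrayBlockingQueue<>(capacity);
            this.factory = factory;
            this.cleaner = cleaner;
        }

        T acquire() {
            T instance = pool.poll();
            return instance != null ? instance : factory.get();
        }

        void release(T instance) {
            cleaner.accept(instance); // reset state, e.g. truncate the StringBuilder
            pool.offer(instance);     // silently dropped if the pool is already full
        }
    }
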
> > > >>>>
> > > >>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <[email protected]>
> > wrote:
> > > >>>>
> > > >>>>> Yes, I'm still looking into this.
> > > >>>>>
> > > >>>>> I agree that the zero-garbage code is difficult to follow, and Loom
> > > >>>>> virtual threads will have less reuse so our thread-locals will
> > create more
> > > >>>>> overhead than they're worth in many cases. Fingers crossed for
> > Valhalla to
> > > >>>>> land before Loom, but I haven't been following updates from either
> > project
> > > >>>>> very closely lately, and Loom seems farther along.
> > > >>>>>
> > > >>>>> I think the issue with multiple buffers these days is that we have
> > optane
> > > >>>>> and nvme devices which can be _nearly_ as fast as main memory, so
> > copying
> > > >>>>> contents can be problematic. In fully async mode (with some
> > caveats around
> > > >>>>> queue-full-policy configuration points and nested async appenders)
> > we don't
> > > >>>>> need locking because the appender interactions are single-threaded.
> > > >>>>>
> > > >>>>> For what it's worth, in my testing I didn't see much difference
> > between
> > > >>>>> the 8 kB buffer and 256 kB buffer on a system with a relatively
> > fast nvme
> > > >>>>> drive, however I did discover and resolve LOG4J2-3150 (incorrect
> > default
> > > >>>>> buffer size in the RandomAccessFileAppender).
> > > >>>>>
> > > >>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > >>>>>> While that is true it does mean that we are locking higher up in
> > the
> > > >>>>> call stack than we need to be.
> > > >>>>>>
> > > >>>>>> Ralph
> > > >>>>>>
> > > >>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]>
> > wrote:
> > > >>>>>>>
> > > >>>>>>> I’m fairly certain the JIT will optimize out the locking
> > operations on
> > > >>>>> the nested synchronized calls after a while. I’m not sure how soon
> > into the
> > > >>>>> performance tests that would happen.
> > > >>>>>>>
> > > >>>>>>> Tim
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <
> > [email protected]>
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>> So are you continuing to look at this?  Frankly, the work on
> > zero-gc
> > > >>>>> stuff made this
> > > >>>>>>>> very complicated. I am not sure I can even follow the logic
> > anymore.
> > > >>>>> I also noticed
> > > >>>>>>>> that many of the methods dealing with byte buffers are
> > synchronized.
> > > >>>>> I am not sure
> > > >>>>>>>> why as only the method that moves data really needs to be.
> > > >>>>>>>>
> > > >>>>>>>> Using multiple buffers would make sense if you are filling one
> > from
> > > >>>>> the various patterns
> > > >>>>>>>> while it is unlocked and then only blocking when writing to the
> > > >>>>> output stream buffer.
> > > >>>>>>>>
> > > >>>>>>>> For what it's worth, Ceki updated his performance page again
> > yesterday
> > > >>>>> to change
> > > >>>>>>>> the wording but I am pretty sure he didn’t change the log4j2
> > > >>>>> configuration to use a
> > > >>>>>>>> 256K buffer to match Logback.  And the page still makes it look
> > like
> > > >>>>> he is testing
> > > >>>>>>>> the performance of asynchronous processing when it is really
> > testing
> > > >>>>> the performance
> > > >>>>>>>> of the file appenders with a blocking queue in front of them.
> > > >>>>>>>>
> > > >>>>>>>> Ralph
> > > >>>>>>>>
> > > >>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]>
> > wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> I also tried that with similar results, which leads me to
> > believe
> > > >>>>> we're spending
> > > >>>>>>>>> too much time copying between buffers.
> > > >>>>>>>>>
> > > >>>>>>>>> We've proven that log4j can get LogEvents to the appender very
> > > >>>>> quickly and
> > > >>>>>>>>> efficiently.
> > > >>>>>>>>>
> > > >>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder.
> > > >>>>> AbstractStringBuilder used to store a char array in Java 8; with
> > > >>>>> JEP 254 (compact strings) it was updated, similarly to String, to use
> > > >>>>> a byte array plus a byte representing the encoding.
> > > >>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination
> > > >>>>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > > >>>>>>>>> 1. copy the StringBuilder value into a CharBuffer (this would have
> > > >>>>> been a direct copy from char[] to char[] in Java 8, but requires
> > > >>>>> conversion from byte[]+coder now)
> > > >>>>>>>>> 2. encode data from the CharBuffer to the StringBuilderEncoder
> > > >>>>> ByteBuffer
> > > >>>>>>>>> 3. copy contents of the StringBuilderEncoder ByteBuffer to the
> > > >>>>> ByteBufferDestination ByteBuffer (which is also always a heap buffer
> > > >>>>> as far as I can tell)
> > > >>>>>>>>> 4. Write the ByteBufferDestination to the
> > > >>>>> FileOutputStream/RandomAccessFile by extracting the heap byte-array.
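
For reference, steps 1-3 could in principle collapse into a single encode from
the StringBuilder straight into the destination buffer, roughly like the
sketch below (it ignores the drain-and-retry loop that real code needs when
the destination buffer overflows):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    final class DirectStringBuilderEncode {
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

        void encode(StringBuilder text, ByteBuffer destination) {
            CharBuffer chars = CharBuffer.wrap(text); // no char[] copy; StringBuilder is a CharSequence
            encoder.reset();
            CoderResult result = encoder.encode(chars, destination, true);
            if (result.isOverflow()) {
                // Real code would flush 'destination' downstream and keep encoding.
                throw new IllegalStateException("destination buffer too small for this sketch");
            }
            encoder.flush(destination);
        }
    }
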
> > > >>>>>>>>>
> > > >>>>>>>>> We're copying bytes around a lot, and I think older JVMs would
> > > >>>>> have optimized out a bit of the copying because the types exactly
> > > >>>>> matched.
> > > >>>>>>>>> It's unclear why we need to copy through two separate ByteBuffers
> > > >>>>> from StringBuilderEncoder into ByteBufferDestination.
> > > >>>>>>>>> I've generally had better I/O performance using direct
> > > >>>>> byte-buffers than heap buffers, although that experience largely
> > > >>>>> comes from work on webservers; I haven't tested whether it holds true
> > > >>>>> for file I/O. If we operate directly upon a FileChannel, we can write
> > > >>>>> direct buffers directly to files. The DirectByteBuffer memory is
> > > >>>>> guaranteed to be contiguous, so it should have better performance
> > > >>>>> around copying and manipulation as well.
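
A minimal sketch of the FileChannel-plus-direct-buffer path, for illustration
only (this is not an existing appender; the buffer size and file name are
arbitrary):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class DirectBufferWrite {
        public static void main(String[] args) throws IOException {
            ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
            buffer.put("2021-08-28 09:44:31,000 INFO example - hello\n".getBytes(StandardCharsets.UTF_8));
            buffer.flip();
            try (FileChannel channel = FileChannel.open(Path.of("app.log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                while (buffer.hasRemaining()) {
                    channel.write(buffer); // no copy into a temporary native buffer for direct buffers
                }
            }
        }
    }
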
> > > >>>>>>>>>
> > > >>>>>>>>> -ck
> > > >>>>>>>>>
> > > >>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > > >>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to
> > /dev/null
> > > >>>>> to avoid
> > > >>>>>>>>>> hardware's influence in the test results?
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <
> > > >>>>> [email protected]>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Did you add the no-op appender to Ceki’s project? Or are you
> > using
> > > >>>>> our
> > > >>>>>>>>>>> NullAppender? I have
> > > >>>>>>>>>>> my doubts about using that NullAppender as it does nothing -
> > not
> > > >>>>> even
> > > >>>>>>>>>>> render the Layout, so
> > > >>>>>>>>>>> it may get completely optimized away.
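
For benchmarking, something along these lines keeps the layout cost inside the
measurement while discarding the output. This is illustrative only -- it is
not the NullAppender that ships with Log4j, and a real appender would still
need to be started and wired into a configuration.

    import java.io.Serializable;
    import org.apache.logging.log4j.core.Layout;
    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.appender.AbstractAppender;
    import org.apache.logging.log4j.core.config.Property;

    final class DiscardingAppender extends AbstractAppender {
        private volatile int sink; // consumed so the encode work cannot be dead-code eliminated

        DiscardingAppender(String name, Layout<? extends Serializable> layout) {
            super(name, null, layout, true, Property.EMPTY_ARRAY); // no filter, ignore exceptions
        }

        @Override
        public void append(LogEvent event) {
            sink = getLayout().toByteArray(event).length; // render the layout, then drop the bytes
        }
    }
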
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I’d still like to know what kind of disks Remko did his
> > original
> > > >>>>> testing.
> > > >>>>>>>>>>> The page says he got
> > > >>>>>>>>>>> 18.495 million messages per second. Each message would be
> > about 130
> > > >>>>>>>>>>> characters long
> > > >>>>>>>>>>> from what I can tell. That means the destination has to be able to
> > > >>>>> support
> > > >>>>>>>>>>> about at least 2.4 GB
> > > >>>>>>>>>>> per second.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In Ceki’s test the message is only 23 characters long (vs
> > 100 in
> > > >>>>> Remko’s).
> > > >>>>>>>>>>> That means the
> > > >>>>>>>>>>> message size is only going to be about 60 characters long.
> > For me
> > > >>>>> the
> > > >>>>>>>>>>> async test is writing
> > > >>>>>>>>>>> at about 1700 ops/ms which equates to 102 MBs, yet I am still
> > > >>>>> seeing the
> > > >>>>>>>>>>> queue backed up.
> > > >>>>>>>>>>> So how we are writing must be terribly inefficient. Logback
> > is
> > > >>>>> getting
> > > >>>>>>>>>>> about 2000 ops/ms,
> > > >>>>>>>>>>> which is still only 120 MBs.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Your test below would be generating about 197 MB/s.
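
(For reference, the arithmetic behind those figures: 1,700 ops/ms is 1.7
million messages/second, which at roughly 60 bytes per message is about 102
MB/s; Logback's 2,000 ops/ms works out to about 120 MB/s; and 18.495 million
messages/second at ~130 bytes each is where the ~2.4 GB/s figure above comes
from.)
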
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> One thing I did notice that made a big difference is that
> > Ceki has
> > > >>>>> Logback
> > > >>>>>>>>>>> configured to
> > > >>>>>>>>>>> use a buffer size of 256K while Log4j2 uses the default of
> > 8K.
> > > >>>>> Setting
> > > >>>>>>>>>>> Log4j2 to 256K
> > > >>>>>>>>>>> considerably improved the performance.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Ralph
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <
> > [email protected]>
> > > >>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> If we would move the appender performance aside, am I
> > right to
> > > >>>>> conclude
> > > >>>>>>>>>>>>> that the entire complex async. framework built atop
> > Disruptor
> > > >>>>> with all
> > > >>>>>>>>>>> its
> > > >>>>>>>>>>>>> whistles and bells is yielding a diminishing performance
> > gain
> > > >>>>> compared
> > > >>>>>>>>>>> to a
> > > >>>>>>>>>>>>> simple off the shelf blocking queue?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I haven't seen any data that would give me such an
> > impression --
> > > >>>>> the
> > > >>>>>>>>>>> file appender
> > > >>>>>>>>>>>> itself is slower than the thread producing events, which
> > appears
> > > >>>>> to
> > > >>>>>>>>>>> limit our throughput.
> > > >>>>>>>>>>>> Even then, log4j2 does much better in my testing on a linux
> > > >>>>> workstation
> > > >>>>>>>>>>> with 20+ producer
> > > >>>>>>>>>>>> threads compared to logback.
> > > >>>>>>>>>>>> Any time the queue is constantly full, performance will
> > suffer
> > > >>>>> (and this
> > > >>>>>>>>>>> applies to standard
> > > >>>>>>>>>>>> blocking queues as well as disruptor, however disruptor
> > will fare
> > > >>>>> worse
> > > >>>>>>>>>>> when the queue
> > > >>>>>>>>>>>> is full).
> > > >>>>>>>>>>>> The results using a single thread are roughly reproducible
> > on the
> > > >>>>>>>>>>> non-async test, so I think
> > > >>>>>>>>>>>> the problem is somewhere between the layout, and
> > appender/manager.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Results running locally with no-op appender implementations:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 1 thread:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 20 threads:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>> -ck
> > > >>>>>
> > > >>>>
> > > >>>
> > > >
> > > >
> > >
> >
> > -ck
> >
> 

-ck
