I've pushed some changes I've been using to validate performance here: https://github.com/carterkozak/logback-perf/pull/new/ckozak/sandbox
Using the Linux perfnorm profiler, the direct (garbage-free) encoders on my machine use about 5000 instructions per operation, while the byte-array method uses about 3500 instructions per operation. The allocations appear to be small and short-lived enough to fit nicely into TLAB; if I disable TLAB with '-XX:-UseTLAB', the garbage-free implementation isn't impacted quite as heavily as the byte-array encoder. Interestingly, performance suffers dramatically when I introduce Unicode characters into the log line. It looks like the garbage-free implementation fares better than the byte-array implementation by about 5%. I need to dig in a bit more here and re-validate the benchmarking results.

On Sat, Aug 28, 2021, at 17:58, Remko Popma wrote:
> On Sat, Aug 28, 2021 at 3:16 PM Ron Grabowski <[email protected]> wrote:
> > Follow-up to "Formatting the date is expensive. The process of actually formatting a value is reasonable". Is this still an issue from LOG4J2-930:
> > %m %ex%n: 1,755,573 msg/sec
> > %d %m %ex%n: 1,194,184 msg/sec
>
> No, I believe that formatting the date is no longer the bottleneck. The analysis done in LOG4J2-930 led to https://issues.apache.org/jira/browse/LOG4J2-1097, which resulted in the FixedDateFormat. This format gives a good trade-off between speed and flexibility. The drawback is that it only works for some fixed formats, but those are the most widely used formats. I don't think that focusing on date formatting and pre-generating formatted timestamps will be fruitful (but I could be wrong).
>
> Avoiding the PreciseTime Instant and using System.currentTimeMillis when it is known that none of the downstream formatters require sub-millisecond precision may be a good optimization.
>
> The message formatting (PatternLayout) as a whole is somehow expensive; there may be more to optimize there.
>
> > If so, isn't date rendering essentially a sequence we can generate ahead of time, similar to how a local ID generator asks for an allocation of keys and then uses that to quickly assign IDs to new objects? When it's time to render %d we can just grab it via an index:
> >
> > 1) At startup, calculate the next 32k formatted dates. If Clock.currentTimeMillis() were configured down to the second, 32000 seconds would pre-allocate %d for roughly the next 9 hours.
> > 2) Apply math to Clock.currentTimeMillis() to get an index into the buffer.
> > Seconds precision:
> > [10] = "2021-08-28 09:44:31,000"
> > [11] = "2021-08-28 09:44:32,000"
> > [12] = "2021-08-28 09:44:33,000"
> > [13] = "2021-08-28 09:44:34,000"
> > [14] = "2021-08-28 09:44:35,000"
> > [15] = "2021-08-28 09:44:36,000"
> > ...
> > [31999] = "..."
> > 50ms precision:
> > [10] = "2021-08-28 09:44:31,050"
> > [11] = "2021-08-28 09:44:31,100"
> > [12] = "2021-08-28 09:44:31,150"
> > [13] = "2021-08-28 09:44:31,200"
> > [14] = "2021-08-28 09:44:31,250"
> > [15] = "2021-08-28 09:44:31,300"
> > ...
> > [31999] = "..."
> > 3) Rendering %d{SEQ(DEFAULT,32000)} is just an index lookup into the sequence of 32000 pre-calculated %d{DEFAULT} values, without the cost of formatting. I made up the "SEQ" notation; there's likely a better way to express the feature. Everything can read from the buffer without locking.
> > 4) Have a background thread casually keep the sequence filled in a ring, so that dates in the past are replaced with future dates and the structure consumes a consistent amount of memory.
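A minimal sketch of the pre-formatted '%d' sequence described above, at one-second precision. The class below is hypothetical (not existing Log4j code); a real implementation would need incremental refill from a background thread and safe publication rather than the naive rewrite shown here:

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;

    // Hypothetical sketch of the "%d{SEQ}" idea: one pre-formatted entry per second.
    final class PreformattedDateRing {
        private static final DateTimeFormatter DEFAULT_FORMAT =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS").withZone(ZoneId.systemDefault());

        private final String[] ring;
        private volatile long baseEpochSecond; // first second currently covered by the ring

        PreformattedDateRing(int size) {
            this.ring = new String[size];
            refill(Instant.now().getEpochSecond());
        }

        // O(1) lookup on the hot path; falls back to real formatting if the clock runs past the ring.
        String format(long epochMillis) {
            long offset = epochMillis / 1000 - baseEpochSecond;
            if (offset < 0 || offset >= ring.length) {
                return DEFAULT_FORMAT.format(Instant.ofEpochMilli(epochMillis)); // slow path
            }
            return ring[(int) offset];
        }

        // A background thread would call this periodically to keep future entries populated.
        void refill(long fromEpochSecond) {
            for (int i = 0; i < ring.length; i++) {
                ring[i] = DEFAULT_FORMAT.format(Instant.ofEpochSecond(fromEpochSecond + i));
            }
            baseEpochSecond = fromEpochSecond;
        }
    }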
> > On Friday, August 27, 2021, 10:07:59 PM EDT, Carter Kozak <[email protected]> wrote:
> >
> > Thanks, Remko. The default '%d' uses FixedDateFormat with FixedFormat.DEFAULT. The FastDateFormat alternative does not support microseconds, so it doesn't suffer from the same problem. I think I can substantially reduce how often we re-format dates by checking FixedFormat.secondFractionDigits to determine whether we need to compare microseconds.
> >
> > On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote:
> > > I remember looking at PatternLayout performance; I reported my findings here, hopefully they're still useful: https://issues.apache.org/jira/browse/LOG4J2-930
> > >
> > > If %d is used in the pattern, does the FixedDateFormat get used?
> > >
> > > > On Aug 28, 2021, at 4:33, Ralph Goers <[email protected]> wrote:
> > > >
> > > > All of that agrees with my observations as well.
> > > >
> > > > Ralph
> > > >
> > > >> On Aug 27, 2021, at 12:23 PM, Carter Kozak <[email protected]> wrote:
> > > >>
> > > >> I've identified a few things that seem impactful, but bear in mind that I haven't begun to drill down into them yet. I plan to file individual tickets and investigate in greater depth later on:
> > > >>
> > > >> 1. Formatting the date is expensive. The process of actually formatting a value is reasonable; however, using a precise clock appears to cause cache misses even when the pattern results in the same value (microseconds/nanoseconds aren't included in the result). Using the SystemMillisClock improves our throughput. I imagine some part of that improvement is the result of currentTimeMillis(): long rather than Clock.instant().
> > > >>
> > > >> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance; otherwise we see a lot of time spent in the StringBuilderEncoder, but I haven't had time to investigate that yet. Without direct encoders we allocate a great deal more byte arrays, but they're used within a small enough scope that C2 may be able to optimize them away.
> > > >>
> > > >> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern '%m{nolookups}') saves us some time scanning through produced messages -- I've had this disabled at work for a while.
> > > >>
> > > >> 4. [in progress] If I implement the full layout directly in Java (write directly to a StringBuilder based on the event in a custom Layout instead of using PatternLayout) combined with [1] and [2] above, performance is much better than Logback; however, using PatternLayout with the steps above we fall just a bit short. I still need to read through PatternLayout and the individual pattern converters to understand why, but there's definitely some headroom hiding somewhere that we'll be able to reclaim.
> > > >>
> > > >> -ck
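For reference, a rough sketch of the hand-rolled layout approach from item 4 above: write straight into a StringBuilder instead of running the PatternLayout converter chain. The Log4j 2.x constructor and accessor details are from memory, so treat them as approximate; plugin registration is omitted, and a fully garbage-free variant would implement the Encoder/ByteBufferDestination path rather than returning a String:

    import java.nio.charset.StandardCharsets;
    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.layout.AbstractStringLayout;
    import org.apache.logging.log4j.message.Message;
    import org.apache.logging.log4j.util.StringBuilderFormattable;

    // Sketch only: "<millis> <LEVEL> <logger> - <message>\n" without PatternLayout.
    final class HandRolledLayout extends AbstractStringLayout {

        HandRolledLayout() {
            super(StandardCharsets.UTF_8);
        }

        @Override
        public String toSerializable(LogEvent event) {
            StringBuilder sb = new StringBuilder(128);
            sb.append(event.getTimeMillis())        // raw millis; a real layout would format the date
              .append(' ').append(event.getLevel())
              .append(' ').append(event.getLoggerName())
              .append(" - ");
            Message message = event.getMessage();
            if (message instanceof StringBuilderFormattable) {
                ((StringBuilderFormattable) message).formatTo(sb); // avoids an intermediate String
            } else {
                sb.append(message.getFormattedMessage());
            }
            return sb.append(System.lineSeparator()).toString();
        }
    }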
> > > >>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
> > > >>> My pleasure; I enjoy digging into this sort of performance comparison. The toughest piece right now is balancing the investigation with active projects at work. I completely agree about the importance of getting this resolved quickly, and I intend to continue investigating.
> > > >>>
> > > >>> Re Loom:
> > > >>> Currently we have a few places which reuse shared objects, potentially large ones (think StringBuilder instances on ReusableLogEvent/StringLayout/etc.), and that reuse is bounded by the number of application threads you have (which is relatively small compared to the thread counts Loom claims to support, which are admittedly large). More importantly, today we tend to reuse almost every thread in a ThreadPoolExecutor of some kind, whereas the model in Loom involves creating a new virtual thread for each task, meaning a thread-local used as a cache will have a much lower hit rate: reuse will only occur within a single task, and only if thread-locals are enabled. I hope that made sense.
> > > >>>
> > > >>> -ck
> > > >>>
> > > >>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> > > >>>> Carter, thanks so much for your great effort and attention on this issue. *I personally find this the most important thing the project needs to address with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?) release. I am pretty sure he will do his part in PR, particularly in the context of benchmarking his new work against competitors. Once these results are out – independent of whether they make sense or not – "internet" will take this as the ultimate truth, and it will require a tremendous effort to change people's minds. In conclusion, please prioritize this appropriately and keep us posted. Don't hesitate to let us/me know if there is anything (drinks, coffee, lunch?) we can help with.
> > > >>>>
> > > >>>> Regarding your remarks about Loom, I don't think I follow you there. Our thread-locals are stateless and hence should be perfectly fine while running in virtual threads too. (I am totally against blindly using TLs, but that is another discussion. I would rather prefer a recycler concept similar to the one we have in JsonTemplateLayout. I think there was already a mailing list thread about this.)
> > > >>>>
> > > >>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <[email protected]> wrote:
> > > >>>>> Yes, I'm still looking into this.
> > > >>>>>
> > > >>>>> I agree that the zero-garbage code is difficult to follow, and Loom virtual threads will have less reuse, so our thread-locals will create more overhead than they're worth in many cases. Fingers crossed for Valhalla to land before Loom, but I haven't been following updates from either project very closely lately, and Loom seems farther along.
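As an aside, the recycler concept mentioned above (in the spirit of JsonTemplateLayout's recycler) can be sketched without any ThreadLocal at all; the class below is illustrative only, with reuse bounded by pool capacity rather than by thread count:

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    // Illustrative pool-backed recycler: behaves the same for platform and virtual threads.
    final class Recycler<T> {
        private final Supplier<T> factory;
        private final Consumer<T> cleaner;
        private final Queue<T> pool;

        Recycler(Supplier<T> factory, Consumer<T> cleaner, int capacity) {
            this.factory = factory;
            this.cleaner = cleaner;
            this.pool = new ArrayBlockingQueue<>(capacity);
        }

        T acquire() {
            T instance = pool.poll();
            return instance != null ? instance : factory.get(); // allocate only on a pool miss
        }

        void release(T instance) {
            cleaner.accept(instance); // e.g. sb -> sb.setLength(0) before the next use
            pool.offer(instance);     // silently dropped if the pool is already full
        }
    }

    // Usage: Recycler<StringBuilder> buffers =
    //         new Recycler<>(StringBuilder::new, sb -> sb.setLength(0), 256);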
> > > >>>>> I think the issue with multiple buffers these days is that we have Optane and NVMe devices which can be _nearly_ as fast as main memory, so copying contents can be problematic. In fully async mode (with some caveats around queue-full-policy configuration points and nested async appenders) we don't need locking because the appender interactions are single-threaded.
> > > >>>>>
> > > >>>>> For what it's worth, in my testing I didn't see much difference between the 8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe drive; however, I did discover and resolve LOG4J2-3150 (incorrect default buffer size in the RandomAccessFileAppender).
> > > >>>>>
> > > >>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > > >>>>>> While that is true, it does mean that we are locking higher up in the call stack than we need to be.
> > > >>>>>>
> > > >>>>>> Ralph
> > > >>>>>>
> > > >>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>> I'm fairly certain the JIT will optimize out the locking operations on the nested synchronized calls after a while. I'm not sure how soon into the performance tests that would happen.
> > > >>>>>>>
> > > >>>>>>> Tim
> > > >>>>>>>
> > > >>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> wrote:
> > > >>>>>>>>
> > > >>>>>>>> So are you continuing to look at this? Frankly, the work on zero-gc stuff made this very complicated. I am not sure I can even follow the logic anymore. I also noticed that many of the methods dealing with byte buffers are synchronized. I am not sure why, as only the method that moves data really needs to be.
> > > >>>>>>>>
> > > >>>>>>>> Using multiple buffers would make sense if you are filling one from the various patterns while it is unlocked and then only blocking when writing to the output stream buffer.
> > > >>>>>>>>
> > > >>>>>>>> For what it's worth, Ceki updated his performance page again yesterday to change the wording, but I am pretty sure he didn't change the log4j2 configuration to use a 256K buffer to match Logback. And the page still makes it look like he is testing the performance of asynchronous processing when it is really testing the performance of the file appenders with a blocking queue in front of them.
> > > >>>>>>>>
> > > >>>>>>>> Ralph
> > > >>>>>>>>
> > > >>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> I also tried that with similar results, which leads me to believe we're spending too much time copying between buffers.
> > > >>>>>>>>>
> > > >>>>>>>>> We've proven that log4j can get LogEvents to the appender very quickly and efficiently.
> > > >>>>>>>>>
> > > >>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. AbstractStringBuilder used to store a char array in Java 8; however, with JEP 254 (compact strings) it was updated (similarly to String) to use a byte array plus a byte representing the encoding.
> > > >>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > > >>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a direct copy from char[] to char[] in Java 8, but requires conversion from byte[]+coder now).
> > > >>>>>>>>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
> > > >>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer to the ByteBufferDestination ByteBuffer (which is also always a heap buffer as far as I can tell).
> > > >>>>>>>>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile by extracting the heap byte-array.
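To make the copy chain concrete, here is a bare-bones, standalone JDK sketch of encoding a CharSequence straight into a single direct ByteBuffer and writing it to a FileChannel, i.e. the shape the steps above could collapse into. This is not the existing StringBuilderEncoder code; error handling and encoder flushing are simplified:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch: CharSequence -> direct ByteBuffer -> FileChannel, no intermediate heap buffers.
    final class DirectEncodingWriter implements AutoCloseable {
        private final FileChannel channel;
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

        DirectEncodingWriter(Path path) throws IOException {
            this.channel = FileChannel.open(
                    path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        // Encodes straight into the direct buffer, draining to the channel whenever it fills.
        void write(CharSequence text) throws IOException {
            CharBuffer chars = CharBuffer.wrap(text); // wraps the CharSequence, no char[] copy
            encoder.reset();
            while (true) {
                CoderResult result = encoder.encode(chars, buffer, true);
                if (result.isOverflow()) {
                    flush(); // buffer full: write it out and keep encoding
                } else {
                    break;   // underflow: done (malformed-input handling omitted in this sketch)
                }
            }
        }

        void flush() throws IOException {
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            buffer.clear();
        }

        @Override
        public void close() throws IOException {
            flush();
            channel.close();
        }
    }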
> > > >>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have optimized out a bit of the copying because the types exactly matched. It's unclear why we need to copy through two separate ByteBuffers from StringBuilderEncoder into ByteBufferDestination.
> > > >>>>>>>>> I've generally had better I/O performance using direct byte-buffers than heap buffers, although that experience largely comes from work on webservers; I haven't tested whether it holds true for file I/O. If we operate directly upon a FileChannel, we can write direct buffers directly to files. The DirectByteBuffer memory is guaranteed to be contiguous, so it should have better performance around copying and manipulation as well.
> > > >>>>>>>>>
> > > >>>>>>>>> -ck
> > > >>>>>>>>>
> > > >>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > > >>>>>>>>>> Ralph, maybe a dumb idea but... can't we simply write to /dev/null to avoid the hardware's influence on the test results?
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]> wrote:
> > > >>>>>>>>>>> Did you add the no-op appender to Ceki's project? Or are you using our NullAppender? I have my doubts about using that NullAppender as it does nothing - not even render the Layout - so it may get completely optimized away.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I'd still like to know what kind of disks Remko used for his original testing. The page says he got 18.495 million messages per second. Each message would be about 130 characters long from what I can tell. That means the destination has to be able to sustain at least about 2.4 GB per second.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's). That means the formatted log line is only going to be about 60 characters long. For me the async test is writing at about 1700 ops/ms, which equates to 102 MB/s, yet I am still seeing the queue backed up. So how we are writing must be terribly inefficient. Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Your test below would be generating about 197 MB/s.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has Logback configured to use a buffer size of 256K while Log4j2 uses the default of 8K. Setting Log4j2 to 256K considerably improved the performance.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Ralph
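The buffer-size effect is easy to see in isolation: with roughly 60-byte events, an 8 KiB buffer forces a write to the device about every 140 events, while a 256 KiB buffer batches a few thousand events per write (in Log4j this corresponds to the file appender's bufferSize setting). A plain-JDK illustration of the same knob, not Log4j appender code:

    import java.io.BufferedOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    // Rough illustration of why the appender buffer size matters: it controls how many
    // events are batched per underlying write.
    public final class BufferSizeDemo {
        // Counts how many times the "device" is actually written to.
        static final class CountingSink extends OutputStream {
            long writes;
            @Override public void write(int b) { writes++; }
            @Override public void write(byte[] b, int off, int len) { writes++; }
        }

        static long writesFor(int bufferSize) throws IOException {
            CountingSink sink = new CountingSink();
            byte[] event = "2021-08-28 09:44:31,000 INFO  c.e.Foo - a short example log line\n"
                    .getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
                for (int i = 0; i < 1_000_000; i++) {
                    out.write(event); // analogous to one encoded log event
                }
            }
            return sink.writes;
        }

        public static void main(String[] args) throws IOException {
            System.out.println("8 KiB buffer:   " + writesFor(8 * 1024) + " underlying writes");
            System.out.println("256 KiB buffer: " + writesFor(256 * 1024) + " underlying writes");
        }
    }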
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> If we move the appender performance aside, am I right to conclude that the entire complex async framework built atop the Disruptor, with all its bells and whistles, is yielding a diminishing performance gain compared to a simple off-the-shelf blocking queue?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I haven't seen any data that would give me such an impression -- the file appender itself is slower than the thread producing events, which appears to limit our throughput. Even then, log4j2 does much better than Logback in my testing on a Linux workstation with 20+ producer threads.
> > > >>>>>>>>>>>> Any time the queue is constantly full, performance will suffer (and this applies to standard blocking queues as well as the Disruptor; however, the Disruptor will fare worse when the queue is full).
> > > >>>>>>>>>>>> The results using a single thread are roughly reproducible on the non-async test, so I think the problem is somewhere between the layout and the appender/manager.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Results running locally with no-op appender implementations:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 1 thread:
> > > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 20 threads:
> > > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > > >>>>>
> > > >>>>> -ck
> >
> > -ck
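Related to the no-op appender discussion above (and Ralph's concern that the NullAppender never renders the Layout): a benchmark sink can still pay the full layout and encoding cost and then discard the bytes. A rough sketch against the Log4j 2.x core API; the constructor shape is from memory of recent 2.x releases, and plugin/factory wiring is omitted:

    import java.io.Serializable;
    import org.apache.logging.log4j.core.Filter;
    import org.apache.logging.log4j.core.Layout;
    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.appender.AbstractAppender;
    import org.apache.logging.log4j.core.config.Property;

    // Benchmark-only sink: exercises the layout, then discards the bytes.
    final class EncodingNoOpAppender extends AbstractAppender {

        private volatile int sink; // keeps the encoded result from being optimized away

        EncodingNoOpAppender(String name, Filter filter, Layout<? extends Serializable> layout) {
            super(name, filter, layout, true, Property.EMPTY_ARRAY);
        }

        @Override
        public void append(LogEvent event) {
            // Pay the full layout/encoding cost so it is measured, but skip the device write.
            sink = getLayout().toByteArray(event).length;
        }
    }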
