Thanks, Remko. The default '%d' uses FixedDateFormat with FixedFormat.DEFAULT. The FastDateFormat alternative does not support microseconds, so it doesn't suffer from the same problem. I think I can substantially reduce the frequency we re-format dates by checking FixedFormat.secondFractionDigits to determine if we need to compare microseconds.
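Roughly, the idea could look like the sketch below (hypothetical class and field names for illustration; this is not Log4j's actual FixedDateFormat code): the formatted string is cached, and sub-millisecond precision is only compared when the pattern can actually print it.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: cache the formatted timestamp and only re-format when
// the printable portion of the instant changes. With three fraction digits
// (milliseconds), microsecond changes never invalidate the cache.
final class CachedTimestampFormatter {
    private final DateTimeFormatter formatter;
    private final int secondFractionDigits; // e.g. 3 for millis, 6 for micros
    private long cachedEpochMillis = Long.MIN_VALUE;
    private long cachedMicrosOfMilli = -1;
    private String cachedText;

    CachedTimestampFormatter(String pattern, int secondFractionDigits) {
        this.formatter = DateTimeFormatter.ofPattern(pattern)
                .withZone(ZoneId.systemDefault());
        this.secondFractionDigits = secondFractionDigits;
    }

    String format(Instant instant) {
        long millis = instant.toEpochMilli();
        // Only compare sub-millisecond precision when the pattern can show it.
        long microsOfMilli = secondFractionDigits > 3
                ? (instant.getNano() / 1000) % 1000
                : 0;
        if (millis != cachedEpochMillis || microsOfMilli != cachedMicrosOfMilli) {
            cachedEpochMillis = millis;
            cachedMicrosOfMilli = microsOfMilli;
            cachedText = formatter.format(instant);
        }
        return cachedText;
    }
}
```

With a millisecond-precision pattern, two instants in the same millisecond hit the cache even when their microseconds differ, which is the re-format reduction described above.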
On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote: > I remember looking at PatternLayout performance, I reported my findings here, > hopefully they’re still useful: > https://issues.apache.org/jira/browse/LOG4J2-930 > > If %d is used in the pattern, does the FixedDateFormat get used? > > > > > > On Aug 28, 2021, at 4:33, Ralph Goers <ralph.go...@dslextreme.com> wrote: > > > > All of that agrees with my observations as well. > > > > Ralph > > > >> On Aug 27, 2021, at 12:23 PM, Carter Kozak <cko...@ckozak.net> wrote: > >> > >> I've identified a few things that seem impactful, but bear in mind that I > >> haven't begun to drill down into them yet. I plan to file individual > >> tickets and investigate in greater depth later on: > >> > >> 1. Formatting the date is expensive. The process of actually formatting a > >> value is reasonable, however using a precise clock appears to cause cache > >> misses even when the pattern results in the same value > >> (microseconds/nanoseconds aren't included in the result). Using the > >> SystemMillisClock improves our throughput. I imagine some part of that > >> improvement is the result of currentTimeMillis(): long rather than > >> Clock.instant(). > >> > >> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance, > >> otherwise we see a lot of time spent in the StringBuilderEncoder, but I > >> haven't had time to investigate that yet. Without direct encoders, we are > >> allocating a great deal more byte arrays, but they're used across a small > >> space that C2 may be able to inline out. > >> > >> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern > >> '%m{nolookups}') saves us some time scanning through produced messages -- > >> I've had this disabled at work for a while. > >> > >> 4. 
[in progress] If I implement the full layout directly in Java (write > >> directly to a StringBuilder based on the event in a custom Layout instead > >> of using PatternLayout) combined with [1] and [2] above, performance is > >> much better than Logback, however using PatternLayout with the steps above > >> we fall just a bit short. I still need to read through the PatternLayout > >> and individual pattern converters to understand why, but there's > >> definitely some headroom we'll be able to reclaim hiding somewhere. > >> > >> -ck > >> > >>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote: > >>> My pleasure, I enjoy digging into this sort of performance comparison; the > >>> toughest piece right now is balancing investigation with active projects > >>> at > >>> work. I completely agree about the importance of getting this resolved > >>> quickly and I intend to continue investigating. > >>> > >>> Re Loom: > >>> Currently we have a few places which reuse shared objects, potentially > >>> large > >>> ones (thinking StringBuilder instances on > >>> ReusableLogEvent/StringLayout/etc) > >>> which are currently bounded by the number of application threads you have > >>> (which is relatively small compared to the Loom claims, which are > >>> admittedly > >>> large). More importantly, today we tend to reuse almost every thread in a > >>> ThreadPoolExecutor of some kind, where the model in Loom involves creating > >>> a new virtual thread for each task, meaning using a thread-local as a > >>> cache > >>> will have a much lower hit rate as reuse will only occur within a single > >>> task, > >>> and only if ThreadLocals are enabled. I hope that made sense. > >>> > >>> -ck > >>> > >>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote: > >>>> Carter, thanks so much for your great effort and attention on this > >>>> issue. 
*I > >>>> personally find this the most important thing the project needs to > >>>> address with urgency right now.* Ceki is warming up for a new Logback > >>>> (and > >>>> SLF4J?) release. I am pretty sure he will do his part in PR, particularly > >>>> in the context of benchmarking his new work with competitors. Once these > >>>> results are out – independent of whether they make sense or not – > >>>> "internet" will take this as the ultimate truth and it will require a > >>>> tremendous effort to change people's minds. In conclusion, please > >>>> prioritize > >>>> this appropriately and keep us posted. Don't hesitate to let us/me know > >>>> if > >>>> there is anything (drinks, coffee, lunch?) we can help with. > >>>> > >>>> Regarding your remarks about Loom, I don't think I follow you there. Our > >>>> thread-locals are stateless, hence should be perfectly fine while running > >>>> in > >>>> virtual threads too. (I am totally against blindly using TLs, but that is > >>>> another discussion. I would prefer a recycler concept similar to > >>>> the > >>>> one we have in JsonTemplateLayout. I think there was already a mailing > >>>> list > >>>> thread about this.) > >>>> > >>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote: > >>>> > >>>>> Yes, I'm still looking into this. > >>>>> > >>>>> I agree that the zero-garbage code is difficult to follow, and Loom > >>>>> virtual threads will have less reuse so our thread-locals will create > >>>>> more > >>>>> overhead than they're worth in many cases. Fingers crossed for Valhalla > >>>>> to > >>>>> land before Loom, but I haven't been following updates from either > >>>>> project > >>>>> very closely lately, and Loom seems farther along. > >>>>> > >>>>> I think the issue with multiple buffers these days is that we have > >>>>> Optane > >>>>> and NVMe devices which can be _nearly_ as fast as main memory, so > >>>>> copying > >>>>> contents can be problematic. 
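The recycler concept mentioned above (similar in spirit to the one in JsonTemplateLayout) could roughly take the shape of a queue-backed pool instead of a ThreadLocal, so reuse survives short-lived virtual threads. A hedged sketch, not the actual JsonTemplateLayout API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical queue-backed recycler: unlike a ThreadLocal cache, pooled
// instances can be reused across short-lived (e.g. virtual) threads.
final class Recycler<T> {
    private final ArrayBlockingQueue<T> pool;
    private final Supplier<T> factory;
    private final Consumer<T> cleaner;

    Recycler(int capacity, Supplier<T> factory, Consumer<T> cleaner) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
        this.cleaner = cleaner;
    }

    T acquire() {
        T instance = pool.poll();
        // Fall back to allocation when the pool is empty, so correctness
        // never depends on reuse actually happening.
        return instance != null ? instance : factory.get();
    }

    void release(T instance) {
        cleaner.accept(instance);
        pool.offer(instance); // silently dropped if the pool is full
    }
}
```

A layout would acquire a StringBuilder, format into it, and release it; the hit rate then depends on pool capacity rather than on thread identity.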
In fully async mode (with some caveats > >>>>> around > >>>>> queue-full-policy configuration points and nested async appenders) we > >>>>> don't > >>>>> need locking because the appender interactions are single-threaded. > >>>>> > >>>>> For what it's worth, in my testing I didn't see much difference between > >>>>> the 8 kB buffer and 256 kB buffer on a system with a relatively fast > >>>>> nvme > >>>>> drive, however I did discover and resolve LOG4J2-3150 (incorrect default > >>>>> buffer size in the RandomAccessFileAppender). > >>>>> > >>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote: > >>>>>> While that is true it does mean that we are locking higher up in the > >>>>> call stack than we need to be. > >>>>>> > >>>>>> Ralph > >>>>>> > >>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote: > >>>>>>> > >>>>>>> I’m fairly certain the JIT will optimize out the locking operations on > >>>>> the nested synchronized calls after a while. I’m not sure how soon into > >>>>> the > >>>>> performance tests that would happen. > >>>>>>> > >>>>>>> Tim > >>>>>>> > >>>>>>> > >>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers > >>>>>>>> <ralph.go...@dslextreme.com> > >>>>> wrote: > >>>>>>>> > >>>>>>>> So are you continuing to look at this? Frankly, the work on zero-gc > >>>>> stuff made this > >>>>>>>> very complicated. I am not sure I can even follow the logic anymore. > >>>>> I also noticed > >>>>>>>> that many of the methods dealing with byte buffers are synchronized. > >>>>> I am not sure > >>>>>>>> why as only the method that moves data really needs to be. > >>>>>>>> > >>>>>>>> Using multiple buffers would make sense if you are filling one from > >>>>> the various patterns > >>>>>>>> while it is unlocked and then only blocking when writing to the > >>>>> output stream buffer. 
> >>>>>>>> > >>>>>>>> For what it's worth, Ceki updated his performance page again yesterday > >>>>> to change > >>>>>>>> the wording but I am pretty sure he didn’t change the log4j2 > >>>>> configuration to use a > >>>>>>>> 256K buffer to match Logback. And the page still makes it look like > >>>>> he is testing > >>>>>>>> the performance of asynchronous processing when it is really testing > >>>>> the performance > >>>>>>>> of the file appenders with a blocking queue in front of them. > >>>>>>>> > >>>>>>>> Ralph > >>>>>>>> > >>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote: > >>>>>>>>> > >>>>>>>>> I also tried that with similar results, which leads me to believe > >>>>> we're spending > >>>>>>>>> too much time copying between buffers. > >>>>>>>>> > >>>>>>>>> We've proven that log4j can get LogEvents to the appender very > >>>>> quickly and > >>>>>>>>> efficiently. > >>>>>>>>> > >>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. > >>>>> AbstractStringBuilder used to store a char array in Java 8, however with > >>>>> JEP 254 compact strings it was updated (similarly to String) to use a > >>>>> byte > >>>>> array + a byte representing the encoding. > >>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination > >>>>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps: > >>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have > >>>>> been a direct copy from char[] to char[] in Java 8, but requires > >>>>> conversion > >>>>> from byte[]+coder now). > >>>>>>>>> 2. Encode data from the CharBuffer to the StringBuilderEncoder > >>>>> ByteBuffer. > >>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer to the > >>>>> ByteBufferDestination ByteBuffer (which is also always a heap buffer as > >>>>> far > >>>>> as I can tell). > >>>>>>>>> 4. Write the ByteBufferDestination to the > >>>>> FileOutputStream/RandomAccessFile by extracting the heap byte-array. 
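The four copy steps above can be sketched roughly as follows. This is a simplified, single-threaded illustration of the data movement, not Log4j's actual StringBuilderEncoder code:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CopyPipelineSketch {
    static byte[] encode(StringBuilder text) {
        // 1. Copy the StringBuilder value into a CharBuffer (on Java 9+ this
        //    converts from the compact byte[]+coder representation).
        CharBuffer chars = CharBuffer.allocate(text.length());
        for (int i = 0; i < text.length(); i++) {
            chars.put(text.charAt(i));
        }
        chars.flip();

        // 2. Encode the CharBuffer into an intermediate heap ByteBuffer.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer encoded = ByteBuffer.allocate(text.length() * 4);
        encoder.encode(chars, encoded, true);
        encoder.flush(encoded);
        encoded.flip();

        // 3. Copy the intermediate buffer into the destination buffer
        //    (standing in for the ByteBufferDestination's heap buffer).
        ByteBuffer destination = ByteBuffer.allocate(encoded.remaining());
        destination.put(encoded);
        destination.flip();

        // 4. Extract the heap byte array to hand to the output stream.
        byte[] out = new byte[destination.remaining()];
        destination.get(out);
        return out;
    }
}
```

Each step touches every byte, which is why eliminating even one of the intermediate copies matters at these throughputs.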
> >>>>>>>>> > >>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have > >>>>> optimized out a bit of the copying because the types exactly matched. > >>>>>>>>> It's unclear why we need to copy through two separate ByteBuffers > >>>>> from StringBuilderEncoder into ByteBufferDestination. > >>>>>>>>> I've generally had better I/O performance using direct byte-buffers > >>>>> than heap buffers, although that experience largely comes from work on > >>>>> webservers; I haven't tested whether it holds true for file I/O. If we > >>>>> operate > >>>>> directly upon a FileChannel, we can write direct buffers directly to > >>>>> files. > >>>>> The DirectByteBuffer memory is guaranteed to be contiguous, so it should > >>>>> have better performance around copying and manipulation as well. > >>>>>>>>> > >>>>>>>>> -ck > >>>>>>>>> > >>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote: > >>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null > >>>>> to avoid > >>>>>>>>>> the hardware's influence in the test results? > >>>>>>>>>> > >>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers < > >>>>> ralph.go...@dslextreme.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Did you add the no-op appender to Ceki’s project? Or are you using > >>>>> our > >>>>>>>>>>> NullAppender? I have > >>>>>>>>>>> my doubts about using that NullAppender as it does nothing - not > >>>>> even > >>>>>>>>>>> render the Layout, so > >>>>>>>>>>> it may get completely optimized away. > >>>>>>>>>>> > >>>>>>>>>>> I’d still like to know what kind of disks Remko did his original > >>>>> testing on. > >>>>>>>>>>> The page says he got > >>>>>>>>>>> 18.495 million messages per second. Each message would be about > >>>>>>>>>>> 130 > >>>>>>>>>>> characters long > >>>>>>>>>>> from what I can tell. That means the destination has to be able to > >>>>> support > >>>>>>>>>>> at least 2.4 GB > >>>>>>>>>>> per second. 
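Writing through a FileChannel with a direct buffer, as suggested above, might look like the following hedged sketch (class name, path, and buffer size are illustrative, not an existing Log4j manager):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: buffer encoded bytes in a direct ByteBuffer and hand
// it straight to the FileChannel, avoiding the extract-heap-array step.
final class DirectBufferWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer; // direct: contiguous native memory

    DirectBufferWriter(Path path, int bufferSize) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.buffer = ByteBuffer.allocateDirect(bufferSize);
    }

    void write(byte[] data) throws IOException {
        if (data.length > buffer.remaining()) {
            flush();
        }
        buffer.put(data); // assumes data fits in an empty buffer
    }

    void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer); // no intermediate heap array copy
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        flush();
        channel.close();
    }
}
```

Whether this beats a buffered FileOutputStream for sequential file appends would need benchmarking; the claim above about webservers does not automatically transfer.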
> >>>>>>>>>>> > >>>>>>>>>>> In Ceki’s test the message is only 23 characters long (vs 100 in > >>>>> Remko’s). > >>>>>>>>>>> That means the > >>>>>>>>>>> message size is only going to be about 60 characters long. For me > >>>>> the > >>>>>>>>>>> async test is writing > >>>>>>>>>>> at about 1700 ops/ms, which equates to 102 MB/s, yet I am still > >>>>> seeing the > >>>>>>>>>>> queue backed up. > >>>>>>>>>>> So how we are writing must be terribly inefficient. Logback is > >>>>> getting > >>>>>>>>>>> about 2000 ops/ms, > >>>>>>>>>>> which is still only 120 MB/s. > >>>>>>>>>>> > >>>>>>>>>>> Your test below would be generating about 197 MB/s. > >>>>>>>>>>> > >>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has > >>>>> Logback > >>>>>>>>>>> configured to > >>>>>>>>>>> use a buffer size of 256K while Log4j2 uses the default of 8K. > >>>>> Setting > >>>>>>>>>>> Log4j2 to 256K > >>>>>>>>>>> considerably improved the performance. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Ralph > >>>>>>>>>>> > >>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net> > >>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> If we would move the appender performance aside, am I right to > >>>>> conclude > >>>>>>>>>>>>> that the entire complex async. framework built atop Disruptor > >>>>> with all > >>>>>>>>>>> its > >>>>>>>>>>>>> whistles and bells is yielding a diminishing performance gain > >>>>> compared > >>>>>>>>>>> to a > >>>>>>>>>>>>> simple off-the-shelf blocking queue? > >>>>>>>>>>>> > >>>>>>>>>>>> I haven't seen any data that would give me such an impression -- > >>>>> the > >>>>>>>>>>> file appender > >>>>>>>>>>>> itself is slower than the thread producing events, which appears > >>>>> to > >>>>>>>>>>> limit our throughput. > >>>>>>>>>>>> Even then, log4j2 does much better in my testing on a Linux > >>>>> workstation > >>>>>>>>>>> with 20+ producer > >>>>>>>>>>>> threads compared to Logback. 
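Matching Logback's 256K buffer on the Log4j2 side of the benchmark would look something like this illustrative log4j2.xml fragment (appender name, file path, and pattern are placeholders; `bufferedIO`/`bufferSize` are the FileAppender attributes, with `bufferSize` defaulting to 8192):

```xml
<Appenders>
  <!-- 256 KiB buffer to match the Logback benchmark configuration -->
  <File name="file" fileName="logs/app.log"
        bufferedIO="true" bufferSize="262144">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
  </File>
</Appenders>
```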
> >>>>>>>>>>>> Any time the queue is constantly full, performance will suffer > >>>>> (and this > >>>>>>>>>>> applies to standard > >>>>>>>>>>>> blocking queues as well as disruptor, however disruptor will fare > >>>>> worse > >>>>>>>>>>> when the queue > >>>>>>>>>>>> is full). > >>>>>>>>>>>> The results using a single thread are roughly reproducible on the > >>>>>>>>>>> non-async test, so I think > >>>>>>>>>>>> the problem is somewhere between the layout, and > >>>>>>>>>>>> appender/manager. > >>>>>>>>>>>> > >>>>>>>>>>>> Results running locally with no-op appender implementations: > >>>>>>>>>>>> > >>>>>>>>>>>> 1 thread: > >>>>>>>>>>>> > >>>>>>>>>>>> Benchmark Mode Cnt > >>>>> Score > >>>>>>>>>>> Error Units > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > >>>>> 3285.743 ± > >>>>>>>>>>> 427.963 ops/ms > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > >>>>> 2383.532 ± > >>>>>>>>>>> 136.034 ops/ms > >>>>>>>>>>>> > >>>>>>>>>>>> 20 threads: > >>>>>>>>>>>> > >>>>>>>>>>>> Benchmark Mode Cnt > >>>>> Score > >>>>>>>>>>> Error Units > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile thrpt 4 > >>>>> 1177.746 ± > >>>>>>>>>>> 919.560 ops/ms > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile thrpt 4 > >>>>> 602.614 ± > >>>>>>>>>>> 47.485 ops/ms > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> -ck > >>>>> > >>>> > >>> > > > > > -ck