All of that agrees with my observations as well.

Ralph
> On Aug 27, 2021, at 12:23 PM, Carter Kozak <[email protected]> wrote:
>
> I've identified a few things that seem impactful, but bear in mind that I haven't begun to drill down into them yet. I plan to file individual tickets and investigate in greater depth later on:
>
> 1. Formatting the date is expensive. The process of actually formatting a value is reasonable; however, using a precise clock appears to cause cache misses even when the pattern results in the same value (microseconds/nanoseconds aren't included in the result). Using the SystemMillisClock improves our throughput. I imagine some part of that improvement is the result of currentTimeMillis() returning a long rather than Clock.instant().
>
> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance; otherwise we see a lot of time spent in the StringBuilderEncoder, but I haven't had time to investigate that yet. Without direct encoders we allocate a great many more byte arrays, but they're used across a small space that C2 may be able to inline out.
>
> 3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern '%m{nolookup}') saves us some time scanning through produced messages -- I've had this disabled at work for a while.
>
> 4. [in progress] If I implement the full layout directly in Java (write directly to a StringBuilder based on the event in a custom Layout instead of using PatternLayout), combined with [1] and [2] above, performance is much better than Logback; however, using PatternLayout with the steps above we fall just a bit short. I still need to read through PatternLayout and the individual pattern converters to understand why, but there's definitely some headroom we'll be able to reclaim hiding somewhere.
>
> -ck
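As an aside, a minimal sketch of the kind of hand-rolled layout item 4 above describes could look like the following. The class name and the chosen fields are hypothetical; it only illustrates appending the event straight into the reusable StringBuilder that AbstractStringLayout provides instead of going through PatternLayout's converter chain:

    import java.nio.charset.StandardCharsets;

    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.layout.AbstractStringLayout;
    import org.apache.logging.log4j.message.Message;
    import org.apache.logging.log4j.util.StringBuilderFormattable;

    // Hypothetical hand-rolled layout: appends a fixed format directly to the
    // thread-local StringBuilder supplied by AbstractStringLayout.
    final class HandRolledLayout extends AbstractStringLayout {

        HandRolledLayout() {
            super(StandardCharsets.UTF_8);
        }

        @Override
        public String toSerializable(final LogEvent event) {
            final StringBuilder sb = getStringBuilder();
            sb.append(event.getTimeMillis())
              .append(' ')
              .append(event.getLevel())
              .append(' ')
              .append(event.getLoggerName())
              .append(" - ");
            final Message message = event.getMessage();
            if (message instanceof StringBuilderFormattable) {
                // Avoids materializing an intermediate String for the message.
                ((StringBuilderFormattable) message).formatTo(sb);
            } else {
                sb.append(message.getFormattedMessage());
            }
            return sb.append(System.lineSeparator()).toString();
        }
    }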
>
> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
>> My pleasure, I enjoy digging into this sort of performance comparison; the toughest piece right now is balancing investigation with active projects at work. I completely agree about the importance of getting this resolved quickly, and I intend to continue investigating.
>>
>> Re Loom:
>> Currently we have a few places which reuse shared objects, potentially large ones (think StringBuilder instances on ReusableLogEvent/StringLayout/etc.), which are currently bounded by the number of application threads you have (which is relatively small compared to the thread counts Loom claims to support, which are admittedly large). More importantly, today we tend to reuse almost every thread in a ThreadPoolExecutor of some kind, whereas the model in Loom involves creating a new virtual thread for each task, meaning that using a thread-local as a cache will have a much lower hit rate: reuse will only occur within a single task, and only if thread-locals are enabled. I hope that made sense.
>>
>> -ck
>>
>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
>>> Carter, thanks so much for your great effort and attention on this issue. *I personally find this the most important thing the project needs to address with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?) release. I am pretty sure he will do his part in PR, particularly in the context of benchmarking his new work against competitors. Once those results are out -- independent of whether they make sense or not -- the "internet" will take them as the ultimate truth, and it will require a tremendous effort to change people's minds. In conclusion, please prioritize this appropriately and keep us posted. Don't hesitate to let us/me know if there is anything (drinks, coffee, lunch?) we can help with.
>>>
>>> Regarding your remarks about Loom, I don't think I follow you there. Our thread-locals are stateless, hence should be perfectly fine while running in virtual threads too. (I am totally against blindly using TLs, but that is another discussion. I would prefer a recycler concept similar to the one we have in JsonTemplateLayout. I think there was already a mailing list thread about this.)
>>>
>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <[email protected]> wrote:
>>>
>>>> Yes, I'm still looking into this.
>>>>
>>>> I agree that the zero-garbage code is difficult to follow, and Loom virtual threads will have less reuse, so our thread-locals will create more overhead than they're worth in many cases. Fingers crossed for Valhalla to land before Loom, but I haven't been following updates from either project very closely lately, and Loom seems farther along.
>>>>
>>>> I think the issue with multiple buffers these days is that we have Optane and NVMe devices which can be _nearly_ as fast as main memory, so copying contents can be problematic. In fully async mode (with some caveats around queue-full-policy configuration points and nested async appenders) we don't need locking because the appender interactions are single-threaded.
>>>>
>>>> For what it's worth, in my testing I didn't see much difference between the 8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe drive; however, I did discover and resolve LOG4J2-3150 (incorrect default buffer size in the RandomAccessFileAppender).
>>>>
>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
>>>>> While that is true, it does mean that we are locking higher up in the call stack than we need to be.
>>>>>
>>>>> Ralph
>>>>>
>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
>>>>>>
>>>>>> I'm fairly certain the JIT will optimize out the locking operations on the nested synchronized calls after a while. I'm not sure how soon into the performance tests that would happen.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> wrote:
>>>>>>>
>>>>>>> So are you continuing to look at this? Frankly, the work on the zero-GC stuff made this very complicated. I am not sure I can even follow the logic anymore. I also noticed that many of the methods dealing with byte buffers are synchronized. I am not sure why, as only the method that moves data really needs to be.
>>>>>>>
>>>>>>> Using multiple buffers would make sense if you are filling one from the various patterns while it is unlocked and then only blocking when writing to the output stream buffer.
>>>>>>>
>>>>>>> For what it's worth, Ceki updated his performance page again yesterday to change the wording, but I am pretty sure he didn't change the Log4j2 configuration to use a 256K buffer to match Logback. And the page still makes it look like he is testing the performance of asynchronous processing when it is really testing the performance of the file appenders with a blocking queue in front of them.
>>>>>>>
>>>>>>> Ralph
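To make the multiple-buffer idea above concrete, here is a rough sketch (purely illustrative, not Log4j internals): each thread formats into its own scratch buffer outside the lock, and only the hand-off to the shared stream is synchronized.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    // Illustrative double-buffering sketch: formatting happens in a
    // thread-confined buffer, and only the final copy to the shared
    // destination is done under the lock.
    final class DoubleBufferedWriter {

        private final OutputStream out;                     // shared destination
        private final Object lock = new Object();
        private final ThreadLocal<StringBuilder> scratch =
                ThreadLocal.withInitial(() -> new StringBuilder(512));

        DoubleBufferedWriter(OutputStream out) {
            this.out = out;
        }

        void write(String level, String logger, String message) throws IOException {
            // Format outside the lock, in a per-thread buffer.
            StringBuilder sb = scratch.get();
            sb.setLength(0);
            sb.append(level).append(' ').append(logger).append(" - ").append(message).append('\n');
            byte[] encoded = sb.toString().getBytes(StandardCharsets.UTF_8);

            // Only the hand-off to the shared stream blocks other threads.
            synchronized (lock) {
                out.write(encoded);
            }
        }
    }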
>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I also tried that with similar results, which leads me to believe we're spending too much time copying between buffers.
>>>>>>>>
>>>>>>>> We've proven that log4j can get LogEvents to the appender very quickly and efficiently.
>>>>>>>>
>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. AbstractStringBuilder used to store a char array in Java 8, but with JEP 254 (compact strings) it was updated (similarly to String) to use a byte array plus a byte representing the encoding.
>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination (heap) ByteBuffer in StringBuilderEncoder using the following steps:
>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a direct copy from char[] to char[] in Java 8, but requires conversion from byte[] + coder now).
>>>>>>>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the ByteBufferDestination ByteBuffer (which is also always a heap buffer as far as I can tell).
>>>>>>>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile by extracting the heap byte array.
>>>>>>>>
>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have optimized out a bit of the copying because the types exactly matched. It's unclear why we need to copy through two separate ByteBuffers from StringBuilderEncoder into ByteBufferDestination.
>>>>>>>> I've generally had better I/O performance using direct byte buffers than heap buffers, although that experience largely comes from work on web servers; I haven't tested whether it holds true for file I/O. If we operate directly upon a FileChannel, we can write direct buffers straight to files. The DirectByteBuffer memory is guaranteed to be contiguous, so it should have better performance around copying and manipulation as well.
>>>>>>>>
>>>>>>>> -ck
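A minimal sketch of the direct-buffer approach described above, assuming a dedicated FileChannel per destination (hypothetical class, not the proposed change): text is encoded straight into a reused direct ByteBuffer and handed to the channel, skipping the intermediate heap-array copies.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical writer: encodes text into a reused direct ByteBuffer and
    // writes it straight to a FileChannel.
    final class DirectChannelWriter implements AutoCloseable {

        private final FileChannel channel;
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

        DirectChannelWriter(Path path) throws IOException {
            this.channel = FileChannel.open(path,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        synchronized void write(CharSequence text) throws IOException {
            CharBuffer chars = CharBuffer.wrap(text);
            CoderResult result;
            do {
                result = encoder.encode(chars, buffer, true); // fill the direct buffer
                drain();                                      // hand it to the channel
            } while (result.isOverflow());                    // more input remains if the buffer filled up
            encoder.reset();                                  // UTF-8 keeps no trailing state to flush
        }

        private void drain() throws IOException {
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            buffer.clear();
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }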
>>>>>>>>
>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
>>>>>>>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null to avoid the hardware's influence on the test results?
>>>>>>>>>
>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Did you add the no-op appender to Ceki's project? Or are you using our NullAppender? I have my doubts about using that NullAppender as it does nothing -- not even render the Layout -- so it may get completely optimized away.
>>>>>>>>>>
>>>>>>>>>> I'd still like to know what kind of disks Remko did his original testing on. The page says he got 18.495 million messages per second. Each message would be about 130 characters long from what I can tell. That means the destination has to be able to support at least about 2.4 GB per second.
>>>>>>>>>>
>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's). That means the message size is only going to be about 60 characters. For me the async test is writing at about 1700 ops/ms, which equates to 102 MB/s, yet I am still seeing the queue backed up. So how we are writing must be terribly inefficient. Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
>>>>>>>>>>
>>>>>>>>>> Your test below would be generating about 197 MB/s.
>>>>>>>>>>
>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has Logback configured to use a buffer size of 256K while Log4j2 uses the default of 8K. Setting Log4j2 to 256K considerably improved the performance.
>>>>>>>>>>
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If we would move the appender performance aside, am I right to conclude that the entire complex async framework built atop Disruptor, with all its whistles and bells, is yielding a diminishing performance gain compared to a simple off-the-shelf blocking queue?
>>>>>>>>>>>
>>>>>>>>>>> I haven't seen any data that would give me such an impression -- the file appender itself is slower than the thread producing events, which appears to limit our throughput. Even then, log4j2 does much better in my testing on a Linux workstation with 20+ producer threads compared to logback.
>>>>>>>>>>> Any time the queue is constantly full, performance will suffer (and this applies to standard blocking queues as well as the disruptor; however, the disruptor will fare worse when the queue is full).
>>>>>>>>>>> The results using a single thread are roughly reproducible on the non-async test, so I think the problem is somewhere between the layout and the appender/manager.
>>>>>>>>>>>
>>>>>>>>>>> Results running locally with no-op appender implementations:
>>>>>>>>>>>
>>>>>>>>>>> 1 thread:
>>>>>>>>>>>
>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
>>>>>>>>>>>
>>>>>>>>>>> 20 threads:
>>>>>>>>>>>
>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
>>>>
>>>> -ck
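For context, those numbers come from JMH throughput runs; a stripped-down benchmark of roughly that shape (hypothetical code, not the exact class from Ceki's repository) looks like the following, with the thread count (1 vs. 20) varied via JMH's -t option.

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    // Hypothetical, stripped-down shape of the benchmark quoted above; the
    // loggers are assumed to be wired to file appenders via log4j2.xml and
    // logback.xml in the benchmark project.
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public class AsyncWithFileAppenderBenchmark {

        private static final org.apache.logging.log4j.Logger LOG4J =
                org.apache.logging.log4j.LogManager.getLogger("log4j2File");
        private static final org.slf4j.Logger LOGBACK =
                org.slf4j.LoggerFactory.getLogger("logbackFile");

        @Benchmark
        public void log4j2AsyncFile() {
            LOG4J.info("This is a debug message"); // ~23-character payload, as discussed above
        }

        @Benchmark
        public void logbackFile() {
            LOGBACK.info("This is a debug message");
        }
    }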
