On Sat, Aug 28, 2021 at 3:16 PM Ron Grabowski <rongrabow...@yahoo.com.invalid> wrote:
> Follow-up to "Formatting the date is expensive. The process of actually
> formatting a value is reasonable". Is this still an issue from LOG4J2-930?
>
>   %m %ex%n:    1,755,573 msg/sec
>   %d %m %ex%n: 1,194,184 msg/sec

No, I believe that formatting the date is no longer the bottleneck. The
analysis done in LOG4J2-930 led to
https://issues.apache.org/jira/browse/LOG4J2-1097, which resulted in the
FixedDateFormat. This format gives a good trade-off between speed and
flexibility. The drawback is that it only works for some fixed formats, but
those are the most widely used ones. I don't think that focusing on date
formatting and pre-generating formatted timestamps will be fruitful (but I
could be wrong). Avoiding the PreciseTime Instant and using
System.currentTimeMillis when it is known that none of the downstream
formatters require sub-millisecond precision may be a good optimization.
The message formatting (PatternLayout) as a whole is expensive, so there
may be more to optimize there.

> If so, isn't date rendering essentially a sequence we can generate ahead
> of time, similar to how a local ID generator asks for an allocation of
> keys and then uses that to quickly assign IDs to new objects? When it's
> time to render %d we can just grab it via an index:
>
> 1) At startup, calculate the next 32k formatted dates. If
> Clock.currentTimeMillis() were configured down to the second, 32,000
> seconds would pre-allocate %d for the next 8 hours.
> 2) Apply math to Clock.currentTimeMillis() to get an index into the
> buffer. Seconds precision:
>
>   [10] = "2021-08-28 09:44:31,000"
>   [11] = "2021-08-28 09:44:32,000"
>   [12] = "2021-08-28 09:44:33,000"
>   [13] = "2021-08-28 09:44:34,000"
>   [14] = "2021-08-28 09:44:35,000"
>   [15] = "2021-08-28 09:44:36,000"
>   ...
>   [31999] = "..."
> 50ms precision:
>
>   [10] = "2021-08-28 09:44:31,050"
>   [11] = "2021-08-28 09:44:31,075"
>   [12] = "2021-08-28 09:44:31,100"
>   [13] = "2021-08-28 09:44:31,150"
>   [14] = "2021-08-28 09:44:31,175"
>   [15] = "2021-08-28 09:44:31,200"
>   ...
>   [31999] = "..."
>
> 3) Rendering %d{SEQ(DEFAULT,32000)} is just an index lookup into the
> sequence of 32,000 pre-calculated %d{DEFAULT} values, without the cost of
> formatting. I made up the "SEQ" notation; there's likely a better way to
> express the feature. Everything can read from the buffer without locking.
> 4) Have a background thread casually keep the sequence filled in a ring,
> so dates in the past are replaced with future dates and the structure
> consumes a consistent amount of memory.
>
> On Friday, August 27, 2021, 10:07:59 PM EDT, Carter Kozak
> <cko...@ckozak.net> wrote:
>
>> Thanks, Remko. The default '%d' uses FixedDateFormat with
>> FixedFormat.DEFAULT. The FastDateFormat alternative does not support
>> microseconds, so it doesn't suffer from the same problem. I think I can
>> substantially reduce the frequency with which we re-format dates by
>> checking FixedFormat.secondFractionDigits to determine whether we need
>> to compare microseconds.
>>
>> On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote:
>>> I remember looking at PatternLayout performance; I reported my findings
>>> here, hopefully they're still useful:
>>> https://issues.apache.org/jira/browse/LOG4J2-930
>>>
>>> If %d is used in the pattern, does the FixedDateFormat get used?
>>>
>>> On Aug 28, 2021, at 4:33, Ralph Goers <ralph.go...@dslextreme.com>
>>> wrote:
>>>> All of that agrees with my observations as well.
>>>>
>>>> Ralph
>>>>
>>>> On Aug 27, 2021, at 12:23 PM, Carter Kozak <cko...@ckozak.net> wrote:
>>>>> I've identified a few things that seem impactful, but bear in mind
>>>>> that I haven't begun to drill down into them yet. I plan to file
>>>>> individual tickets and investigate in greater depth later on:
>>>>>
>>>>> 1. Formatting the date is expensive.
>>>>> The process of actually formatting a value is reasonable; however,
>>>>> using a precise clock appears to cause cache misses even when the
>>>>> pattern results in the same value (microseconds/nanoseconds aren't
>>>>> included in the result). Using the SystemMillisClock improves our
>>>>> throughput. I imagine some part of that improvement is the result of
>>>>> currentTimeMillis(): long rather than Clock.instant().
>>>>>
>>>>> 2. Setting 'log4j2.enable.direct.encoders' to false improves
>>>>> performance; otherwise we see a lot of time spent in the
>>>>> StringBuilderEncoder, but I haven't had time to investigate that yet.
>>>>> Without direct encoders we allocate a great deal more byte arrays,
>>>>> but they're used across a small space that C2 may be able to inline
>>>>> out.
>>>>>
>>>>> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern
>>>>> '%m{nolookup}') saves us some time scanning through produced messages
>>>>> -- I've had this disabled at work for a while.
>>>>>
>>>>> 4. [in progress] If I implement the full layout directly in Java
>>>>> (write directly to a StringBuilder based on the event in a custom
>>>>> Layout instead of using PatternLayout) combined with [1] and [2]
>>>>> above, performance is much better than Logback; however, using
>>>>> PatternLayout with the steps above we fall just a bit short. I still
>>>>> need to read through PatternLayout and the individual pattern
>>>>> converters to understand why, but there's definitely some headroom
>>>>> we'll be able to reclaim hiding somewhere.
>>>>>
>>>>> -ck
>>>>>
>>>>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
>>>>>> My pleasure, I enjoy digging into this sort of performance
>>>>>> comparison; the toughest piece right now is balancing investigation
>>>>>> with active projects at work. I completely agree about the
>>>>>> importance of getting this resolved quickly, and I intend to
>>>>>> continue investigating.
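[Editor's note] The date-caching idea discussed upthread (FixedDateFormat re-formats only when the timestamp actually changes, and sub-millisecond precision forces needless cache misses) can be sketched roughly as follows. This is illustrative only; the class and method names are hypothetical, not the actual Log4j implementation:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/**
 * Sketch of millisecond-granularity timestamp caching: the string is
 * re-formatted only when the millisecond value changes, so repeated log
 * events within the same millisecond pay no formatting cost.
 * Hypothetical names; not Log4j API.
 */
public final class CachedMillisFormatter {
    // Pattern matching the default '%d' layout: 2021-08-28 09:44:31,000
    private static final DateTimeFormatter FORMAT = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:mm:ss,SSS")
            .withZone(ZoneOffset.UTC);

    private long lastMillis = -1;
    private String lastFormatted;

    public synchronized String format(long epochMillis) {
        // Cache miss only when the millisecond value differs from last call.
        if (epochMillis != lastMillis) {
            lastMillis = epochMillis;
            lastFormatted = FORMAT.format(Instant.ofEpochMilli(epochMillis));
        }
        return lastFormatted;
    }
}
```

Comparing millis (rather than a precise Instant) is the point Carter raises: when secondFractionDigits needs no microseconds, the comparison key should be the millisecond value, keeping the cache hot.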
>>>>>>
>>>>>> Re Loom:
>>>>>> Currently we have a few places which reuse shared objects,
>>>>>> potentially large ones (think StringBuilder instances on
>>>>>> ReusableLogEvent/StringLayout/etc.), which are currently bounded by
>>>>>> the number of application threads you have (which is relatively
>>>>>> small compared to the Loom claims, which are admittedly large). More
>>>>>> importantly, today we tend to reuse almost every thread in a
>>>>>> ThreadPoolExecutor of some kind, whereas the model in Loom involves
>>>>>> creating a new virtual thread for each task, meaning a thread-local
>>>>>> used as a cache will have a much lower hit rate: reuse will only
>>>>>> occur within a single task, and only if ThreadLocals are enabled. I
>>>>>> hope that made sense.
>>>>>>
>>>>>> -ck
>>>>>>
>>>>>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
>>>>>>> Carter, thanks so much for your great effort and attention on this
>>>>>>> issue. *I personally find this the most important thing the project
>>>>>>> needs to address with urgency right now.* Ceki is warming up for a
>>>>>>> new Logback (and SLF4J?) release. I am pretty sure he will do his
>>>>>>> part in PR, particularly in the context of benchmarking his new
>>>>>>> work against competitors. Once these results are out -- independent
>>>>>>> of whether they make sense or not -- "internet" will take this as
>>>>>>> the ultimate truth, and it will require a tremendous effort to
>>>>>>> change people's minds. In conclusion, please prioritize this
>>>>>>> appropriately and keep us posted. Don't hesitate to let us/me know
>>>>>>> if there is anything (drinks, coffee, lunch?) we can help with.
>>>>>>>
>>>>>>> Regarding your remarks about Loom, I don't think I follow you
>>>>>>> there. Our thread-locals are stateless, hence should be perfectly
>>>>>>> fine while running in virtual threads too. (I am totally against
>>>>>>> blindly using TLs, but that is another discussion.
>>>>>>> I would rather prefer a recycler concept similar to the one we
>>>>>>> have in JsonTemplateLayout. I think there was already a mailing
>>>>>>> list thread about this.)
>>>>>>>
>>>>>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net>
>>>>>>> wrote:
>>>>>>>> Yes, I'm still looking into this.
>>>>>>>>
>>>>>>>> I agree that the zero-garbage code is difficult to follow, and
>>>>>>>> Loom virtual threads will have less reuse, so our thread-locals
>>>>>>>> will create more overhead than they're worth in many cases.
>>>>>>>> Fingers crossed for Valhalla to land before Loom, but I haven't
>>>>>>>> been following updates from either project very closely lately,
>>>>>>>> and Loom seems farther along.
>>>>>>>>
>>>>>>>> I think the issue with multiple buffers these days is that we have
>>>>>>>> Optane and NVMe devices which can be _nearly_ as fast as main
>>>>>>>> memory, so copying contents can be problematic. In fully async
>>>>>>>> mode (with some caveats around queue-full-policy configuration
>>>>>>>> points and nested async appenders) we don't need locking, because
>>>>>>>> the appender interactions are single-threaded.
>>>>>>>>
>>>>>>>> For what it's worth, in my testing I didn't see much difference
>>>>>>>> between the 8 kB buffer and the 256 kB buffer on a system with a
>>>>>>>> relatively fast NVMe drive; however, I did discover and resolve
>>>>>>>> LOG4J2-3150 (incorrect default buffer size in the
>>>>>>>> RandomAccessFileAppender).
>>>>>>>>
>>>>>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
>>>>>>>>> While that is true, it does mean that we are locking higher up in
>>>>>>>>> the call stack than we need to be.
>>>>>>>>>
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> I'm fairly certain the JIT will optimize out the locking
>>>>>>>>>> operations on the nested synchronized calls after a while.
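[Editor's note] The recycler concept Volkan mentions above (in the spirit of the JsonTemplateLayout recycler) could be a bounded shared pool instead of a ThreadLocal cache, so short-lived virtual threads still get reuse. A minimal sketch with hypothetical names, not Log4j code:

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

/**
 * Sketch of a pooled recycler: buffers are shared across all threads
 * through a bounded lock-free-ish queue, so a virtual thread that lives
 * for a single task can still acquire a previously-used StringBuilder.
 * Hypothetical names; not the JsonTemplateLayout implementation.
 */
public final class StringBuilderRecycler {
    private final Queue<StringBuilder> pool;
    private final int maxRetainedCapacity;

    public StringBuilderRecycler(int poolSize, int maxRetainedCapacity) {
        this.pool = new ArrayBlockingQueue<>(poolSize);
        this.maxRetainedCapacity = maxRetainedCapacity;
    }

    public StringBuilder acquire() {
        StringBuilder sb = pool.poll();
        return sb != null ? sb : new StringBuilder();
    }

    public void release(StringBuilder sb) {
        // Drop oversized buffers so one huge log message doesn't pin memory.
        if (sb.capacity() <= maxRetainedCapacity) {
            sb.setLength(0);
            pool.offer(sb); // silently discarded when the pool is already full
        }
    }
}
```

Unlike a ThreadLocal, the hit rate here depends on pool sizing rather than thread reuse, which is the trade-off Carter and Volkan are debating.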
>>>>>>>>>> I'm not sure how soon into the performance tests that would
>>>>>>>>>> happen.
>>>>>>>>>>
>>>>>>>>>> Tim
>>>>>>>>>>
>>>>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers
>>>>>>>>>> <ralph.go...@dslextreme.com> wrote:
>>>>>>>>>>> So are you continuing to look at this? Frankly, the work on
>>>>>>>>>>> zero-GC stuff made this very complicated. I am not sure I can
>>>>>>>>>>> even follow the logic anymore. I also noticed that many of the
>>>>>>>>>>> methods dealing with byte buffers are synchronized. I am not
>>>>>>>>>>> sure why, as only the method that moves data really needs to
>>>>>>>>>>> be.
>>>>>>>>>>>
>>>>>>>>>>> Using multiple buffers would make sense if you are filling one
>>>>>>>>>>> from the various patterns while it is unlocked and then only
>>>>>>>>>>> blocking when writing to the output stream buffer.
>>>>>>>>>>>
>>>>>>>>>>> For what it's worth, Ceki updated his performance page again
>>>>>>>>>>> yesterday to change the wording, but I am pretty sure he didn't
>>>>>>>>>>> change the Log4j2 configuration to use a 256K buffer to match
>>>>>>>>>>> Logback. And the page still makes it look like he is testing
>>>>>>>>>>> the performance of asynchronous processing when it is really
>>>>>>>>>>> testing the performance of the file appenders with a blocking
>>>>>>>>>>> queue in front of them.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> I also tried that with similar results, which leads me to
>>>>>>>>>>>> believe we're spending too much time copying between buffers.
>>>>>>>>>>>>
>>>>>>>>>>>> We've proven that log4j can get LogEvents to the appender very
>>>>>>>>>>>> quickly and efficiently.
>>>>>>>>>>>>
>>>>>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder.
>>>>>>>>>>>> AbstractStringBuilder used to store a char array in Java 8;
>>>>>>>>>>>> however, with JEP 254 (compact strings) it was updated
>>>>>>>>>>>> (similarly to String) to use a byte array plus a byte
>>>>>>>>>>>> representing the encoding.
>>>>>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination
>>>>>>>>>>>> (Heap)ByteBuffer in StringBuilderEncoder using the following
>>>>>>>>>>>> steps:
>>>>>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would
>>>>>>>>>>>> have been a direct copy from char[] to char[] in Java 8, but
>>>>>>>>>>>> requires conversion from byte[]+coder now).
>>>>>>>>>>>> 2. Encode data from the CharBuffer to the StringBuilderEncoder
>>>>>>>>>>>> ByteBuffer.
>>>>>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer to
>>>>>>>>>>>> the ByteBufferDestination ByteBuffer (which is also always a
>>>>>>>>>>>> heap buffer, as far as I can tell).
>>>>>>>>>>>> 4. Write the ByteBufferDestination to the
>>>>>>>>>>>> FileOutputStream/RandomAccessFile by extracting the heap byte
>>>>>>>>>>>> array.
>>>>>>>>>>>>
>>>>>>>>>>>> We're copying bytes around a lot, and I think older JVMs would
>>>>>>>>>>>> have optimized out a bit of the copying because the types
>>>>>>>>>>>> exactly matched. It's unclear why we need to copy through two
>>>>>>>>>>>> separate ByteBuffers from StringBuilderEncoder into
>>>>>>>>>>>> ByteBufferDestination.
>>>>>>>>>>>> I've generally had better I/O performance using direct byte
>>>>>>>>>>>> buffers than heap buffers, although that experience largely
>>>>>>>>>>>> comes from work on webservers; I haven't tested whether it
>>>>>>>>>>>> holds true for file I/O. If we operate directly upon a
>>>>>>>>>>>> FileChannel, we can write direct buffers directly to files.
>>>>>>>>>>>> DirectByteBuffer memory is guaranteed to be contiguous, so it
>>>>>>>>>>>> should have better performance around copying and manipulation
>>>>>>>>>>>> as well.
>>>>>>>>>>>>
>>>>>>>>>>>> -ck
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
>>>>>>>>>>>>> Ralph, maybe a dumb idea but...
>>>>>>>>>>>>> Can't we simply write to /dev/null to avoid the hardware's
>>>>>>>>>>>>> influence on the test results?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers
>>>>>>>>>>>>> <ralph.go...@dslextreme.com> wrote:
>>>>>>>>>>>>>> Did you add the no-op appender to Ceki's project? Or are you
>>>>>>>>>>>>>> using our NullAppender? I have my doubts about using that
>>>>>>>>>>>>>> NullAppender, as it does nothing -- not even render the
>>>>>>>>>>>>>> Layout -- so it may get completely optimized away.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd still like to know what kind of disks Remko did his
>>>>>>>>>>>>>> original testing on. The page says he got 18.495 million
>>>>>>>>>>>>>> messages per second. Each message would be about 130
>>>>>>>>>>>>>> characters long, from what I can tell. That means the
>>>>>>>>>>>>>> destination has to be able to support at least about 2.4 GB
>>>>>>>>>>>>>> per second.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs.
>>>>>>>>>>>>>> 100 in Remko's). That means the rendered message size is
>>>>>>>>>>>>>> only going to be about 60 characters. For me the async test
>>>>>>>>>>>>>> is writing at about 1700 ops/ms, which equates to 102 MB/s,
>>>>>>>>>>>>>> yet I am still seeing the queue backed up. So how we are
>>>>>>>>>>>>>> writing must be terribly inefficient. Logback is getting
>>>>>>>>>>>>>> about 2000 ops/ms, which is still only 120 MB/s.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your test below would be generating about 197 MB/s.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing I did notice that made a big difference is that
>>>>>>>>>>>>>> Ceki has Logback configured to use a buffer size of 256K
>>>>>>>>>>>>>> while Log4j2 uses the default of 8K.
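[Editor's note] Ralph's back-of-the-envelope throughput figures check out, assuming one byte per rendered character:

```java
/**
 * Back-of-the-envelope check of the throughput figures in this thread,
 * assuming one byte per rendered character (1 MB = 1e6 bytes).
 */
public final class ThroughputMath {
    /** ops/ms * bytes/op = bytes/ms; dividing by 1000 yields MB/s. */
    static double megabytesPerSecond(double opsPerMs, int bytesPerMessage) {
        return opsPerMs * bytesPerMessage / 1000.0;
    }

    public static void main(String[] args) {
        // Log4j2 async file test: ~1700 ops/ms at ~60 bytes/message
        System.out.println(megabytesPerSecond(1700, 60));   // 102.0 MB/s
        // Logback: ~2000 ops/ms at ~60 bytes/message
        System.out.println(megabytesPerSecond(2000, 60));   // 120.0 MB/s
        // Remko's test: 18.495M msg/s (18,495 ops/ms) at ~130 bytes/message
        System.out.println(megabytesPerSecond(18495, 130)); // 2404.35 MB/s, i.e. ~2.4 GB/s
    }
}
```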
>>>>>>>>>>>>>> Setting Log4j2 to 256K considerably improved the
>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak
>>>>>>>>>>>>>> <cko...@ckozak.net> wrote:
>>>>>>>>>>>>>>>> If we would move the appender performance aside, am I
>>>>>>>>>>>>>>>> right to conclude that the entire complex async framework
>>>>>>>>>>>>>>>> built atop Disruptor with all its whistles and bells is
>>>>>>>>>>>>>>>> yielding a diminishing performance gain compared to a
>>>>>>>>>>>>>>>> simple off-the-shelf blocking queue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't seen any data that would give me such an
>>>>>>>>>>>>>>> impression -- the file appender itself is slower than the
>>>>>>>>>>>>>>> thread producing events, which appears to limit our
>>>>>>>>>>>>>>> throughput. Even then, log4j2 does much better than Logback
>>>>>>>>>>>>>>> in my testing on a Linux workstation with 20+ producer
>>>>>>>>>>>>>>> threads.
>>>>>>>>>>>>>>> Any time the queue is constantly full, performance will
>>>>>>>>>>>>>>> suffer (and this applies to standard blocking queues as
>>>>>>>>>>>>>>> well as Disruptor; however, Disruptor will fare worse when
>>>>>>>>>>>>>>> the queue is full).
>>>>>>>>>>>>>>> The results using a single thread are roughly reproducible
>>>>>>>>>>>>>>> on the non-async test, so I think the problem is somewhere
>>>>>>>>>>>>>>> between the layout and appender/manager.
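[Editor's note] Carter's suggestion upthread of operating directly on a FileChannel with a direct ByteBuffer, skipping the heap-to-heap copies he enumerates, might look roughly like this. It is a sketch under stated assumptions (hypothetical class name, single-buffer flush, no overflow handling for messages larger than the buffer), not Log4j code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Sketch: encode a StringBuilder straight into a reusable direct
 * ByteBuffer and hand it to a FileChannel, avoiding the intermediate
 * heap-buffer copies described in the thread. Overflow handling for
 * messages larger than the buffer is deliberately elided.
 */
public final class DirectChannelWriter implements AutoCloseable {
    private final FileChannel channel;
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    // 256 kB, matching the buffer size discussed in the thread.
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

    public DirectChannelWriter(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    public void write(StringBuilder message) throws IOException {
        // Encode directly from the CharSequence into the direct buffer...
        encoder.reset();
        encoder.encode(CharBuffer.wrap(message), buffer, true);
        buffer.flip();
        // ...and write it to the file without extracting a heap byte[].
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Whether this beats the existing heap-buffer path for file I/O is exactly the open question Carter raises; direct buffers mainly avoid the JDK's internal copy to a temporary direct buffer during channel writes.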
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Results running locally with no-op appender
>>>>>>>>>>>>>>> implementations:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1 thread:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Benchmark                                       Mode  Cnt     Score     Error  Units
>>>>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt   4  3285.743 ± 427.963  ops/ms
>>>>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt   4  2383.532 ± 136.034  ops/ms
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 20 threads:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Benchmark                                       Mode  Cnt     Score     Error  Units
>>>>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt   4  1177.746 ± 919.560  ops/ms
>>>>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt   4   602.614 ±  47.485  ops/ms
>>>>>>>>
>>>>>>>> -ck
>>
>> -ck