All of that agrees with my observations as well.

Ralph

> On Aug 27, 2021, at 12:23 PM, Carter Kozak <cko...@ckozak.net> wrote:
> 
> I've identified a few things that seem impactful, but bear in mind that I 
> haven't begun to drill down into them yet. I plan to file individual tickets 
> and investigate in greater depth later on:
> 
> 1. Formatting the date is expensive. The process of actually formatting a 
> value is reasonably fast; however, using a precise clock appears to cause 
> cache misses even when the pattern results in the same value 
> (microseconds/nanoseconds aren't included in the result). Using the 
> SystemMillisClock improves our throughput. I imagine some part of that 
> improvement comes from calling currentTimeMillis(), which returns a primitive 
> long, rather than allocating an Instant via Clock.instant().
> 
> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance; 
> otherwise we see a lot of time spent in StringBuilderEncoder, but I 
> haven't had time to investigate that yet. Without direct encoders we allocate 
> many more byte arrays, but they are used within a small enough scope that C2 
> may be able to optimize the allocations away.
> 
> 3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern 
> '%m{nolookups}') saves us some time scanning through produced messages -- I've 
> had this disabled at work for a while. (These settings are collected in a 
> snippet below.)
> 
> 4. [in progress] If I implement the full layout directly in Java (writing 
> directly to a StringBuilder based on the event in a custom Layout instead of 
> using PatternLayout), combined with [1] and [2] above, performance is much 
> better than Logback's; with PatternLayout plus the steps above, however, we 
> fall just a bit short. I still need to read through PatternLayout and the 
> individual pattern converters to understand why, but there's definitely some 
> headroom we'll be able to reclaim hiding somewhere. A rough sketch of the kind 
> of hand-rolled layout I mean follows below.
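> 
> For reference, the settings from items 1-3 can be applied before Log4j 
> initializes (or passed as -D flags). The clock property name below is from 
> memory, so treat it as an assumption and double-check it against the manual; 
> the other two come straight from the items above:
> 
>     // Run before any Logger is created (or pass these as -D flags).
>     // 'log4j2.clock' is my recollection of the property name -- verify it.
>     System.setProperty("log4j2.clock", "SystemMillisClock");       // item 1
>     System.setProperty("log4j2.enable.direct.encoders", "false");  // item 2
>     System.setProperty("log4j2.formatMsgNoLookups", "true");       // item 3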
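> 
> And a rough sketch of the hand-rolled layout from item 4. The class name and 
> output pattern are illustrative rather than the exact code I'm benchmarking, 
> and a truly garbage-free version would implement encode() instead of returning 
> a String from toSerializable(), but it shows the shape of the experiment:
> 
>     import java.nio.charset.StandardCharsets;
>     import org.apache.logging.log4j.core.LogEvent;
>     import org.apache.logging.log4j.core.config.plugins.Plugin;
>     import org.apache.logging.log4j.core.config.plugins.PluginFactory;
>     import org.apache.logging.log4j.core.layout.AbstractStringLayout;
> 
>     @Plugin(name = "HandRolledLayout", category = "Core", elementType = "layout")
>     public final class HandRolledLayout extends AbstractStringLayout {
> 
>         private HandRolledLayout() {
>             super(StandardCharsets.UTF_8);
>         }
> 
>         @PluginFactory
>         public static HandRolledLayout createLayout() {
>             return new HandRolledLayout();
>         }
> 
>         // Writes the event straight into the reused thread-local StringBuilder,
>         // skipping the PatternLayout converter machinery entirely.
>         @Override
>         public String toSerializable(LogEvent event) {
>             StringBuilder sb = getStringBuilder();
>             sb.append(event.getTimeMillis()).append(' ')
>               .append(event.getLevel()).append(' ')
>               .append(event.getLoggerName()).append(" - ")
>               .append(event.getMessage().getFormattedMessage())
>               .append(System.lineSeparator());
>             return sb.toString();
>         }
>     }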
> 
> -ck
> 
> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
>> My pleasure; I enjoy digging into this sort of performance comparison. The
>> toughest piece right now is balancing the investigation with active projects
>> at work. I completely agree about the importance of getting this resolved
>> quickly, and I intend to continue investigating.
>> 
>> Re Loom:
>> Currently we have a few places that reuse shared, potentially large objects
>> (think StringBuilder instances on ReusableLogEvent/StringLayout/etc.), and the
>> number of those objects is currently bounded by the number of application
>> threads, which is relatively small compared to the (admittedly large) claims
>> made for Loom. More importantly, today we tend to reuse almost every thread in
>> a ThreadPoolExecutor of some kind, whereas the Loom model involves creating a
>> new virtual thread for each task. That means a thread-local used as a cache
>> will have a much lower hit rate, because reuse will only occur within a single
>> task, and only if thread-locals are enabled at all. I hope that makes sense.
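>> 
>> To make the hit-rate point concrete, here is a contrived sketch (it assumes a
>> Loom-era JDK that provides Executors.newVirtualThreadPerTaskExecutor; the
>> class name and counter are illustrative, not a benchmark):
>> 
>>     import java.util.concurrent.ExecutorService;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.atomic.AtomicInteger;
>> 
>>     public final class ThreadLocalHitRate {
>>         // Stand-in for a cached ReusableLogEvent/StringBuilder: count how often
>>         // the "expensive" initializer runs, i.e. how often the cache misses.
>>         private static final AtomicInteger MISSES = new AtomicInteger();
>>         private static final ThreadLocal<StringBuilder> BUFFER =
>>                 ThreadLocal.withInitial(() -> {
>>                     MISSES.incrementAndGet();
>>                     return new StringBuilder(2048);
>>                 });
>> 
>>         public static void main(String[] args) {
>>             int tasks = 10_000;
>>             // With Executors.newFixedThreadPool(8) the initializer runs ~8 times;
>>             // with one virtual thread per task it runs roughly once per task, so
>>             // the thread-local cache effectively never pays off.
>>             try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
>>                 for (int i = 0; i < tasks; i++) {
>>                     pool.submit(() -> BUFFER.get().setLength(0));
>>                 }
>>             }
>>             System.out.println("tasks=" + tasks + ", cache misses=" + MISSES.get());
>>         }
>>     }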
>> 
>> -ck
>> 
>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
>>> Carter, thanks so much for your great effort and attention on this issue. *I
>>> personally find this the most important thing the project needs to
>>> address with urgency right now.* Ceki is warming up for a new Logback (and
>>> SLF4J?) release. I am pretty sure he will do his part in PR, particularly
>>> in the context of benchmarking his new work against competitors. Once those
>>> results are out – independent of whether they make sense or not – the
>>> "internet" will take them as the ultimate truth, and it will require a
>>> tremendous effort to change people's minds. In conclusion, please prioritize
>>> this appropriately and keep us posted. Don't hesitate to let us/me know if
>>> there is anything (drinks, coffee, lunch?) we can help with.
>>> 
>>> Regarding your remarks about Loom, I don't think I follow you there. Our
>>> thread-locals are stateless, hence they should be perfectly fine when running
>>> in virtual threads too. (I am totally against blindly using TLs, but that is
>>> another discussion. I would prefer a recycler concept similar to the one we
>>> have in JsonTemplateLayout; I think there was already a mailing list thread
>>> about this.)
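>>> 
>>> For concreteness, here is roughly the shape of the recycler idea -- an
>>> illustrative sketch only, not the actual JsonTemplateLayout API (the
>>> QueueingRecycler name and signatures are made up):
>>> 
>>>     import java.util.Queue;
>>>     import java.util.concurrent.ArrayBlockingQueue;
>>>     import java.util.function.Consumer;
>>>     import java.util.function.Supplier;
>>> 
>>>     // Pooling is decoupled from threads, so it behaves the same whether the
>>>     // caller is a pooled platform thread or a short-lived virtual thread.
>>>     public final class QueueingRecycler<V> {
>>>         private final Supplier<V> factory;
>>>         private final Consumer<V> cleaner;
>>>         private final Queue<V> pool;
>>> 
>>>         public QueueingRecycler(Supplier<V> factory, Consumer<V> cleaner, int capacity) {
>>>             this.factory = factory;
>>>             this.cleaner = cleaner;
>>>             this.pool = new ArrayBlockingQueue<>(capacity);
>>>         }
>>> 
>>>         public V acquire() {
>>>             V value = pool.poll();
>>>             return value != null ? value : factory.get();
>>>         }
>>> 
>>>         public void release(V value) {
>>>             cleaner.accept(value);
>>>             pool.offer(value); // silently dropped if the pool is already full
>>>         }
>>>     }
>>> 
>>>     // Usage: acquire a StringBuilder, render into it, release it back.
>>>     // QueueingRecycler<StringBuilder> recycler =
>>>     //         new QueueingRecycler<>(StringBuilder::new, sb -> sb.setLength(0), 64);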
>>> 
>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net> wrote:
>>> 
>>>> Yes, I'm still looking into this.
>>>> 
>>>> I agree that the zero-garbage code is difficult to follow, and Loom
>>>> virtual threads will see less reuse, so our thread-locals will create more
>>>> overhead than they're worth in many cases. Fingers crossed for Valhalla to
>>>> land before Loom, but I haven't been following updates from either project
>>>> very closely lately, and Loom seems farther along.
>>>> 
>>>> I think the issue with multiple buffers these days is that we have Optane
>>>> and NVMe devices which can be _nearly_ as fast as main memory, so copying
>>>> contents can be problematic. In fully async mode (with some caveats around
>>>> queue-full-policy configuration points and nested async appenders) we don't
>>>> need locking because the appender interactions are single-threaded.
>>>> 
>>>> For what it's worth, in my testing I didn't see much difference between
>>>> the 8 kB buffer and the 256 kB buffer on a system with a relatively fast
>>>> NVMe drive; however, I did discover and resolve LOG4J2-3150 (an incorrect
>>>> default buffer size in the RandomAccessFileAppender).
>>>> 
>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
>>>>> While that is true, it does mean that we are locking higher up in the
>>>>> call stack than we need to be.
>>>>> 
>>>>> Ralph
>>>>> 
>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com> wrote:
>>>>>> 
>>>>>> I’m fairly certain the JIT will optimize out the locking operations on
>>>> the nested synchronized calls after a while. I’m not sure how soon into the
>>>> performance tests that would happen.
>>>>>> 
>>>>>> Tim
>>>>>> 
>>>>>> 
>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <ralph.go...@dslextreme.com>
>>>> wrote:
>>>>>>> 
>>>>>>> So are you continuing to look at this?  Frankly, the work on zero-gc
>>>> stuff made this
>>>>>>> very complicated. I am not sure I can even follow the logic anymore.
>>>> I also noticed
>>>>>>> that many of the methods dealing with byte buffers are synchronized. I am
>>>>>>> not sure why, as only the method that actually moves the data needs to be.
>>>>>>> 
>>>>>>> Using multiple buffers would make sense if you are filling one from
>>>> the various patterns
>>>>>>> while it is unlocked and then only blocking when writing to the
>>>> output stream buffer.
>>>>>>> 
>>>>>>> For what it's worth, Ceki updated his performance page again yesterday to
>>>>>>> change the wording, but I am pretty sure he didn't change the Log4j2
>>>>>>> configuration to use a 256K buffer to match Logback. And the page still
>>>>>>> makes it look like he is testing the performance of asynchronous processing
>>>>>>> when it is really testing the performance of the file appenders with a
>>>>>>> blocking queue in front of them.
>>>>>>> 
>>>>>>> Ralph
>>>>>>> 
>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net> wrote:
>>>>>>>> 
>>>>>>>> I also tried that with similar results, which leads me to believe
>>>> we're spending
>>>>>>>> too much time copying between buffers.
>>>>>>>> 
>>>>>>>> We've proven that log4j can get LogEvents to the appender very
>>>> quickly and
>>>>>>>> efficiently.
>>>>>>>> 
>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder.
>>>>>>>> AbstractStringBuilder used to store a char array in Java 8; with JEP 254
>>>>>>>> compact strings it was updated (similarly to String) to use a byte array
>>>>>>>> plus a byte representing the encoding.
>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination
>>>>>>>> (heap) ByteBuffer in StringBuilderEncoder using the following steps:
>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a
>>>>>>>> direct char[]-to-char[] copy in Java 8, but it now requires conversion from
>>>>>>>> byte[] + coder).
>>>>>>>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the
>>>>>>>> ByteBufferDestination ByteBuffer (which is also always a heap buffer as far
>>>>>>>> as I can tell).
>>>>>>>> 4. Write the ByteBufferDestination to the
>>>>>>>> FileOutputStream/RandomAccessFile by extracting the heap byte array.
>>>>>>>> 
>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have
>>>>>>>> optimized away some of the copying because the types matched exactly.
>>>>>>>> It's also unclear why we need to copy through two separate ByteBuffers to
>>>>>>>> get from StringBuilderEncoder into the ByteBufferDestination.
>>>>>>>> I've generally had better I/O performance using direct byte buffers than
>>>>>>>> heap buffers, although that experience largely comes from work on web
>>>>>>>> servers; I haven't tested whether it holds true for file I/O. If we operate
>>>>>>>> directly on a FileChannel, we can write direct buffers straight to files.
>>>>>>>> The DirectByteBuffer memory is guaranteed to be contiguous, so it should
>>>>>>>> have better performance around copying and manipulation as well.
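>>>>>>>> 
>>>>>>>> A rough sketch of the direct-buffer path I have in mind (the
>>>>>>>> DirectBufferWriter name is made up and this is not a drop-in
>>>>>>>> ByteBufferDestination implementation; malformed-input handling and
>>>>>>>> thread-safety are elided):
>>>>>>>> 
>>>>>>>>     import java.io.IOException;
>>>>>>>>     import java.nio.ByteBuffer;
>>>>>>>>     import java.nio.CharBuffer;
>>>>>>>>     import java.nio.channels.FileChannel;
>>>>>>>>     import java.nio.charset.CharsetEncoder;
>>>>>>>>     import java.nio.charset.CoderResult;
>>>>>>>>     import java.nio.charset.StandardCharsets;
>>>>>>>>     import java.nio.file.Path;
>>>>>>>>     import java.nio.file.StandardOpenOption;
>>>>>>>> 
>>>>>>>>     final class DirectBufferWriter implements AutoCloseable {
>>>>>>>>         private final FileChannel channel;
>>>>>>>>         private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
>>>>>>>>         private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
>>>>>>>> 
>>>>>>>>         DirectBufferWriter(Path path) throws IOException {
>>>>>>>>             this.channel = FileChannel.open(path, StandardOpenOption.CREATE,
>>>>>>>>                     StandardOpenOption.WRITE, StandardOpenOption.APPEND);
>>>>>>>>         }
>>>>>>>> 
>>>>>>>>         // Encode the layout's StringBuilder straight into the direct buffer,
>>>>>>>>         // flushing to the channel when it fills -- no intermediate copies.
>>>>>>>>         void write(StringBuilder text) throws IOException {
>>>>>>>>             CharBuffer chars = CharBuffer.wrap(text);
>>>>>>>>             encoder.reset();
>>>>>>>>             CoderResult result;
>>>>>>>>             do {
>>>>>>>>                 result = encoder.encode(chars, buffer, true);
>>>>>>>>                 if (result.isOverflow()) {
>>>>>>>>                     flush();
>>>>>>>>                 }
>>>>>>>>             } while (result.isOverflow());
>>>>>>>>             while (encoder.flush(buffer).isOverflow()) {
>>>>>>>>                 flush();
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>> 
>>>>>>>>         void flush() throws IOException {
>>>>>>>>             buffer.flip();
>>>>>>>>             while (buffer.hasRemaining()) {
>>>>>>>>                 channel.write(buffer);
>>>>>>>>             }
>>>>>>>>             buffer.clear();
>>>>>>>>         }
>>>>>>>> 
>>>>>>>>         @Override
>>>>>>>>         public void close() throws IOException {
>>>>>>>>             flush();
>>>>>>>>             channel.close();
>>>>>>>>         }
>>>>>>>>     }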
>>>>>>>> 
>>>>>>>> -ck
>>>>>>>> 
>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
>>>>>>>>> Ralph, maybe a dumb idea, but... can't we simply write to /dev/null to
>>>>>>>>> avoid the hardware's influence on the test results?
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <
>>>> ralph.go...@dslextreme.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Did you add the no-op appender to Ceki’s project? Or are you using
>>>> our
>>>>>>>>>> NullAppender? I have
>>>>>>>>>> my doubts about using that NullAppender as it does nothing - not
>>>> even
>>>>>>>>>> render the Layout, so
>>>>>>>>>> it may get completely optimized away.
>>>>>>>>>> 
>>>>>>>>>> I'd still like to know what kind of disks Remko did his original testing
>>>>>>>>>> on. The page says he got 18.495 million messages per second. Each message
>>>>>>>>>> would be about 130 characters long from what I can tell. That means the
>>>>>>>>>> destination has to be able to sustain at least about 2.4 GB per second.
>>>>>>>>>> 
>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's).
>>>>>>>>>> That means each formatted message is only going to be about 60 characters
>>>>>>>>>> long. For me the async test is writing at about 1700 ops/ms, which equates
>>>>>>>>>> to about 102 MB/s, yet I am still seeing the queue backed up. So how we are
>>>>>>>>>> writing must be terribly inefficient. Logback is getting about 2000 ops/ms,
>>>>>>>>>> which is still only 120 MB/s.
>>>>>>>>>> 
>>>>>>>>>> Your test below would be generating about 197 MB/s.
>>>>>>>>>> 
>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has
>>>> Logback
>>>>>>>>>> configured to
>>>>>>>>>> use a buffer size of 256K while Log4j2 uses the default of 8K.
>>>> Setting
>>>>>>>>>> Log4j2 to 256K
>>>>>>>>>> considerably improved the performance.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Ralph
>>>>>>>>>> 
>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <cko...@ckozak.net>
>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> If we set the appender performance aside, am I right to conclude that
>>>>>>>>>>>> the entire complex async framework built atop Disruptor, with all its
>>>>>>>>>>>> bells and whistles, is yielding a diminishing performance gain compared
>>>>>>>>>>>> to a simple off-the-shelf blocking queue?
>>>>>>>>>>> 
>>>>>>>>>>> I haven't seen any data that would give me such an impression --
>>>> the
>>>>>>>>>> file appender
>>>>>>>>>>> itself is slower than the thread producing events, which appears
>>>> to
>>>>>>>>>> limit our throughput.
>>>>>>>>>>> Even then, log4j2 does much better in my testing on a linux
>>>> workstation
>>>>>>>>>> with 20+ producer
>>>>>>>>>>> threads compared to logback.
>>>>>>>>>>> Any time the queue is constantly full, performance will suffer
>>>> (and this
>>>>>>>>>> applies to standard
>>>>>>>>>>> blocking queues as well as disruptor, however disruptor will fare
>>>> worse
>>>>>>>>>> when the queue
>>>>>>>>>>> is full).
>>>>>>>>>>> The results using a single thread are roughly reproducible on the
>>>>>>>>>> non-async test, so I think
>>>>>>>>>>> the problem is somewhere between the layout, and appender/manager.
>>>>>>>>>>> 
>>>>>>>>>>> Results running locally with no-op appender implementations:
>>>>>>>>>>> 
>>>>>>>>>>> 1 thread:
>>>>>>>>>>> 
>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ±  427.963  ops/ms
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ±  136.034  ops/ms
>>>>>>>>>>> 
>>>>>>>>>>> 20 threads:
>>>>>>>>>>> 
>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error   Units
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ±  919.560  ops/ms
>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±   47.485  ops/ms
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> -ck
>>>> 
>>> 
>> 

