On Sat, Aug 28, 2021 at 3:16 PM Ron Grabowski
<rongrabow...@yahoo.com.invalid> wrote:

>  Follow-up to "Formatting the date is expensive. The process of actually
> formatting a value is reasonable". Is this still an issue from LOG4J2-930:
> %m %ex%n: 1,755,573 msg/sec
> %d %m %ex%n: 1,194,184 msg/sec
>

No, I believe that formatting the date is no longer the bottleneck.
The analysis done in LOG4J2-930 led to
https://issues.apache.org/jira/browse/LOG4J2-1097 which resulted in the
FixedDateFormat.
This format gives a good trade-off between speed and flexibility.
The drawback is that it only works for some fixed formats, but those are
the most widely used formats.
I don't think that focusing on date formatting and pre-generating
formatted timestamps will be fruitful (but I could be wrong).

Avoiding the PreciseTime Instant and using System.currentTimeMillis when it
is known that none of the downstream formatters require sub-millisecond
precision may be a good optimization.
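
To illustrate the idea (a standalone sketch, not Log4j internals; the
fraction-digits check mirrors the FixedFormat.secondFractionDigits field
Carter mentions further down):

// Sketch only: pick the cheaper time source when the date pattern renders at
// most millisecond precision. Names here are illustrative, not Log4j APIs.
final class TimestampSource {
    private final boolean subMillisNeeded;

    TimestampSource(int secondFractionDigits) {
        // 3 fraction digits == milliseconds; more implies micro/nano precision.
        this.subMillisNeeded = secondFractionDigits > 3;
    }

    long epochMillis() {
        return subMillisNeeded
                ? java.time.Instant.now().toEpochMilli() // stand-in for the precise clock
                : System.currentTimeMillis();            // cheaper, millisecond-only path
    }
}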

The message formatting (PatternLayout) as a whole is somehow expensive;
there may be more to optimize there.


> If so, isn't date rendering essentially a sequence we can generate ahead
> of time similar to how a local ID generator asks for an allocation of keys
> then uses that to quickly assign IDs to new objects? When it's time to
> render %d we can just grab it via an index:
>
> 1) At startup calculate the next 32k formatted dates. If
> Clock.currentTimeMillis() were configured down to the second, 32000 seconds
> would pre-allocate %d for the next 8 hours.
> 2) Apply math to Clock.currentTimeMillis() to get an index into the buffer.
> Seconds precision:
> [10] = "2021-08-28 09:44:31,000"
> [11] = "2021-08-28 09:44:32,000"
> [12] = "2021-08-28 09:44:33,000"
> [13] = "2021-08-28 09:44:34,000"
> [14] = "2021-08-28 09:44:35,000"
> [15] = "2021-08-28 09:44:36,000"
> ...
> [31999] = "..."
> 50ms precision:
> [10] = "2021-08-28 09:44:31,050"
> [11] = "2021-08-28 09:44:31,075"
> [12] = "2021-08-28 09:44:31,100"
> [13] = "2021-08-28 09:44:31,150"
> [14] = "2021-08-28 09:44:31,175"
> [15] = "2021-08-28 09:44:31,200"
> ...
> [31999] = "..."
>
> 3) Rendering %d{SEQ(DEFAULT,32000)} is just an index lookup into the
> sequence of 32000 pre-calculated %d{DEFAULT} values without the cost of
> formatting. I made up the "SEQ" notation; there's likely a better way to
> express the feature. Everything can read from the buffer without locking.
> 4) Have a background thread casually keep the sequence filled in a ring so
> that dates in the past are replaced with future dates and the structure
> consumes a consistent amount of memory.
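
For illustration, a minimal sketch of the ring idea above (made-up names,
one-second precision only, and the background refill thread left out):

// Sketch of the proposal above (illustrative only): pre-format one entry per
// second into a fixed-size ring and index it from the current time.
final class PreformattedDateRing {
    private static final int SIZE = 32_000; // ~8.9 hours at one-second precision
    private final String[] ring = new String[SIZE];
    private final java.time.format.DateTimeFormatter formatter =
            java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS")
                    .withZone(java.time.ZoneId.systemDefault());

    PreformattedDateRing(long startMillis) {
        fill(startMillis);
    }

    // A background thread would call this periodically so that entries in the
    // past are replaced with future ones and memory use stays constant.
    void fill(long fromMillis) {
        long startSecond = fromMillis / 1000;
        for (int i = 0; i < SIZE; i++) {
            long second = startSecond + i;
            ring[(int) (second % SIZE)] =
                    formatter.format(java.time.Instant.ofEpochSecond(second));
        }
    }

    // Rendering %d becomes an array lookup instead of a format call.
    String lookup(long currentTimeMillis) {
        return ring[(int) ((currentTimeMillis / 1000) % SIZE)];
    }
}
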
>     On Friday, August 27, 2021, 10:07:59 PM EDT, Carter Kozak <
> cko...@ckozak.net> wrote:
>
>  Thanks, Remko. The default '%d' uses FixedDateFormat with
> FixedFormat.DEFAULT. The FastDateFormat alternative does not support
> microseconds, so it doesn't suffer from the same problem. I think I can
> substantially reduce the frequency we re-format dates by checking
> FixedFormat.secondFractionDigits to determine if we need to compare
> microseconds.
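
A rough sketch of that caching idea (standalone and simplified to the
millisecond DEFAULT pattern; the real FixedDateFormat keeps more state than
this, and the names are illustrative):

// Illustration only: cache the formatted text keyed by millisecond, and only
// compare the sub-millisecond part when the pattern actually renders it.
final class CachingDateFormatter {
    private static final java.time.format.DateTimeFormatter FORMATTER =
            java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS")
                    .withZone(java.time.ZoneId.systemDefault());

    private final boolean compareSubMillis;
    private long cachedMillis = Long.MIN_VALUE;
    private int cachedNanoOfMilli = -1;
    private String cachedText;

    CachingDateFormatter(int secondFractionDigits) {
        // Only patterns with more than 3 fraction digits render sub-millisecond data.
        this.compareSubMillis = secondFractionDigits > 3;
    }

    String format(long epochMillis, int nanoOfMilli) {
        boolean cacheHit = epochMillis == cachedMillis
                && (!compareSubMillis || nanoOfMilli == cachedNanoOfMilli);
        if (!cacheHit) {
            // The expensive part; skipped whenever the cached value can be reused.
            cachedText = FORMATTER.format(java.time.Instant.ofEpochMilli(epochMillis));
            cachedMillis = epochMillis;
            cachedNanoOfMilli = nanoOfMilli;
        }
        return cachedText;
    }
}
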
>
> On Fri, Aug 27, 2021, at 16:10, Remko Popma wrote:
> > I remember looking at PatternLayout performance; I reported my findings
> here, hopefully they’re still useful:
> https://issues.apache.org/jira/browse/LOG4J2-930
> >
> > If %d is used in the pattern, does the FixedDateFormat get used?
> >
> >
> >
> >
> > > On Aug 28, 2021, at 4:33, Ralph Goers <ralph.go...@dslextreme.com>
> wrote:
> > >
> > > All of that agrees with my observations as well.
> > >
> > > Ralph
> > >
> > >> On Aug 27, 2021, at 12:23 PM, Carter Kozak <cko...@ckozak.net> wrote:
> > >>
> > >> I've identified a few things that seem impactful, but bear in mind
> that I haven't begun to drill down into them yet. I plan to file individual
> tickets and investigate in greater depth later on:
> > >>
> > >> 1. Formatting the date is expensive. The process of actually
> formatting a value is reasonable; however, using a precise clock appears to
> cause cache misses even when the pattern results in the same value
> (microseconds/nanoseconds aren't included in the result). Using the
> SystemMillisClock improves our throughput. I imagine some part of that
> improvement is the result of currentTimeMillis(): long rather than
> Clock.instant().
> > >>
> > >> 2. Setting 'log4j2.enable.direct.encoders' to false improves
> performance; otherwise we see a lot of time spent in the
> StringBuilderEncoder, but I haven't had time to investigate that yet.
> Without direct encoders, we are allocating a great deal more byte arrays,
> but they're used across a small space that C2 may be able to inline out.
> > >>
> > >> 3. Setting log4j2.formatMsgNoLookups=true (or using message pattern
> '%m{nolookup}') saves us some time scanning through produced messages --
> I've had this disabled at work for a while.
> > >>
> > >> 4. [in progress] If I implement the full layout directly in Java
> (write directly to a StringBuilder based on the event in a custom Layout
> instead of using PatternLayout) combined with [1] and [2] above,
> performance is much better than Logback; however, using PatternLayout with
> the steps above we fall just a bit short. I still need to read through
> PatternLayout and the individual pattern converters to understand why, but
> there's definitely some headroom hiding somewhere that we'll be able to reclaim.
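
For reference, one way to flip the flags from items 2 and 3 above in a
benchmark harness (property names copied from the messages above; they must
be set before Log4j initializes, e.g. via -D flags or early in main):

public final class BenchmarkFlags {
    public static void main(String[] args) {
        System.setProperty("log4j2.enable.direct.encoders", "false"); // item 2
        System.setProperty("log4j2.formatMsgNoLookups", "true");      // item 3
        // ... configure logging and run the workload from here ...
    }
}
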
> > >>
> > >> -ck
> > >>
> > >>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
> > >>> My pleasure, I enjoy digging into this sort of performance
> comparison; the
> > >>> toughest piece right now is balancing investigation with active
> projects at
> > >>> work. I completely agree about the importance of getting this
> resolved
> > >>> quickly and I intend to continue investigating.
> > >>>
> > >>> Re loom:
> > >>> Currently we have a few places which reuse shared objects,
> potentially large
> > >>> ones (thinking StringBuilder instances on
> ReusableLogEvent/StringLayout/etc)
> > >>> which are currently bounded by the number of application threads you
> have
> > >>> (which is relatively small compared to the loom claims which are
> admittedly
> > >>> large). More importantly, today we tend to reuse almost every thread
> in a
> > >>> ThreadPoolExecutor of some kind where the model in loom involves
> creating
> > >>> a new virtual thread for each task, meaning using a thread-local as
> a cache
> > >>> will have a much lower hit rate as reuse will only occur within a
> single task,
> > >>> and only if ThreadLocals are enabled. I hope that made sense.
> > >>>
> > >>> -ck
> > >>>
> > >>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
> > >>>> Carter, thanks so much for your great effort and attention on this
> issue. *I
> > >>>> personally find this the most important thing the project
> needs to
> > >>>> address with urgency right now.* Ceki is warming up for a new
> Logback (and
> > >>>> SLF4J?) release. I am pretty sure he will do his part in PR,
> particularly
> > >>>> in the context of benchmarking his new work with competitors. Once
> these
> > >>>> results are out – independent of whether they make sense or not –
> > >>>> "internet" will take this as the ultimate truth and it will require
> a
> > >>>> tremendous effort to change people's minds. In conclusion, please
> prioritize
> > >>>> this appropriately and keep us posted. Don't hesitate to let us/me
> know if
> > >>>> there is anything (drinks, coffee, lunch?) we can help with.
> > >>>>
> > >>>> Regarding your remarks about Loom, I don't think I follow you
> there. Our
> > >>>> thread-locals are stateless and hence should be perfectly fine while
> running in
> > >>>> virtual threads too. (I am totally against blindly using TLs, but
> that is
> > >>>> another discussion. I would rather prefer a recycler concept
> similar to the
> > >>>> one we have in JsonTemplateLayout. I think there was already a
> mailing list
> > >>>> thread about this.)
> > >>>>
> > >>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <cko...@ckozak.net>
> wrote:
> > >>>>
> > >>>>> Yes, I'm still looking into this.
> > >>>>>
> > >>>>> I agree that the zero-garbage code is difficult to follow, and Loom
> > >>>>> virtual threads will have less reuse so our thread-locals will
> create more
> > >>>>> overhead than they're worth in many cases. Fingers crossed for
> Valhalla to
> > >>>>> land before Loom, but I haven't been following updates from either
> project
> > >>>>> very closely lately, and Loom seems farther along.
> > >>>>>
> > >>>>> I think the issue with multiple buffers these days is that we have
> Optane
> > >>>>> and NVMe devices, which can be _nearly_ as fast as main memory, so
> copying
> > >>>>> contents can be problematic. In fully async mode (with some
> caveats around
> > >>>>> queue-full-policy configuration points and nested async appenders)
> we don't
> > >>>>> need locking because the appender interactions are single-threaded.
> > >>>>>
> > >>>>> For what it's worth, in my testing I didn't see much difference
> between
> > >>>>> the 8 kB buffer and 256 kB buffer on a system with a relatively
> fast NVMe
> > >>>>> drive; however, I did discover and resolve LOG4J2-3150 (incorrect
> default
> > >>>>> buffer size in the RandomAccessFileAppender).
> > >>>>>
> > >>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
> > >>>>>> While that is true, it does mean that we are locking higher up in
> the
> > >>>>> call stack than we need to be.
> > >>>>>>
> > >>>>>> Ralph
> > >>>>>>
> > >>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <tim.v...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>> I’m fairly certain the JIT will optimize out the locking
> operations on
> > >>>>> the nested synchronized calls after a while. I’m not sure how soon
> into the
> > >>>>> performance tests that would happen.
> > >>>>>>>
> > >>>>>>> Tim
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <
> ralph.go...@dslextreme.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> So are you continuing to look at this?  Frankly, the work on
> zero-gc
> > >>>>> stuff made this
> > >>>>>>>> very complicated. I am not sure I can even follow the logic
> anymore.
> > >>>>> I also noticed
> > >>>>>>>> that many of the methods dealing with byte buffers are
> synchronized.
> > >>>>> I am not sure
> > >>>>>>>> why as only the method that moves data really needs to be.
> > >>>>>>>>
> > >>>>>>>> Using multiple buffers would make sense if you are filling one
> from
> > >>>>> the various patterns
> > >>>>>>>> while it is unlocked and then only blocking when writing to the
> > >>>>> output stream buffer.
> > >>>>>>>>
> > >>>>>>>> For what it's worth, Ceki updated his performance page again
> yesterday
> > >>>>> to change
> > >>>>>>>> the wording but I am pretty sure he didn’t change the log4j2
> > >>>>> configuration to use a
> > >>>>>>>> 256K buffer to match Logback.  And the page still makes it look
> like
> > >>>>> he is testing
> > >>>>>>>> the performance of asynchronous processing when it is really
> testing
> > >>>>> the performance
> > >>>>>>>> of the file appenders with a blocking queue in front of them.
> > >>>>>>>>
> > >>>>>>>> Ralph
> > >>>>>>>>
> > >>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <cko...@ckozak.net>
> wrote:
> > >>>>>>>>>
> > >>>>>>>>> I also tried that with similar results, which leads me to
> believe
> > >>>>> we're spending
> > >>>>>>>>> too much time copying between buffers.
> > >>>>>>>>>
> > >>>>>>>>> We've proven that log4j can get LogEvents to the appender very
> > >>>>> quickly and
> > >>>>>>>>> efficiently.
> > >>>>>>>>>
> > >>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder.
> > >>>>> AbstractStringBuilder used to store a char array in java 8,
> however with
> > >>>>> jep 254 compact strings it was updated (similarly to String) to
> use a byte
> > >>>>> array + byte representing the encoding.
> > >>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination
> > >>>>> (Heap)ByteBuffer in StringBuilderEncoder using the following steps:
> > >>>>>>>>> 1. copy the StringBuilder value into a CharBuffer (this would
> have
> > >>>>> been a direct copy from char[] to char[] in java8, but requires
> conversion
> > >>>>> from byte[]+coder now)
> > >>>>>>>>> 2. encode data from the CharBuffer to the StringBuilderEncoder
> > >>>>> ByteBuffer
> > >>>>>>>>> 3. copy contents of the StringBuilderEncoder ByteBuffer to the
> > >>>>> ByteBufferDestination ByteBuffer (which is also always a heap
> buffer as far
> > >>>>> as I can tell)
> > >>>>>>>>> 4. Write the ByteBufferDestination to the
> > >>>>> FileOutputStream/RandomAccessFile by extracting the heap
> byte-array.
> > >>>>>>>>>
> > >>>>>>>>> We're copying bytes around a lot, and I think older JVMs would
> have
> > >>>>> optimized out a bit of the copying because the types exactly
> matched.
> > >>>>>>>>> It's unclear why we need to copy through two separate
> ByteBuffers
> > >>>>> from StringBuilderEncoder into ByteBufferDestination.
> > >>>>>>>>> I've generally had better I/O performance using direct
> byte-buffers
> > >>>>> than heap buffers, although that experience largely comes from
> work on
> > >>>>> webservers; I haven't tested whether it holds true for file I/O. If we
> operate
> > >>>>> directly upon a FileChannel, we can write direct buffers directly
> to files.
> > >>>>> The DirectByteBuffer memory is guaranteed to be contiguous, so it
> should
> > >>>>> have better performance around copying and manipulation as well.
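
For what it's worth, a bare-bones sketch of that direct path (illustrative
only, not the current appender/manager code): encode into a reusable direct
ByteBuffer and write it straight to a FileChannel, skipping the intermediate
heap copies. Single-threaded use and minimal error handling are assumed.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class DirectBufferWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    DirectBufferWriter(String path) throws IOException {
        this.channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Encode the rendered event text into the direct buffer, draining to the
    // channel whenever the buffer fills up. (Charset error handling omitted.)
    void write(CharSequence text) throws IOException {
        CharBuffer chars = CharBuffer.wrap(text);
        encoder.reset();
        while (encoder.encode(chars, buffer, true).isOverflow()) {
            flush();
        }
        encoder.flush(buffer);
    }

    void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer); // direct buffer goes straight to the file
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        flush();
        channel.close();
    }
}
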
> > >>>>>>>>>
> > >>>>>>>>> -ck
> > >>>>>>>>>
> > >>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
> > >>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to
> /dev/null
> > >>>>> to avoid
> > >>>>>>>>>> the hardware's influence on the test results?
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <
> > >>>>> ralph.go...@dslextreme.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Did you add the no-op appender to Ceki’s project? Or are you
> using
> > >>>>> our
> > >>>>>>>>>>> NullAppender? I have
> > >>>>>>>>>>> my doubts about using that NullAppender as it does nothing -
> not
> > >>>>> even
> > >>>>>>>>>>> render the Layout, so
> > >>>>>>>>>>> it may get completely optimized away.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I’d still like to know what kind of disks Remko did his
> original
> > >>>>> testing on.
> > >>>>>>>>>>> The page says he got
> > >>>>>>>>>>> 18.495 million messages per second. Each message would be
> about 130
> > >>>>>>>>>>> characters long
> > >>>>>>>>>>> from what I can tell. That means the destination has to be able to
> > >>>>> support
> > >>>>>>>>>>> at least about 2.4 GB
> > >>>>>>>>>>> per second.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In Ceki’s test the message is only 23 characters long (vs
> 100 in
> > >>>>> Remko’s).
> > >>>>>>>>>>> That means the
> > >>>>>>>>>>> message size is only going to be about 60 characters long.
> For me
> > >>>>> the
> > >>>>>>>>>>> async test is writing
> > >>>>>>>>>>> at about 1700 ops/ms, which equates to 102 MB/s, yet I am still
> > >>>>> seeing the
> > >>>>>>>>>>> queue backed up.
> > >>>>>>>>>>> So how we are writing must be terribly inefficient. Logback
> is
> > >>>>> getting
> > >>>>>>>>>>> about 2000 ops/ms,
> > >>>>>>>>>>> which is still only 120 MB/s.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Your test below would be generating about 197 MB/s.
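
Spelling out the arithmetic behind those throughput figures (using the ~130-byte
and ~60-byte rendered line sizes mentioned above):

  18.495 million msg/s × ~130 B/msg ≈ 2.4 GB/s   (Remko's published number)
  1700 ops/ms = 1.7 million msg/s × ~60 B/msg ≈ 102 MB/s   (Log4j2 async here)
  2000 ops/ms = 2.0 million msg/s × ~60 B/msg ≈ 120 MB/s   (Logback)
  3285.743 ops/ms × ~60 B/msg ≈ 197 MB/s   (the single-threaded no-op result below)
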
> > >>>>>>>>>>>
> > >>>>>>>>>>> One thing I did notice that made a big difference is that
> Ceki has
> > >>>>> Logback
> > >>>>>>>>>>> configured to
> > >>>>>>>>>>> use a buffer size of 256K while Log4j2 uses the default of
> 8K.
> > >>>>> Setting
> > >>>>>>>>>>> Log4j2 to 256K
> > >>>>>>>>>>> considerably improved the performance.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ralph
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <
> cko...@ckozak.net>
> > >>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> If we would move the appender performance aside, am I
> right to
> > >>>>> conclude
> > >>>>>>>>>>>>> that the entire complex async. framework built atop
> Disruptor
> > >>>>> with all
> > >>>>>>>>>>> its
> > >>>>>>>>>>>>> whistles and bells is yielding a diminishing performance
> gain
> > >>>>> compared
> > >>>>>>>>>>> to a
> > >>>>>>>>>>>>> simple off the shelf blocking queue?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I haven't seen any data that would give me such an
> impression --
> > >>>>> the
> > >>>>>>>>>>> file appender
> > >>>>>>>>>>>> itself is slower than the thread producing events, which
> appears
> > >>>>> to
> > >>>>>>>>>>> limit our throughput.
> > >>>>>>>>>>>> Even then, log4j2 does much better in my testing on a linux
> > >>>>> workstation
> > >>>>>>>>>>> with 20+ producer
> > >>>>>>>>>>>> threads compared to logback.
> > >>>>>>>>>>>> Any time the queue is constantly full, performance will
> suffer
> > >>>>> (and this
> > >>>>>>>>>>> applies to standard
> > >>>>>>>>>>>> blocking queues as well as disruptor, however disruptor
> will fare
> > >>>>> worse
> > >>>>>>>>>>> when the queue
> > >>>>>>>>>>>> is full).
> > >>>>>>>>>>>> The results using a single thread are roughly reproducible
> on the
> > >>>>>>>>>>> non-async test, so I think
> > >>>>>>>>>>>> the problem is somewhere between the layout, and
> appender/manager.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Results running locally with no-op appender implementations:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1 thread:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
> > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
> > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 20 threads:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score    Error  Units
> > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
> > >>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> -ck
> > >>>>>
> > >>>>
> > >>>
> > >
> > >
> >
>
> -ck
>
