This is a very useful proposal. LGTM

-Lari

On Tue, Jun 7, 2022 at 3:48 AM Matteo Merli <matteo.me...@gmail.com> wrote:

> https://github.com/apache/pulsar/issues/15954
>
> WIP can be seen at: https://github.com/apache/pulsar/pull/15955
>
> -----------
>
>
> ## Motivation
>
> The current implementation of the read cache in the Pulsar broker has
> largely remained unchanged for a long time, except for a few minor tweaks.
>
> While the implementation is stable and reasonably efficient for typical
> workloads, the overhead of managing cache evictions in a broker that is
> running many topics can be quite high, both in extra CPU utilization and in
> JVM garbage collection work to track an increased number of medium-lived
> objects.
>
> The goal is to provide an alternative implementation that can adapt better
> to a wider variety of operating conditions.
>
> ### Current implementation details
>
> The broker cache is implemented as part of the `ManagedLedger` component,
> which sits in the Pulsar broker and provides a higher level of abstraction
> on top of BookKeeper.
>
> Each topic (and managed-ledger) has its own private cache space. This cache
> is implemented as a `ConcurrentSkipList` sorted map that maps
> `(ledgerId, entryId) -> payload`. The payload is a `ByteBuf` reference that
> can either be a slice of the `ByteBuf` that we got when reading from a
> socket, or a copied buffer.
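>
> To make the shape of the current per-topic cache concrete, here is a rough,
> hypothetical sketch (illustrative names only, not the actual `ManagedLedger`
> classes) of a sorted concurrent map keyed by `(ledgerId, entryId)` holding
> one `ByteBuf` per entry:
>
> ```java
> import io.netty.buffer.ByteBuf;
> import java.util.concurrent.ConcurrentNavigableMap;
> import java.util.concurrent.ConcurrentSkipListMap;
>
> // Hypothetical sketch of the per-topic cache shape: a concurrent skip-list
> // map keyed by (ledgerId, entryId). Every cached entry needs its own map
> // node plus a ByteBuf reference, all of which the GC has to track.
> class TopicCacheSketch {
>     record Key(long ledgerId, long entryId) implements Comparable<Key> {
>         @Override
>         public int compareTo(Key o) {
>             int c = Long.compare(ledgerId, o.ledgerId);
>             return c != 0 ? c : Long.compare(entryId, o.entryId);
>         }
>     }
>
>     private final ConcurrentNavigableMap<Key, ByteBuf> entries =
>             new ConcurrentSkipListMap<>();
>
>     void put(long ledgerId, long entryId, ByteBuf payload) {
>         // payload is either a retained slice of the socket buffer or a copy
>         entries.put(new Key(ledgerId, entryId), payload);
>     }
>
>     ByteBuf get(long ledgerId, long entryId) {
>         return entries.get(new Key(ledgerId, entryId));
>     }
> }
> ```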
>
> Each topic cache is allowed to use up to the full broker max cache size
> before an eviction is triggered. The total cache size is effectively a
> resource shared across all the topics, where a topic can take a larger
> portion of it if it "asks for more".
>
> When the eviction happens, we need to do an expensive ranking of all the
> caches in the broker and evict from each of them proportionally to its
> currently used space.
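>
> A minimal, hypothetical sketch of that eviction pass (illustrative names,
> not the actual `ManagedLedger` code), assuming each per-topic cache can
> report its size and free a requested amount of space:
>
> ```java
> import java.util.ArrayList;
> import java.util.List;
>
> // Illustrative interface for a per-topic cache that the broker-wide
> // eviction pass can shrink.
> interface SizedCache {
>     long getSize();
>     void evict(long bytesToFree);
> }
>
> class ProportionalEvictionSketch {
>     // Rank all caches by size and ask each one to free space in proportion
>     // to its share of the total cached data. With tens of thousands of
>     // topics, this pass alone is a significant amount of work per eviction.
>     static void evict(List<SizedCache> caches, long bytesToFree) {
>         long totalSize = caches.stream().mapToLong(SizedCache::getSize).sum();
>         if (totalSize == 0) {
>             return;
>         }
>         List<SizedCache> ranked = new ArrayList<>(caches);
>         ranked.sort((a, b) -> Long.compare(b.getSize(), a.getSize()));
>         for (SizedCache cache : ranked) {
>             long share = bytesToFree * cache.getSize() / totalSize;
>             cache.evict(share);
>         }
>     }
> }
> ```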
>
> The bigger problem is the `ConcurrentSkipList` and the `ByteBuf` objects
> that need to be tracked. The skip list is essentially a "tree"-like
> structure and needs to maintain Java objects for each entry in the cache.
> We also potentially need to keep a huge number of `ByteBuf` objects around.
>
> A cache workload is typically the worst-case scenario for any garbage
> collector implementation because it involves creating objects, storing them
> for some amount of time and then throwing them away. During that time, the
> GC will have already tenured these objects and copied them into an "old
> generation" space, and sometime later a costly compaction of that memory
> will have to be performed.
>
> To mitigate the effect of the cache workload on the GC, we are very
> aggressive in purging the cache by triggering time-based eviction. By
> putting a max TTL on the elements in the cache, we avoid keeping the
> objects around long enough to become a problem for the GC.
>
> The flip side is that we're artificially limiting the cache retention to a
> very short time frame, which reduces the cache's usefulness.
>
> The other problem is the CPU cost of these frequent evictions, which can be
> very high when there are tens of thousands of topics in a broker.
>
>
> ## Proposed changes
>
> Instead of dealing with individual caches for each topic, let's adopt a
> model with a single cache space for the whole broker.
>
> This cache is broken into N segments that act as a circular buffer. Whenever
> a segment is full, we start writing into the next one, and when we reach the
> last one, we wrap around and start recycling the first segment.
>
> Each segment is composed of a buffer, an offset, and a hashmap which maps
> `(ledgerId, entryId) -> offset`.
>
> This model has been working very well for the BookKeeper `ReadCache`:
>
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java
>
> There are two main advantages to this approach:
>
>  1. Entries are copied into the cache buffer (in direct memory), so we don't
>     need to keep any long-lived Java objects around
>  2. Eviction becomes a completely trivial operation: buffers are just
>     rotated and overwritten. We don't need to do any per-topic work or keep
>     track of utilization.
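>
> To illustrate the structure, here is a minimal, hypothetical sketch of one
> cache segment (illustrative names and layout, assuming the same idea as the
> BookKeeper `ReadCache` linked above; not the final implementation):
>
> ```java
> import io.netty.buffer.ByteBuf;
> import io.netty.buffer.PooledByteBufAllocator;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // One segment of the broker-wide cache: a pre-allocated direct-memory
> // buffer, a write offset, and an index mapping (ledgerId, entryId) to the
> // offset where the entry was copied.
> class CacheSegmentSketch {
>     record Key(long ledgerId, long entryId) {}
>
>     private final ByteBuf buffer;
>     private final AtomicInteger writeOffset = new AtomicInteger();
>     private final Map<Key, Integer> index = new ConcurrentHashMap<>();
>
>     CacheSegmentSketch(int segmentSize) {
>         buffer = PooledByteBufAllocator.DEFAULT.directBuffer(segmentSize, segmentSize);
>     }
>
>     // Copies the entry into the segment. Returns false when the segment is
>     // full, so the caller can rotate to the next segment.
>     boolean insert(long ledgerId, long entryId, ByteBuf entry) {
>         int size = entry.readableBytes();
>         int offset = writeOffset.getAndAdd(4 + size);
>         if (offset + 4 + size > buffer.capacity()) {
>             return false;
>         }
>         buffer.setInt(offset, size);
>         buffer.setBytes(offset + 4, entry, entry.readerIndex(), size);
>         index.put(new Key(ledgerId, entryId), offset);
>         return true;
>     }
>
>     ByteBuf get(long ledgerId, long entryId) {
>         Integer offset = index.get(new Key(ledgerId, entryId));
>         if (offset == null) {
>             return null;
>         }
>         int size = buffer.getInt(offset);
>         return buffer.retainedSlice(offset + 4, size);
>     }
>
>     // "Eviction" of a whole segment is just clearing the index and resetting
>     // the write offset before the segment is reused.
>     void clear() {
>         index.clear();
>         writeOffset.set(0);
>     }
> }
> ```
>
> The broker-wide cache would then just keep an array of N such segments and a
> pointer to the segment currently being written.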
>
> ### API changes
>
> No user-facing API changes are required.
>
> ### New configuration options
>
> The existing cache implementation will not be removed at this point. Users
> will be able to configure the old implementation in `broker.conf`.
>
> This option will be helpful in case performance regressions are seen for
> some use cases with the new cache implementation.
>
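> As a purely hypothetical illustration of such a switch (the actual option
> name and values are defined in the PR linked above, not here), the
> `broker.conf` entry could look something like:
>
> ```properties
> # Hypothetical option name: selects which managed-ledger cache
> # implementation the broker uses (new segmented cache vs. old per-topic
> # cache). See the PR for the real setting.
> managedLedgerCacheImplementation=segmented
> ```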
