> Instead of dealing with individual caches for each topic, let's adopt
> a model where
> there is a single cache space for the broker.

If all topics share the same cache space, we won't recycle space per topic.
So when some topics are rebalanced to other brokers, the previous broker
keeps their entries in its cache, which wastes cache space.
Topic shedding can be quite frequent in some cases, and I'm not sure how bad
the impact will be. Maybe we could add some metrics, like a valid cache
space ratio, for monitoring?

Thanks,
Haiting

On 2022/06/07 00:48:20 Matteo Merli wrote:
> https://github.com/apache/pulsar/issues/15954
> 
> WIP can be seen at: https://github.com/apache/pulsar/pull/15955
> 
> -----------
> 
> 
> ## Motivation
> 
> The current implementation of the read cache in the Pulsar broker has largely
> remained unchanged for a long time, except for a few minor tweaks.
> 
> While the implementation is stable and reasonably efficient for typical
> workloads, the overhead of managing cache evictions in a broker running
> many topics can be quite high, both in extra CPU utilization and in JVM
> garbage-collection work to track an increased number of medium-lived
> objects.
> 
> The goal is to provide an alternative implementation that can adapt better to
> a wider variety of operating conditions.
> 
> ### Current implementation details
> 
> The broker cache is implemented as part of the `ManagedLedger` component,
> which sits in the Pulsar broker and provides a higher level of abstraction
> on top of BookKeeper.
> 
> Each topic (and managed-ledger) has its own private cache space. This
> cache is implemented
> as a `ConcurrentSkipList` sorted map that maps `(ledgerId, entryId) ->
> payload`. The payload
> is a `ByteBuf` reference that can either be a slice of a `ByteBuf` that we got
> when reading from a socket, or it can be a copied buffer.
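> As a minimal, hypothetical sketch (names are illustrative; the real cache
> stores refcounted `ByteBuf` payloads rather than byte arrays), the
> per-topic structure is roughly:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Simplified stand-in for the per-topic entry cache: a sorted map keyed by
// (ledgerId, entryId). Sorted order lets range reads and range evictions
// operate on contiguous runs of entries.
class TopicCacheSketch {
    record EntryKey(long ledgerId, long entryId) implements Comparable<EntryKey> {
        public int compareTo(EntryKey o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }

    private final ConcurrentSkipListMap<EntryKey, byte[]> entries =
            new ConcurrentSkipListMap<>();

    void put(long ledgerId, long entryId, byte[] payload) {
        entries.put(new EntryKey(ledgerId, entryId), payload);
    }

    byte[] get(long ledgerId, long entryId) {
        return entries.get(new EntryKey(ledgerId, entryId));
    }
}
```

> Note that every cached entry here costs at least two long-lived Java
> objects (the key and the skip-list node), which is what makes this layout
> expensive for the GC.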
> 
> Each topic cache is allowed to use the full broker max cache size before an
> eviction is triggered. The total cache size is effectively a resource shared
> across all the topics, where a topic can use a larger portion of it if it
> "asks for more".
> 
> When eviction happens, we need to do an expensive ranking of all the caches
> in the broker and evict from each one in proportion to its currently used
> space.
> 
> The bigger problem is the `ConcurrentSkipList` and the `ByteBuf` objects
> that need to be tracked. The skip list is essentially a "tree"-like
> structure that has to maintain a Java object for each entry in the cache.
> We also potentially need a huge number of `ByteBuf` objects.
> 
> A cache workload is typically the worst-case scenario for a garbage
> collector because it involves creating objects, storing them for some
> amount of time, and then throwing them away. During that time, the GC will
> have already tenured these objects and copied them into an "old generation"
> space, and sometime later, a costly compaction of that memory has to be
> performed.
> 
> To mitigate the effect of the cache workload on the GC, we're being
> very aggressive in
> purging the cache by triggering time-based eviction. By putting a max
> TTL on the elements in
> the cache, we can avoid keeping the objects around for too long to be
> a problem for the GC.
> 
> The flip side is that we're artificially reducing the cache capacity to a
> very short time frame, reducing the cache's usefulness.
> 
> The other problem is the CPU cost of these frequent evictions, which can be
> very high when there are tens of thousands of topics in a broker.
> 
> 
> ## Proposed changes
> 
> Instead of dealing with individual caches for each topic, let's adopt
> a model where
> there is a single cache space for the broker.
> 
> This cache is broken into N segments that act as a circular buffer.
> Whenever a segment is full, we start writing into the next one, and when we
> reach the last one, we go back and recycle the first segment.
> 
> Each segment is composed of a buffer, an offset, and a hashmap which maps
> `(ledgerId, entryId) -> offset`.
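> As a rough sketch of that design (illustrative names only; plain heap
> arrays stand in for the direct-memory buffers, and a string key stands in
> for the `(ledgerId, entryId)` hash key):

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed segmented cache: N fixed-size segments
// used as a circular buffer. Eviction is just rotating to the next segment
// and overwriting it; no per-topic bookkeeping is needed.
class SegmentedCacheSketch {
    static final class Segment {
        final byte[] data; // backing buffer (direct memory in the real design)
        int writeOffset;   // next free position in the buffer
        // "ledgerId:entryId" -> {offset, length} within this segment
        final ConcurrentHashMap<String, int[]> index = new ConcurrentHashMap<>();
        Segment(int size) { data = new byte[size]; }
        void clear() { writeOffset = 0; index.clear(); }
    }

    private final Segment[] segments;
    private int current; // segment currently being written

    SegmentedCacheSketch(int numSegments, int segmentSize) {
        segments = new Segment[numSegments];
        for (int i = 0; i < numSegments; i++) {
            segments[i] = new Segment(segmentSize);
        }
    }

    // Assumes payload.length <= segmentSize.
    synchronized void put(long ledgerId, long entryId, byte[] payload) {
        Segment seg = segments[current];
        if (seg.writeOffset + payload.length > seg.data.length) {
            // Segment full: rotate to the next one and recycle it wholesale.
            current = (current + 1) % segments.length;
            seg = segments[current];
            seg.clear();
        }
        System.arraycopy(payload, 0, seg.data, seg.writeOffset, payload.length);
        seg.index.put(ledgerId + ":" + entryId,
                new int[]{seg.writeOffset, payload.length});
        seg.writeOffset += payload.length;
    }

    synchronized byte[] get(long ledgerId, long entryId) {
        String key = ledgerId + ":" + entryId;
        for (Segment seg : segments) {
            int[] loc = seg.index.get(key);
            if (loc != null) {
                return Arrays.copyOfRange(seg.data, loc[0], loc[0] + loc[1]);
            }
        }
        return null; // not cached (may have been evicted by rotation)
    }
}
```

> Eviction never walks individual entries: recycling a segment is a single
> `clear()` call, regardless of how many topics wrote into it.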
> 
> This model has been working very well for the BookKeeper `ReadCache`:
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java
> 
> There are two main advantages to this approach:
> 
>  1. Entries are copied into the cache buffer (in direct memory), and
> we don't need to keep any
>     long-lived Java objects around
>  2. Eviction becomes a completely trivial operation: buffers are just
>     rotated and overwritten. We don't need to do any per-topic work or
>     keep track of utilization.
> 
> ### API changes
> 
> No user-facing API changes are required.
> 
> ### New configuration options
> 
> The existing cache implementation will not be removed at this point. Users
> will be able to configure the old implementation in `broker.conf`.
> 
> This option will be helpful in case performance regressions are seen with
> the new cache implementation for some use cases.
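> Purely for illustration (the actual property name is defined by the
> implementation PR, not here), the switch could be a single `broker.conf`
> entry:

```properties
# Hypothetical property name -- see the PR for the real one.
# Selects the managed-ledger cache implementation; the existing per-topic
# cache stays available as a fallback in case of regressions.
managedLedgerCacheImplementation=sharedSegmentedCache
```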
> 
