Hi Jose,

Thanks for the proposal. I think there are three main motivations for
snapshotting over the existing compaction semantics.

First, we are arguing that compaction is a poor semantic fit for how we want
to model the metadata in the cluster. We are trying to view the changes in
the cluster as a stream of events, not necessarily as a stream of key/value
updates. This is useful because a single event may correspond to a whole
set of key/value updates: for example, if we are deleting a full topic, we
don't need to delete each partition individually. Outside of deletion,
however, the benefits of this approach are less obvious. I am wondering if
there are other cases where the event-based approach has some benefit?
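
To illustrate the kind of distinction I mean, here is a rough sketch (the
types below are purely illustrative, not anything from the KIP):

    // Purely illustrative: contrasting an event-style record with the
    // equivalent key/value updates a compacted topic would need.
    interface MetadataEvent {}

    // One event captures the whole intent of deleting a topic...
    record TopicDeleted(String topicName) implements MetadataEvent {}

    // ...whereas the key/value model needs a tombstone per partition key:
    //   ("my-topic-0", null), ("my-topic-1", null), ..., ("my-topic-N", null)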

The second motivation is from the perspective of consistency. Basically, we
don't like the existing solution for the tombstone deletion problem, which
is just to add a delay before removal. The case we are concerned about
requires a replica to fetch up to a specific offset and then stall for
longer than the deletion retention timeout. If this happens, the replica
might miss the tombstone entirely, which would lead to an inconsistent
state. I think we are already talking about a rare case, but I wonder if
there are simple ways to tighten it further. For the sake of argument, what
if we had the replica start over from the beginning whenever its
replication delay exceeds the tombstone retention time? Just want to be
sure we're not missing any simple/pragmatic solutions here...
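
To make that concrete, here is a rough sketch of the check I have in mind
(the names and structure are hypothetical, not anything in the current
fetcher code):

    // Hypothetical sketch only, not actual Kafka internals. If the replica
    // has made no fetch progress for longer than the tombstone retention
    // time (delete.retention.ms), it can no longer be sure it has seen
    // every tombstone, so it rewinds to the log start offset and rebuilds.
    public class FollowerFetchState {
        private final long deleteRetentionMs;  // tombstone retention
        private long lastProgressTimeMs;       // time of last fetch progress
        private long nextFetchOffset;

        public FollowerFetchState(long deleteRetentionMs, long nowMs,
                                  long startOffset) {
            this.deleteRetentionMs = deleteRetentionMs;
            this.lastProgressTimeMs = nowMs;
            this.nextFetchOffset = startOffset;
        }

        // Decide where the next fetch should start.
        public long nextFetchOffset(long nowMs, long logStartOffset) {
            if (nowMs - lastProgressTimeMs > deleteRetentionMs) {
                // Stalled past the retention window: tombstones may already
                // have been cleaned away, so start over from the beginning.
                nextFetchOffset = logStartOffset;
            }
            return nextFetchOffset;
        }

        // Record progress after a fetch response has been applied.
        public void onFetchProgress(long newEndOffset, long nowMs) {
            this.nextFetchOffset = newEndOffset;
            this.lastProgressTimeMs = nowMs;
        }
    }

Obviously the real logic would live in the fetcher and would need to handle
truncation as well, but that is the shape of the check.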

Finally, I think we are arguing that compaction gives a poor performance
tradeoff when the state is already in memory. It requires us to read and
replay all of the changes even though we already know the end result. One
way to think about it is that the cost of compaction is O(rate of change)
while the cost of snapshotting is O(size of the data). Conversely, the nice
thing about compaction is that it works irrespective of the size of the
data, which makes it a better fit for user partitions. I feel like this
might be an argument we can make empirically or at least with
back-of-the-napkin calculations. If we assume a fixed size of data and a
certain rate of change, then what are the respective costs of snapshotting
vs compaction? I think compaction fares worse as the rate of change
increases. In the case of __consumer_offsets, which sometimes has to
support a very high rate of offset commits, I think snapshotting would be a
great tradeoff to reduce load time on coordinator failover. The rate of
change for metadata, on the other hand, might not be as high, though it can
be very bursty.
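
As a strawman for that back-of-the-napkin math, something like the
following (every number here is made up purely for illustration, not a
measurement):

    // Made-up illustrative numbers; compares bytes of I/O per cleanup cycle.
    public class SnapshotVsCompaction {
        public static void main(String[] args) {
            long stateSizeBytes = 100L * 1024 * 1024;  // 100 MB of live state
            long recordSizeBytes = 200;                // average record size
            long intervalSec = 5 * 60;                 // clean every 5 minutes

            for (long changesPerSec : new long[] {10, 1_000, 100_000}) {
                // Snapshotting: write the full in-memory state once per
                // interval, independent of how many changes arrived.
                long snapshotBytes = stateSizeBytes;

                // Compaction (simplified): read and rewrite the log that
                // accumulated since the last cleaning, which grows with the
                // rate of change. Ignores re-reading the cleaned section.
                long dirtyBytes = changesPerSec * intervalSec * recordSizeBytes;
                long compactionBytes = 2 * dirtyBytes;  // read + write

                System.out.printf("rate=%d/s snapshot=%dMB compaction=%dMB%n",
                    changesPerSec,
                    snapshotBytes / (1024 * 1024),
                    compactionBytes / (1024 * 1024));
            }
        }
    }

With those (made-up) constants, compaction wins easily at tens of changes
per second and loses badly at hundreds of thousands, which matches the
intuition above.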

Thanks,
Jason


On Wed, Jul 29, 2020 at 2:03 PM Jose Garcia Sancio <jsan...@confluent.io>
wrote:

> Thanks Ron for the additional comments and suggestions.
>
> Here are the changes to the KIP:
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=17&selectedPageVersions=15
>
> On Wed, Jul 29, 2020 at 8:44 AM Ron Dagostino <rndg...@gmail.com> wrote:
> >
> > Thanks, Jose.  It's looking good.  Here is one minor correction:
> >
> > <<< If the Kafka topic partition leader receives a fetch request with an
> > offset and epoch greater than or equal to the LBO (x + 1, a)
> > >>> If the Kafka topic partition leader receives a fetch request with an
> > offset and epoch greater than or equal to the LBO (x + 1, b)
> >
>
> Done.
>
> > Here is one more question.  Is there an ability to evolve the snapshot
> > format over time, and if so, how is that managed for upgrades? It would
> > be both Controllers and Brokers that would depend on the format, correct?
> > Those could be the same thing if the controller was running inside the
> > broker JVM, but that is an option rather than a requirement, I think.
> > Might the Controller upgrade have to be coordinated with the broker
> > upgrade in the separate-JVM case?  Perhaps a section discussing this
> > would be appropriate?
> >
>
> The content sent through the FetchSnapshot RPC is expected to be
> compatible with future changes. In KIP-631 the Kafka Controller is
> going to use the existing Kafka Message and versioning scheme.
> Specifically see section "Record Format Versions". I added some
> wording around this.
>
> Thanks!
> -Jose
>
