Hey Jun,

Thanks much for the explanation.

I understand the advantage of partition_epoch over metadata_epoch. My
current concern is that using both leader_epoch and partition_epoch
requires considerable changes to the consumer's public API to handle the
case where the user stores offsets externally. For example,
consumer.commitSync(..) would have to take a map whose value is <offset,
metadata, leader epoch, partition epoch>, and consumer.seek(...) would
also need leader_epoch and partition_epoch as parameters. Technically we
could probably still make this work in a backward compatible manner after
careful design and discussion, but these changes would make the consumer's
interface unnecessarily complex for most users, who do not store offsets
externally.
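
To make the concern concrete, the API change might have to look something
like the sketch below (the names OffsetAndMetadataWithEpochs and
ConsumerWithEpochs are made up for illustration; they are not in the KIP):

import java.util.Map;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch only -- not the proposal in the KIP. If offsets had
// to carry both epochs, the public API might need to grow along these lines:
class OffsetAndMetadataWithEpochs {
    final long offset;
    final String metadata;
    final int leaderEpoch;      // would be new
    final int partitionEpoch;   // would be new

    OffsetAndMetadataWithEpochs(long offset, String metadata,
                                int leaderEpoch, int partitionEpoch) {
        this.offset = offset;
        this.metadata = metadata;
        this.leaderEpoch = leaderEpoch;
        this.partitionEpoch = partitionEpoch;
    }
}

interface ConsumerWithEpochs<K, V> {
    // commitSync would take a map whose values carry the two extra epochs
    void commitSync(Map<TopicPartition, OffsetAndMetadataWithEpochs> offsets);

    // seek would also need both epochs so that a position restored from
    // external storage can be validated against the current partition
    void seek(TopicPartition partition, long offset,
              int leaderEpoch, int partitionEpoch);
}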

After thinking about it more, I believe we can address all of the problems
discussed by using only the metadata_epoch, without introducing
leader_epoch or partition_epoch. The current KIP describes the changes to
the consumer API and how the new API can be used if the user stores
offsets externally. To address the scenario you described earlier, we can
include the metadata_epoch in the FetchResponse and the
LeaderAndIsrRequest. The consumer remembers the largest metadata_epoch
from all the FetchResponses it has received. The metadata_epoch committed
with the offset, either within or outside Kafka, should be the largest
metadata_epoch across all FetchResponses and MetadataResponses ever
received by this consumer.
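
In other words, the client-side bookkeeping would be roughly the following
(a sketch; the class and method names are illustrative, not actual
consumer internals):

// Rough sketch of the client-side bookkeeping; names are illustrative.
class MetadataEpochTracker {
    private int maxMetadataEpoch = -1;

    // Called with the metadata_epoch carried in every FetchResponse and
    // MetadataResponse the consumer receives.
    synchronized void onEpochFromBroker(int metadataEpoch) {
        maxMetadataEpoch = Math.max(maxMetadataEpoch, metadataEpoch);
    }

    // The epoch committed together with the offset, whether inside or
    // outside Kafka, is the largest epoch seen so far.
    synchronized int epochToCommit() {
        return maxMetadataEpoch;
    }
}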

The drawback of using only the metadata_epoch is that we cannot always do
the smart offset reset in the case of unclean leader election that you
mentioned earlier. But in most cases, unclean leader election probably
happens while the consumer is not rebalancing/restarting. In those cases,
either the consumer is not directly affected by the unclean leader
election, because it is not consuming from the end of the log, or the
consumer can derive the leader_epoch from the most recent message it
received before it sees the OffsetOutOfRangeException (see the sketch
below). So I am not sure it is worth adding leader_epoch to the consumer
API to address the remaining corner case. What do you think?
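
For reference, the fallback I have in mind for that corner case is
sketched below. It is illustrative only: epochOf() is a hypothetical
helper, since the current consumer API does not expose a record's leader
epoch, even though it is carried in the record batch header on the wire.

import org.apache.kafka.clients.consumer.ConsumerRecord;

// Illustrative sketch only; epochOf() is hypothetical.
abstract class SmartOffsetResetSketch {
    private int lastRecordLeaderEpoch = -1;

    // hypothetical helper: extract the leader epoch of the batch that
    // contained this record
    abstract int epochOf(ConsumerRecord<byte[], byte[]> record);

    void onRecord(ConsumerRecord<byte[], byte[]> record) {
        // remember the leader epoch of the most recent message consumed
        lastRecordLeaderEpoch = epochOf(record);
        // ... process the record ...
    }

    void onOffsetOutOfRange() {
        // Compare lastRecordLeaderEpoch with the epoch reported by the new
        // leader to detect truncation from an unclean leader election,
        // instead of blindly applying the offset reset policy.
    }
}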

Thanks,
Dong



On Tue, Jan 2, 2018 at 6:28 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> Thanks for the reply.
>
> To solve the topic recreation issue, we could use either a global metadata
> version or a partition level epoch. But either one will be a new concept,
> right? To me, the latter seems more natural. It also makes it easier to
> detect if a consumer's offset is still valid after a topic is recreated. As
> you pointed out, we don't need to store the partition epoch in the message.
> The following is what I am thinking. When a partition is created, we can
> assign a partition epoch from an ever-increasing global counter and store
> it in /brokers/topics/[topic]/partitions/[partitionId] in ZK. The partition
> epoch is propagated to every broker. The consumer will be tracking a tuple
> of <offset, leader epoch, partition epoch> for offsets. If a topic is
> recreated, it's possible that a consumer's offset and leader epoch still
> match those in the broker, but the partition epoch won't. In this case, we
> can potentially still treat the consumer's offset as out of range and reset
> the offset based on the offset reset policy in the consumer. This seems
> harder to do with a global metadata version.
>
> Jun
>
>
>
> On Mon, Dec 25, 2017 at 6:56 AM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > This is a very good example. After thinking through this in detail, I
> > agree
> > that we need to commit offset with leader epoch in order to address this
> > example.
> >
> > I think the remaining question is how to address the scenario that the
> > topic is deleted and re-created. One possible solution is to commit
> > offset
> > with both the leader epoch and the metadata version. The logic and the
> > implementation of this solution does not require a new concept (e.g.
> > partition epoch) and it does not require any change to the message format
> > or leader epoch. It also allows us to order the metadata in a
> > straightforward manner which may be useful in the future. So it may be a
> > better solution than generating a random partition epoch every time we
> > create a partition. Does this sound reasonable?
> >
> > Previously one concern with using the metadata version is that consumer
> > will be forced to refresh metadata even if metadata version is increased
> > due to topics that the consumer is not interested in. Now I realized that
> > this is probably not a problem. Currently the client refreshes metadata
> > either due to an InvalidMetadataException in the response from the broker
> > or due to metadata expiry. The addition of the metadata version should not
> > increase the overhead of metadata refresh caused by
> > InvalidMetadataException. If the client refreshes metadata due to expiry
> > and receives metadata whose version is lower than the current metadata
> > version, we can reject the metadata but still reset the metadata age,
> > which essentially keeps the existing behavior in the client.
> >
> > Thanks much,
> > Dong
> >
>
