Hi, Satish, Ying, Harsha,

Do you think it would be useful to have a regular virtual meeting to
discuss this KIP? The goal of the meeting would be to share
design/development progress and discuss any open issues, so that we can
accelerate this KIP. If so, would every Tuesday (starting next week) 9am-10am PT
work for you? I can help set up a Zoom meeting, invite everyone who might
be interested, have it recorded and shared, etc.

Thanks,

Jun

On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <satish.dugg...@gmail.com>
wrote:

> Hi Kowshik,
>
> Thanks for looking into the KIP and sending your comments.
>
> 5001. Under the section "Follower fetch protocol in detail", the
> next-local-offset is the offset up to which the segments are copied to
> remote storage. Instead, would last-tiered-offset be a better name than
> next-local-offset? last-tiered-offset seems to naturally align well with
> the definition provided in the KIP.
>
> Both next-local-offset and local-log-start-offset were introduced to
> describe offsets related to the local log. We are fine with renaming it
> to last-tiered-offset as you suggested.
>
> 5002. After leadership is established for a partition, the leader would
> begin uploading a segment to remote storage. If successful, the leader
> would write the updated RemoteLogSegmentMetadata to the metadata topic (via
> RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
> useful that before the first time the segment is uploaded by the leader for
> a partition, the leader should ensure it has caught up to all the metadata
> events written so far in the metadata topic for that partition (ex: by
> previous leader). To achieve this, the leader could start a lease (using an
> establish_leader metadata event) before commencing tiering, and wait until
> the event is read back. For example, this seems useful to avoid cases where
> zombie leaders can be active for the same partition. This can also prove
> useful to help avoid making decisions on which segments to be uploaded for
> a partition, until the current leader has caught up to a complete view of
> all segments uploaded for the partition so far (otherwise this may cause
> same segment being uploaded twice -- once by the previous leader and then
> by the new leader).
>
> We allow copying segments to remote storage even when they have
> overlapping offsets. Please go through the KIP sections on the follower
> fetch protocol (1) and the follower to leader transition (2):
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
>
>
> 5003. There is a natural interleaving between uploading a segment to remote
> store and writing a metadata event for the same (via
> RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is
> uploaded, then the leader fails and a corresponding metadata event never
> gets written. In such cases, the orphaned remote segment has to be
> eventually deleted (since there is no confirmation of the upload). To
> handle this, we could use 2 separate metadata events viz. copy_initiated
> and copy_completed, so that copy_initiated events that don't have a
> corresponding copy_completed event can be treated as garbage and deleted
> from the remote object store by the broker.
>
> We already update RLMM with RemoteLogSegmentMetadata before and after
> copying a log segment. RemoteLogSegmentMetadata used to carry a flag
> indicating whether the copy had finished, but we are changing it to a
> state field in RemoteLogSegmentMetadata that holds the respective started
> and finished states. This also covers other operations such as delete.
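>
> For illustration only, a rough sketch of what that state field could look
> like (the enum and constant names below are not final and may change as
> the APIs evolve):
>
>     public enum RemoteLogSegmentState {
>         // Copy of the segment to remote storage has started but is not yet confirmed.
>         COPY_SEGMENT_STARTED,
>         // Copy finished; the segment is available for remote reads.
>         COPY_SEGMENT_FINISHED,
>         // Deletion of the segment from remote storage has started.
>         DELETE_SEGMENT_STARTED,
>         // The segment and its remote data have been deleted.
>         DELETE_SEGMENT_FINISHED
>     }
>
> A segment whose latest state is a *_STARTED entry without a matching
> *_FINISHED entry can then be treated as an incomplete operation.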
>
> 5004. In the default implementation of RLMM (using the internal topic
> __remote_log_metadata), a separate topic called
> __remote_segments_to_be_deleted is going to be used just to track failures
> in removing remote log segments. A separate topic (effectively another
> metadata stream) introduces some maintenance overhead and design
> complexity. It seems to me that the same can be achieved by using just
> the __remote_log_metadata topic with the following steps: 1) the leader
> writes a delete_initiated metadata event, 2) the leader deletes the segment
> and 3) the leader writes a delete_completed metadata event. Tiered segments
> that have a delete_initiated message but no delete_completed message can
> be considered failed and retried.
>
> Jun suggested in an earlier mail to keep this simple. As mentioned in
> our earlier replies, we decided not to have this topic, and we have
> updated the KIP. As I mentioned in an earlier comment, we are adding
> state entries for delete operations too.
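>
> As a rough sketch of how that retry could look (the listing and state
> accessors used below are hypothetical and not part of the proposed API):
>
>     // Illustrative only: re-attempt deletes that were started but never finished.
>     void retryPendingDeletes(TopicPartition tp,
>                              RemoteLogMetadataManager rlmm,
>                              RemoteStorageManager rsm) throws IOException {
>         for (RemoteLogSegmentMetadata metadata : rlmm.listRemoteLogSegments(tp)) {  // hypothetical listing API
>             if (metadata.state() == RemoteLogSegmentState.DELETE_SEGMENT_STARTED) {  // hypothetical state accessor
>                 rsm.deleteLogSegment(metadata);                 // remove the segment data from remote storage
>                 rlmm.putRemoteLogSegmentData(                   // record that the delete finished
>                         metadata.withState(RemoteLogSegmentState.DELETE_SEGMENT_FINISHED));  // hypothetical
>             }
>         }
>     }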
>
> 5005. When a Kafka cluster is provisioned for the first time with KIP-405
> tiered storage enabled, could you explain in the KIP how the
> bootstrap for the __remote_log_metadata topic will be performed in the
> default RLMM implementation?
>
> The __remote_log_metadata topic is created by default with the
> configured topic properties such as partitions, replication factor, etc.
> Can you be more specific about what you are looking for?
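>
> For example (an illustrative sketch only, not the final mechanism), the
> default RLMM could bootstrap the topic on startup roughly like this,
> using the configured partition count and replication factor:
>
>     Properties props = new Properties();
>     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, serverEndpoint);  // from onServerStarted()
>     try (Admin admin = Admin.create(props)) {
>         NewTopic metadataTopic = new NewTopic("__remote_log_metadata",
>                 metadataTopicPartitions,           // assumed RLMM config values
>                 metadataTopicReplicationFactor);   // (short) assumed RLMM config value
>         admin.createTopics(Collections.singleton(metadataTopic)).all().get();
>     }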
>
> 5008. The system-wide configuration 'remote.log.storage.enable' is used to
> enable tiered storage. Can this be made a topic-level configuration, so
> that the user can enable/disable tiered storage at a topic level rather
> than a system-wide default for an entire Kafka cluster?
>
> Yes, we mentioned in an earlier mail thread that it will be supported
> at the topic level too, and we have updated the KIP.
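>
> For example, once the topic-level config is in place, enabling tiering at
> topic creation could look roughly like the sketch below (this assumes the
> same "remote.log.storage.enable" key is exposed at the topic level; the
> final property name may differ), reusing an Admin client as in the
> earlier sketch:
>
>     Map<String, String> configs = new HashMap<>();
>     configs.put("remote.log.storage.enable", "true");  // assumed topic-level key
>     NewTopic topic = new NewTopic("my-tiered-topic", 12, (short) 3).configs(configs);
>     admin.createTopics(Collections.singleton(topic)).all().get();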
>
> 5009. Whenever a topic with tiered storage enabled is deleted, the
> underlying actions require the topic data to be deleted in local store as
> well as remote store, and eventually the topic metadata needs to be deleted
> too. What is the role of the controller in deleting a topic and its
> contents, while the topic has tiered storage enabled?
>
> When a topic partition is deleted, an event for its deletion is written
> to RLMM, and the controller considers the topic deleted only when all of
> its remote log segments are also deleted.
>
> 5010. RLMM APIs are currently synchronous, for example
> RLMM.putRemoteLogSegmentData waits until the put operation is completed in
> the remote metadata store. It may also block until the leader has caught up
> to the metadata (not sure). Could we make these apis asynchronous (ex:
> based on java.util.concurrent.Future) to provide room for tapping
> performance improvements such as non-blocking i/o?
> 5011. The same question as 5010 on sync vs async api for RSM. Have we
> considered the pros/cons of making the RSM apis asynchronous?
>
> Async methods are useful when other work can be done while the result
> is not yet available. In this case, we need the result before proceeding
> to the next actions. These APIs are still evolving, so they can be made
> asynchronous as and when needed rather than starting out asynchronous.
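>
> That said, a caller that needs non-blocking behaviour can wrap the
> synchronous call itself; a minimal sketch using the
> RLMM.putRemoteLogSegmentData signature from the current proposal (the
> executor and variable names are illustrative):
>
>     ExecutorService rlmmExecutor = Executors.newSingleThreadExecutor();
>     CompletableFuture<Void> result = CompletableFuture.runAsync(() -> {
>         try {
>             rlmm.putRemoteLogSegmentData(remoteLogSegmentMetadata);  // blocking call
>         } catch (IOException e) {
>             throw new UncheckedIOException(e);
>         }
>     }, rlmmExecutor);
>     // Proceed with the next action only after the result is known.
>     result.whenComplete((v, t) -> { /* handle success or failure */ });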
>
> Thanks,
> Satish.
>
> On Fri, Aug 14, 2020 at 4:30 AM Kowshik Prakasam <kpraka...@confluent.io>
> wrote:
> >
> > Hi Harsha/Satish,
> >
> > Thanks for the great KIP. Below is the first set of
> questions/suggestions
> > I had after making a pass on the KIP.
> >
> > 5001. Under the section "Follower fetch protocol in detail", the
> > next-local-offset is the offset up to which the segments are copied to
> > remote storage. Instead, would last-tiered-offset be a better name than
> > next-local-offset? last-tiered-offset seems to naturally align well with
> > the definition provided in the KIP.
> >
> > 5002. After leadership is established for a partition, the leader would
> > begin uploading a segment to remote storage. If successful, the leader
> > would write the updated RemoteLogSegmentMetadata to the metadata topic
> (via
> > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
> > useful that before the first time the segment is uploaded by the leader
> for
> > a partition, the leader should ensure to catch up to all the metadata
> > events written so far in the metadata topic for that partition (ex: by
> > previous leader). To achieve this, the leader could start a lease (using
> an
> > establish_leader metadata event) before commencing tiering, and wait
> until
> > the event is read back. For example, this seems useful to avoid cases
> where
> > zombie leaders can be active for the same partition. This can also prove
> > useful to help avoid making decisions on which segments to be uploaded
> for
> > a partition, until the current leader has caught up to a complete view of
> > all segments uploaded for the partition so far (otherwise this may cause
> > same segment being uploaded twice -- once by the previous leader and then
> > by the new leader).
> >
> > 5003. There is a natural interleaving between uploading a segment to
> remote
> > store, and, writing a metadata event for the same (via
> > RLMM.putRemoteLogSegmentData). There can be cases where a remote segment
> is
> > uploaded, then the leader fails and a corresponding metadata event never
> > gets written. In such cases, the orphaned remote segment has to be
> > eventually deleted (since there is no confirmation of the upload). To
> > handle this, we could use 2 separate metadata events viz. copy_initiated
> > and copy_completed, so that copy_initiated events that don't have a
> > corresponding copy_completed event can be treated as garbage and deleted
> > from the remote object store by the broker.
> >
> > 5004. In the default implementation of RLMM (using the internal topic
> > __remote_log_metadata), a separate topic called
> > __remote_segments_to_be_deleted is going to be used just to track
> failures
> > in removing remote log segments. A separate topic (effectively another
> > metadata stream) introduces some maintenance overhead and design
> > complexity. It seems to me that the same can be achieved just by using
> just
> > the __remote_log_metadata topic with the following steps: 1) the leader
> > writes a delete_initiated metadata event, 2) the leader deletes the
> segment
> > and 3) the leader writes a delete_completed metadata event. Tiered
> segments
> > that have delete_initiated message and not delete_completed message, can
> be
> > considered to be a failure and retried.
> >
> > 5005. When a Kafka cluster is provisioned for the first time with KIP-405
> > tiered storage enabled, could you explain in the KIP how the
> > bootstrap for the __remote_log_metadata topic will be performed in the
> > default RLMM implementation?
> >
> > 5006. I currently do not see details on the KIP on why RocksDB was chosen
> > as the default cache implementation, and how it is going to be used. Were
> > alternatives compared/considered? For example, it would be useful to
> > explain/evaluate the following: 1) debuggability of the RocksDB JNI
> > interface, 2) performance, 3) portability across platforms and 4)
> interface
> > parity of RocksDB’s JNI api with its underlying C/C++ api.
> >
> > 5007. For the RocksDB cache (the default implementation of RLMM), what is
> > the relationship/mapping between the following: 1) # of tiered
> partitions,
> > 2) # of partitions of metadata topic __remote_log_metadata and 3) # of
> > RocksDB instances? i.e. is the plan to have a RocksDB instance per tiered
> > partition, or per metadata topic partition, or just 1 for per broker?
> >
> > 5008. The system-wide configuration 'remote.log.storage.enable' is used
> to
> > enable tiered storage. Can this be made a topic-level configuration, so
> > that the user can enable/disable tiered storage at a topic level rather
> > than a system-wide default for an entire Kafka cluster?
> >
> > 5009. Whenever a topic with tiered storage enabled is deleted, the
> > underlying actions require the topic data to be deleted in local store as
> > well as remote store, and eventually the topic metadata needs to be
> deleted
> > too. What is the role of the controller in deleting a topic and its
> > contents, while the topic has tiered storage enabled?
> >
> > 5010. RLMM APIs are currently synchronous, for example
> > RLMM.putRemoteLogSegmentData waits until the put operation is completed
> in
> > the remote metadata store. It may also block until the leader has caught
> up
> > to the metadata (not sure). Could we make these apis asynchronous (ex:
> > based on java.util.concurrent.Future) to provide room for tapping
> > performance improvements such as non-blocking i/o?
> >
> > 5011. The same question as 5010 on sync vs async api for RSM. Have we
> > considered the pros/cons of making the RSM apis asynchronous?
> >
> >
> > Cheers,
> > Kowshik
> >
> >
> > On Thu, Aug 6, 2020 at 11:02 AM Satish Duggana <satish.dugg...@gmail.com
> >
> > wrote:
> >
> > > Hi Jun,
> > > Thanks for your comments.
> > >
> > > > At the high level, that approach sounds reasonable to
> > > me. It would be useful to document how RLMM handles overlapping
> archived
> > > offset ranges and how those overlapping segments are deleted through
> > > retention.
> > >
> > > Sure, we will document that in the KIP.
> > >
> > > >How is the remaining part of the KIP coming along? To me, the two
> biggest
> > > missing items are (1) more detailed documentation on how all the new
> APIs
> > > are being used and (2) metadata format and usage in the internal
> > > topic __remote_log_metadata.
> > >
> > > We are working on updating APIs based on the recent discussions and
> > > getting the perf numbers by plugging in rocksdb as a cache store for RLMM.
> > > We will update the KIP with the updated APIs and with the above
> > > requested details in a few days and let you know.
> > >
> > > Thanks,
> > > Satish.
> > >
> > >
> > >
> > >
> > > On Wed, Aug 5, 2020 at 12:49 AM Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > Hi, Ying, Satish,
> > > >
> > > > Thanks for the reply. At the high level, that approach sounds
> reasonable
> > > to
> > > > me. It would be useful to document how RLMM handles overlapping
> archived
> > > > offset ranges and how those overlapping segments are deleted through
> > > > retention.
> > > >
> > > > How is the remaining part of the KIP coming along? To me, the two
> biggest
> > > > missing items are (1) more detailed documentation on how all the new
> APIs
> > > > are being used and (2) metadata format and usage in the internal
> > > > topic __remote_log_metadata.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana <
> satish.dugg...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > > Thanks for your comment,
> > > > >
> > > > > 1001. Using the new leader as the source of truth may be fine too.
> > > What's
> > > > > not clear to me is when a follower takes over as the new leader,
> from
> > > which
> > > > > offset does it start archiving to the block storage. I assume that
> the
> > > new
> > > > > leader starts from the latest archived offset by the previous
> leader,
> > > but
> > > > > it seems that's not the case. It would be useful to document this
> in
> > > the
> > > > > Wiki.
> > > > >
> > > > > When a follower becomes a leader, it needs to find out the offset
> > > > > from which segments are to be copied to remote storage. This is found
> > > > > by traversing from the latest leader epoch in the leader epoch history
> > > > > and finding the highest offset of a segment with that epoch copied to
> > > > > remote storage, using the respective RLMM APIs. If it cannot find an
> > > > > entry, it checks the previous leader epoch, and so on, until it finds
> > > > > an entry. If there are no entries up to the earliest leader epoch in
> > > > > the leader epoch cache, it starts copying the segments from the
> > > > > earliest epoch entry’s offset.
> > > > > Added an example in the KIP here[1]. We will update RLMM APIs in
> the
> > > KIP.
> > > > >
> > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > > > >
> > > > > Satish.
> > > > >
> > > > >
> > > > > On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana <
> > > satish.dugg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hi Ying,
> > > > > > Thanks for your comment.
> > > > > >
> > > > > > 1001. Using the new leader as the source of truth may be fine
> too.
> > > What's
> > > > > > not clear to me is when a follower takes over as the new leader,
> from
> > > > > which
> > > > > > offset does it start archiving to the block storage. I assume
> that
> > > the
> > > > > new
> > > > > > leader starts from the latest archived offset by the previous
> > > leader,
> > > > > but
> > > > > > it seems that's not the case. It would be useful to document
> this in
> > > the
> > > > > > Wiki.
> > > > > >
> > > > > > When a follower becomes a leader, it needs to find out the offset
> > > > > > from which segments are to be copied to remote storage. This is found
> > > > > > by traversing from the latest leader epoch in the leader epoch history
> > > > > > and finding the highest offset of a segment with that epoch copied to
> > > > > > remote storage, using the respective RLMM APIs. If it cannot find an
> > > > > > entry, it checks the previous leader epoch, and so on, until it finds
> > > > > > an entry. If there are no entries up to the earliest leader epoch in
> > > > > > the leader epoch cache, it starts copying the segments from the
> > > > > > earliest epoch entry’s offset.
> > > > > > Added an example in the KIP here[1]. We will update RLMM APIs in
> the
> > > KIP.
> > > > > >
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > > > > >
> > > > > >
> > > > > > Satish.
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng
> <yi...@uber.com.invalid>
> > > > > wrote:
> > > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thank you for the comment! The current KIP is not very clear
> about
> > > this
> > > > > > > part.
> > > > > > >
> > > > > > > 1001. The new leader will start archiving from the earliest
> local
> > > > > segment
> > > > > > > that is not fully
> > > > > > > covered by the "valid" remote data. "valid" means the (offset,
> > > leader
> > > > > > > epoch) pair is valid
> > > > > > > based on the leader-epoch history.
> > > > > > >
> > > > > > > There are some edge cases where the same offset range (with the
> > > same
> > > > > leader
> > > > > > > epoch) can
> > > > > > > be copied to the remote storage more than once. But this kind
> of
> > > > > > > duplication shouldn't be a
> > > > > > > problem.
> > > > > > >
> > > > > > > Satish is going to explain the details in the KIP with
> examples.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 31, 2020 at 2:55 PM Jun Rao <j...@confluent.io>
> wrote:
> > > > > > >
> > > > > > > > Hi, Ying,
> > > > > > > >
> > > > > > > > Thanks for the reply.
> > > > > > > >
> > > > > > > > 1001. Using the new leader as the source of truth may be fine
> > > too.
> > > > > What's
> > > > > > > > not clear to me is when a follower takes over as the new
> leader,
> > > > > from which
> > > > > > > > offset does it start archiving to the block storage. I assume
> > > that
> > > > > the new
> > > > > > > > leader starts from the latest archived offset by the
> previous
> > > > > leader, but
> > > > > > > > it seems that's not the case. It would be useful to document
> > > this in
> > > > > the
> > > > > > > > wiki.
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng
> > > <yi...@uber.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > 1001.
> > > > > > > > >
> > > > > > > > > We did consider this approach. The concerns are
> > > > > > > > > 1)  This makes unclean-leader-election rely on remote
> storage.
> > > In
> > > > > case
> > > > > > > > the
> > > > > > > > > remote storage
> > > > > > > > >  is unavailable, Kafka will not be able to finish the
> > > > > > > > > unclean-leader-election.
> > > > > > > > > 2) Since the user set local retention time (or local
> retention
> > > > > bytes), I
> > > > > > > > > think we are expected to
> > > > > > > > > keep that much local data when possible (avoid truncating
> all
> > > the
> > > > > local
> > > > > > > > > data). But, as you said,
> > > > > > > > > unclean leader elections are very rare, this may not be a
> big
> > > > > problem.
> > > > > > > > >
> > > > > > > > > The current design uses the leader broker as
> source-of-truth.
> > > This
> > > > > is
> > > > > > > > > consistent with the
> > > > > > > > > existing Kafka behavior.
> > > > > > > > >
> > > > > > > > > By using remote storage as the source-of-truth, the
> follower
> > > logic
> > > > > can
> > > > > > > > be a
> > > > > > > > > little simpler,
> > > > > > > > > but the leader logic is going to be more complex. Overall,
> I
> > > don't
> > > > > see
> > > > > > > > > there many benefits
> > > > > > > > > of using remote storage as the source-of-truth.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Jul 28, 2020 at 10:25 AM Jun Rao <j...@confluent.io
> >
> > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Satish,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >
> > > > > > > > > > 1001. In your example, I was thinking that you could just
> > > > > download the
> > > > > > > > > > latest leader epoch from the object store. After that you
> > > know
> > > > > the
> > > > > > > > leader
> > > > > > > > > > should end with offset 1100. The leader will delete all
> its
> > > > > local data
> > > > > > > > > > before offset 1000 and start accepting new messages at
> offset
> > > > > 1100.
> > > > > > > > > > Consumer requests for messages before offset 1100 will be
> > > served
> > > > > from
> > > > > > > > the
> > > > > > > > > > object store. The benefit with this approach is that it's
> > > > > simpler to
> > > > > > > > > reason
> > > > > > > > > > about who is the source of truth. The downside is
> slightly
> > > > > increased
> > > > > > > > > > unavailability window during unclean leader election.
> Since
> > > > > unclean
> > > > > > > > > leader
> > > > > > > > > > elections are rare, I am not sure if this is a big
> concern.
> > > > > > > > > >
> > > > > > > > > > 1008. Yes, I think introducing sth like
> local.retention.ms
> > > > > seems more
> > > > > > > > > > consistent.
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana <
> > > > > > > > satish.dugg...@gmail.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > HI Jun,
> > > > > > > > > > > Thanks for your comments. We put our inline replies
> below.
> > > > > > > > > > >
> > > > > > > > > > > 1001. I was thinking that you could just use the tiered
> > > > > metadata to
> > > > > > > > do
> > > > > > > > > > the
> > > > > > > > > > > reconciliation. The tiered metadata contains offset
> ranges
> > > and
> > > > > epoch
> > > > > > > > > > > history. Those should be enough for reconciliation
> > > purposes.
> > > > > > > > > > >
> > > > > > > > > > > If we use remote storage as the source-of-truth during
> > > > > > > > > > > unclean-leader-election, it's possible that after
> > > > > reconciliation the
> > > > > > > > > > > remote storage will have more recent data than the new
> > > > > leader's local
> > > > > > > > > > > storage. For example, the new leader's latest message
> is
> > > > > offset 1000,
> > > > > > > > > > > while the remote storage has message 1100. In such a
> case,
> > > the
> > > > > new
> > > > > > > > > > > leader will have to download the messages from 1001 to
> > > 1100,
> > > > > before
> > > > > > > > > > > accepting new messages from producers. Otherwise, there
> > > would
> > > > > be a
> > > > > > > > gap
> > > > > > > > > > > in the local data between 1000 and 1101.
> > > > > > > > > > >
> > > > > > > > > > > Moreover, with the current design, leader epoch
> history is
> > > > > stored in
> > > > > > > > > > > remote storage, rather than the metadata topic. We did
> > > consider
> > > > > > > > saving
> > > > > > > > > > > epoch history in remote segment metadata. But the
> concern
> > > is
> > > > > that
> > > > > > > > > > > there is currently no limit for the epoch history size.
> > > > > > > > Theoretically,
> > > > > > > > > > > if a user has a very long remote retention time and
> there
> > > are
> > > > > very
> > > > > > > > > > > frequent leadership changes, the leader epoch history
> can
> > > > > become too
> > > > > > > > > > > long to fit into a regular Kafka message.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 1003.3 Having just a serverEndpoint string is probably
> not
> > > > > enough.
> > > > > > > > > > > Connecting to a Kafka cluster may need various security
> > > > > credentials.
> > > > > > > > We
> > > > > > > > > > can
> > > > > > > > > > > make RLMM configurable and pass in the properties
> through
> > > the
> > > > > > > > > configure()
> > > > > > > > > > > method. Ditto for RSM.
> > > > > > > > > > >
> > > > > > > > > > > RLMM and  RSM are already configurable and they take
> > > > > properties which
> > > > > > > > > > > start with "remote.log.metadata." and
> "remote.log.storage."
> > > > > > > > > > > respectively and a few others. We have listener-name
> as the
> > > > > config
> > > > > > > > for
> > > > > > > > > > > RLMM and other properties(like security) can be sent
> as you
> > > > > > > > suggested.
> > > > > > > > > > > We will update the KIP with the details.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 1008.1 We started with log.retention.hours and
> > > > > log.retention.minutes,
> > > > > > > > > and
> > > > > > > > > > > added log.retention.ms later. If we are adding a new
> > > > > configuration,
> > > > > > > > ms
> > > > > > > > > > > level config alone is enough and is simpler. We can
> build
> > > > > tools to
> > > > > > > > make
> > > > > > > > > > the
> > > > > > > > > > > configuration at different granularities easier. The
> > > > > definition of
> > > > > > > > > > > log.retention.ms is "The number of milliseconds to
> keep a
> > > log
> > > > > file
> > > > > > > > > > before
> > > > > > > > > > > deleting it". The deletion is independent of whether
> > > tiering is
> > > > > > > > enabled
> > > > > > > > > > or
> > > > > > > > > > > not. If this changes to just the local portion of the
> > > data, we
> > > > > are
> > > > > > > > > > changing
> > > > > > > > > > > the meaning of an existing configuration.
> > > > > > > > > > >
> > > > > > > > > > > We are fine with either way. We can go with
> > > log.retention.xxxx
> > > > > as the
> > > > > > > > > > > effective log retention instead of local log retention.
> > > With
> > > > > this
> > > > > > > > > > > convention, we need to introduce  local.log.retention
> > > instead
> > > > > of
> > > > > > > > > > > remote.log.retention.ms that we proposed. If
> > > log.retention.ms
> > > > > as -1
> > > > > > > > > > > then remote retention is also considered as unlimited
> but
> > > user
> > > > > should
> > > > > > > > > > > be able to set the local.retention.ms.
> > > > > > > > > > > So, we need to introduce local.log.retention.ms and
> > > > > > > > > > > local.log.retention.bytes which should  always  be <=
> > > > > > > > > > > log.retention.ms/bytes respectively.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 24, 2020 at 3:37 AM Jun Rao <
> j...@confluent.io>
> > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi, Satish,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the reply. A few quick comments below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1001. I was thinking that you could just use the
> tiered
> > > > > metadata to
> > > > > > > > > do
> > > > > > > > > > > the
> > > > > > > > > > > > reconciliation. The tiered metadata contains offset
> > > ranges
> > > > > and
> > > > > > > > epoch
> > > > > > > > > > > > history. Those should be enough for reconciliation
> > > purposes.
> > > > > > > > > > > >
> > > > > > > > > > > > 1003.3 Having just a serverEndpoint string is
> probably
> > > not
> > > > > enough.
> > > > > > > > > > > > Connecting to a Kafka cluster may need various
> security
> > > > > > > > credentials.
> > > > > > > > > We
> > > > > > > > > > > can
> > > > > > > > > > > > make RLMM configurable and pass in the properties
> > > through the
> > > > > > > > > > configure()
> > > > > > > > > > > > method. Ditto for RSM.
> > > > > > > > > > > >
> > > > > > > > > > > > 1008.1 We started with log.retention.hours and
> > > > > > > > log.retention.minutes,
> > > > > > > > > > and
> > > > > > > > > > > > added log.retention.ms later. If we are adding a new
> > > > > > > > configuration,
> > > > > > > > > ms
> > > > > > > > > > > > level config alone is enough and is simpler. We can
> build
> > > > > tools to
> > > > > > > > > make
> > > > > > > > > > > the
> > > > > > > > > > > > configuration at different granularities easier. The
> > > > > definition of
> > > > > > > > > > > > log.retention.ms is "The number of milliseconds to
> keep
> > > a
> > > > > log file
> > > > > > > > > > > before
> > > > > > > > > > > > deleting it". The deletion is independent of whether
> > > tiering
> > > > > is
> > > > > > > > > enabled
> > > > > > > > > > > or
> > > > > > > > > > > > not. If this changes to just the local portion of the
> > > data,
> > > > > we are
> > > > > > > > > > > changing
> > > > > > > > > > > > the meaning of an existing configuration.
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jul 23, 2020 at 11:04 AM Satish Duggana <
> > > > > > > > > > > satish.dugg...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the comments! Ying, Harsha and I
> > > discussed
> > > > > and put
> > > > > > > > > our
> > > > > > > > > > > > > comments below.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1001. The KIP described a few scenarios of unclean
> > > leader
> > > > > > > > > elections.
> > > > > > > > > > > This
> > > > > > > > > > > > > is very useful, but I am wondering if this is the
> best
> > > > > approach.
> > > > > > > > My
> > > > > > > > > > > > > understanding of the proposed approach is to allow
> the
> > > new
> > > > > > > > > (unclean)
> > > > > > > > > > > leader
> > > > > > > > > > > > > to take new messages immediately. While this
> increases
> > > > > > > > > availability,
> > > > > > > > > > it
> > > > > > > > > > > > > creates the problem that there could be multiple
> > > > > conflicting
> > > > > > > > > segments
> > > > > > > > > > > in
> > > > > > > > > > > > > the remote store for the same offset range. This
> seems
> > > to
> > > > > make it
> > > > > > > > > > > harder
> > > > > > > > > > > > > for RLMM to determine which archived log segments
> > > contain
> > > > > the
> > > > > > > > > correct
> > > > > > > > > > > data.
> > > > > > > > > > > > > For example, an archived log segment could at one
> time
> > > be
> > > > > the
> > > > > > > > > correct
> > > > > > > > > > > data,
> > > > > > > > > > > > > but be changed to incorrect data after an unclean
> > > leader
> > > > > > > > election.
> > > > > > > > > An
> > > > > > > > > > > > > alternative approach is to let the unclean leader
> use
> > > the
> > > > > > > > archived
> > > > > > > > > > > data as
> > > > > > > > > > > > > the source of truth. So, when the new (unclean)
> leader
> > > > > takes
> > > > > > > > over,
> > > > > > > > > it
> > > > > > > > > > > first
> > > > > > > > > > > > > reconciles the local data based on the archived
> data
> > > before
> > > > > > > > taking
> > > > > > > > > > new
> > > > > > > > > > > > > messages. This makes the job of RLMM a bit easier
> > > since all
> > > > > > > > > archived
> > > > > > > > > > > data
> > > > > > > > > > > > > are considered correct. This increases
> availability a
> > > bit.
> > > > > > > > However,
> > > > > > > > > > > since
> > > > > > > > > > > > > unclean leader elections are rare, this may be ok.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Firstly, We don't want to assume the remote
> storage is
> > > more
> > > > > > > > > reliable
> > > > > > > > > > > than
> > > > > > > > > > > > > Kafka. Kafka unclean leader election usually
> happens
> > > when
> > > > > there
> > > > > > > > is
> > > > > > > > > a
> > > > > > > > > > > large
> > > > > > > > > > > > > scale outage that impacts multiple racks (or even
> > > multiple
> > > > > > > > > > availability
> > > > > > > > > > > > > zones). In such a case, the remote storage may be
> > > > > unavailable or
> > > > > > > > > > > unstable.
> > > > > > > > > > > > > Pulling a large amount of data from the remote
> storage
> > > to
> > > > > > > > reconcile
> > > > > > > > > > the
> > > > > > > > > > > > > local data may also exacerbate the outage. With the
> > > current
> > > > > > > > design,
> > > > > > > > > > > the new
> > > > > > > > > > > > > leader can start working even when the remote
> storage
> > > is
> > > > > > > > > temporarily
> > > > > > > > > > > > > unavailable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Secondly, it is not easier to implement the
> reconciling
> > > > > logic at
> > > > > > > > > the
> > > > > > > > > > > leader
> > > > > > > > > > > > > side. It can take a long time for the new leader to
> > > > > download the
> > > > > > > > > > remote
> > > > > > > > > > > > > data and rebuild local producer id / leader epoch
> > > > > information.
> > > > > > > > > During
> > > > > > > > > > > this
> > > > > > > > > > > > > period, the leader cannot accept any requests from
> the
> > > > > clients
> > > > > > > > and
> > > > > > > > > > > > > followers. We have to introduce a new state for the
> > > > > leader, and a
> > > > > > > > > new
> > > > > > > > > > > error
> > > > > > > > > > > > > code to let the clients / followers know what is
> > > happening.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1002. RemoteStorageManager.
> > > > > > > > > > > > > 1002.1 There seems to be some inconsistencies in
> > > > > > > > > > RemoteStorageManager.
> > > > > > > > > > > We
> > > > > > > > > > > > > pass in RemoteLogSegmentId copyLogSegment(). For
> all
> > > other
> > > > > > > > methods,
> > > > > > > > > > we
> > > > > > > > > > > pass
> > > > > > > > > > > > > in RemoteLogSegmentMetadata.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nice catch, we can have the
> RemoteLogSegmentMetadata
> > > for
> > > > > > > > > > copyLogSegment
> > > > > > > > > > > > > too.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata
> > > inclusive
> > > > > or
> > > > > > > > > > exclusive?
> > > > > > > > > > > > >
> > > > > > > > > > > > > It is inclusive.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1002.3 It seems that we need an api to get the
> > > leaderEpoch
> > > > > > > > history
> > > > > > > > > > for
> > > > > > > > > > > a
> > > > > > > > > > > > > partition.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, updated the KIP with the new method.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1002.4 Could you define the type of
> > > > > RemoteLogSegmentContext?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is removed in the latest code and it is not
> > > needed.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1003 RemoteLogMetadataManager
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1003.1 I am not sure why we need both of the
> following
> > > > > methods
> > > > > > > > > > > > > in RemoteLogMetadataManager. Could we combine them
> into
> > > > > one that
> > > > > > > > > > takes
> > > > > > > > > > > in
> > > > > > > > > > > > > offset and returns RemoteLogSegmentMetadata?
> > > > > > > > > > > > >     RemoteLogSegmentId
> > > getRemoteLogSegmentId(TopicPartition
> > > > > > > > > > > topicPartition,
> > > > > > > > > > > > > long offset) throws IOException;
> > > > > > > > > > > > >     RemoteLogSegmentMetadata
> > > > > > > > > > > getRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > > > > > > > > > remoteLogSegmentId) throws IOException;
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good point, these can be merged for now. I guess we
> > > needed
> > > > > them
> > > > > > > > in
> > > > > > > > > > > earlier
> > > > > > > > > > > > > version of the implementation but it is not needed
> now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1003.2 There seems to be some inconsistencies in
> the
> > > > > methods
> > > > > > > > > below. I
> > > > > > > > > > > am
> > > > > > > > > > > > > not sure why one takes RemoteLogSegmentMetadata
> and the
> > > > > other
> > > > > > > > > > > > > takes RemoteLogSegmentId.
> > > > > > > > > > > > >     void
> > > putRemoteLogSegmentData(RemoteLogSegmentMetadata
> > > > > > > > > > > > > remoteLogSegmentMetadata) throws IOException;
> > > > > > > > > > > > >     void
> > > deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > > > > > > > > > remoteLogSegmentId) throws IOException;
> > > > > > > > > > > > >
> > > > > > > > > > > > > RLMM stores RemoteLogSegmentMetadata which is
> > > identified by
> > > > > > > > > > > > > RemoteLogsSegmentId. So, when it is added it takes
> > > > > > > > > > > > > RemoteLogSegmentMetadata. `delete` operation needs
> only
> > > > > > > > > > > RemoteLogsSegmentId
> > > > > > > > > > > > > as RemoteLogSegmentMetadata can be identified with
> > > > > > > > > > RemoteLogsSegmentId.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1003.3 In void onServerStarted(final String
> > > > > serverEndpoint), what
> > > > > > > > > > > > > is serverEndpoint used for?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This can be used by RLMM implementation to connect
> to
> > > the
> > > > > local
> > > > > > > > > Kafka
> > > > > > > > > > > > > cluster. Incase of default  implementation, it is
> used
> > > in
> > > > > > > > > > initializing
> > > > > > > > > > > > > kafka clients connecting to the local cluster.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1004. It would be useful to document how all the
> new
> > > APIs
> > > > > are
> > > > > > > > being
> > > > > > > > > > > used.
> > > > > > > > > > > > > For example, when is
> > > > > RemoteLogSegmentMetadata.markedForDeletion
> > > > > > > > > being
> > > > > > > > > > > set
> > > > > > > > > > > > > and used? How are
> > > > > > > > > > > > >
> > > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being
> > > > > > > > > > used?
> > > > > > > > > > > > >
> > > > > > > > > > > > > RLMM APIs are going through the changes and they
> > > should be
> > > > > ready
> > > > > > > > > in a
> > > > > > > > > > > few
> > > > > > > > > > > > > days. I will update the KIP and the mail  thread
> once
> > > they
> > > > > are
> > > > > > > > > ready.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1005. Handling partition deletion: The KIP says
> "RLMM
> > > will
> > > > > > > > > eventually
> > > > > > > > > > > > > delete these segments by using
> RemoteStorageManager."
> > > Which
> > > > > > > > replica
> > > > > > > > > > > does
> > > > > > > > > > > > > this logic?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is a good point. When a topic is deleted, it
> will
> > > not
> > > > > have
> > > > > > > > any
> > > > > > > > > > > > > leader/followers to do the cleanup. We will have a
> > > cleaner
> > > > > agent
> > > > > > > > > on a
> > > > > > > > > > > > > single broker in the cluster to do this cleanup, we
> > > plan
> > > > > to add
> > > > > > > > > that
> > > > > > > > > > in
> > > > > > > > > > > > > controller broker.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1006. "If there are any failures in removing
> remote log
> > > > > segments
> > > > > > > > > then
> > > > > > > > > > > those
> > > > > > > > > > > > > are stored in a specific topic (default as
> > > > > > > > > > > __remote_segments_to_be_deleted)
> > > > > > > > > > > > > and user can consume the events(which contain
> > > > > > > > > remote-log-segment-id)
> > > > > > > > > > > from
> > > > > > > > > > > > > that topic and clean them up from remote storage.
> "
> > > Not
> > > > > sure if
> > > > > > > > > it's
> > > > > > > > > > > worth
> > > > > > > > > > > > > the complexity of adding another topic. Could we
> just
> > > > > retry?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Sure, we can keep this simpler for now by logging
> an
> > > error
> > > > > after
> > > > > > > > > > > retries.
> > > > > > > > > > > > > We can give users a better way to process this in
> > > future.
> > > > > Oneway
> > > > > > > > > can
> > > > > > > > > > > be a
> > > > > > > > > > > > > dead letter topic which can be configured by the
> user.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1007. RemoteFetchPurgatory: Could we just reuse the
> > > > > existing
> > > > > > > > > > > > > fetchPurgatory?
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have 2 types of delayed operations waiting for 2
> > > > > different
> > > > > > > > > events.
> > > > > > > > > > > > > DelayedFetch waits for new messages from producers.
> > > > > > > > > > DelayedRemoteFetch
> > > > > > > > > > > > > waits for the remote-storage-read-task to finish.
> When
> > > > > either of
> > > > > > > > > the
> > > > > > > > > > 2
> > > > > > > > > > > > > events happens, we only want to notify one type of
> the
> > > > > delayed
> > > > > > > > > > > operations.
> > > > > > > > > > > > > It would be inefficient to put 2 types of delayed
> > > > > operations in
> > > > > > > > one
> > > > > > > > > > > > > purgatory, as the tryComplete() methods of the
> delayed
> > > > > operations
> > > > > > > > > can
> > > > > > > > > > > be
> > > > > > > > > > > > > triggered by irrelevant events.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1008. Configurations:
> > > > > > > > > > > > > 1008.1 remote.log.retention.ms,
> > > > > remote.log.retention.minutes,
> > > > > > > > > > > > > remote.log.retention.hours: It seems that we just
> need
> > > the
> > > > > ms
> > > > > > > > one.
> > > > > > > > > > > Also,
> > > > > > > > > > > > > are we changing the meaning of existing config
> > > > > log.retention.ms
> > > > > > > > to
> > > > > > > > > > > mean
> > > > > > > > > > > > > the
> > > > > > > > > > > > > local retention? For backward compatibility, it's
> > > better
> > > > > to not
> > > > > > > > > > change
> > > > > > > > > > > the
> > > > > > > > > > > > > meaning of existing configurations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We agree that we only need remote.log.retention.ms
> .
> > > But,
> > > > > the
> > > > > > > > > > existing
> > > > > > > > > > > > > Kafka
> > > > > > > > > > > > > configuration
> > > > > > > > > > > > > has 3 properties (log.retention.ms,
> > > log.retention.minutes,
> > > > > > > > > > > > > log.retention.hours). We just
> > > > > > > > > > > > > want to keep consistent with the existing
> properties.
> > > > > > > > > > > > > Existing log.retention.xxxx config is about log
> > > retention
> > > > > in
> > > > > > > > > broker’s
> > > > > > > > > > > > > storage which is local. It should be easy for
> users to
> > > > > configure
> > > > > > > > > > > partition
> > > > > > > > > > > > > storage with local retention and remote retention
> > > based on
> > > > > their
> > > > > > > > > > usage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1008.2 Should remote.log.storage.enable be at the
> topic
> > > > > level?
> > > > > > > > > > > > >
> > > > > > > > > > > > > We can introduce topic level config for the same
> > > remote.log
> > > > > > > > > settings.
> > > > > > > > > > > User
> > > > > > > > > > > > > can set the desired config while creating the
> topic.
> > > > > > > > > > > > > remote.log.storage.enable property is not allowed
> to be
> > > > > updated
> > > > > > > > > after
> > > > > > > > > > > the
> > > > > > > > > > > > > topic is created. Other remote.log.* properties
> can be
> > > > > modified.
> > > > > > > > We
> > > > > > > > > > > will
> > > > > > > > > > > > > support flipping remote.log.storage.enable in next
> > > > > versions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1009. It would be useful to list all limitations
> in a
> > > > > separate
> > > > > > > > > > section:
> > > > > > > > > > > > > compacted topic, JBOD, etc. Also, is changing a
> topic
> > > from
> > > > > delete
> > > > > > > > > to
> > > > > > > > > > > > > compact and vice versa allowed when tiering is
> enabled?
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 to have limitations in a separate section. We
> will
> > > > > update the
> > > > > > > > > KIP
> > > > > > > > > > > with
> > > > > > > > > > > > > that.
> > > > > > > > > > > > > Topic  created with effective value for
> > > remote.log.enabled
> > > > > as
> > > > > > > > true,
> > > > > > > > > > > can not
> > > > > > > > > > > > > change its retention policy from delete to compact.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1010. Thanks for performance numbers. Are those
> with
> > > > > RocksDB as
> > > > > > > > the
> > > > > > > > > > > cache?
> > > > > > > > > > > > >
> > > > > > > > > > > > > No, We have not yet added RocksDB support. This is
> > > based on
> > > > > > > > > in-memory
> > > > > > > > > > > map
> > > > > > > > > > > > > representation. We will add that support and update
> > > this
> > > > > thread
> > > > > > > > > after
> > > > > > > > > > > > > updating the KIP with the numbers.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Satish.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 21, 2020 at 6:49 AM Jun Rao <
> > > j...@confluent.io>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi, Satish, Ying, Harsha,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the updated KIP. A few more comments
> > > below.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1000. Regarding Colin's question on querying the
> > > metadata
> > > > > > > > > directly
> > > > > > > > > > > in the
> > > > > > > > > > > > > > remote block store. One issue is that not all
> block
> > > > > stores
> > > > > > > > offer
> > > > > > > > > > the
> > > > > > > > > > > > > needed
> > > > > > > > > > > > > > api to query the metadata. For example, S3 only
> > > offers
> > > > > an api
> > > > > > > > to
> > > > > > > > > > list
> > > > > > > > > > > > > > objects under a prefix and this api has the
> eventual
> > > > > > > > consistency
> > > > > > > > > > > > > semantic.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1001. The KIP described a few scenarios of
> unclean
> > > leader
> > > > > > > > > > elections.
> > > > > > > > > > > This
> > > > > > > > > > > > > > is very useful, but I am wondering if this is the
> > > best
> > > > > > > > approach.
> > > > > > > > > My
> > > > > > > > > > > > > > understanding of the proposed approach is to
> allow
> > > the
> > > > > new
> > > > > > > > > > (unclean)
> > > > > > > > > > > > > leader
> > > > > > > > > > > > > > to take new messages immediately. While this
> > > increases
> > > > > > > > > > availability,
> > > > > > > > > > > it
> > > > > > > > > > > > > > creates the problem that there could be multiple
> > > > > conflicting
> > > > > > > > > > > segments in
> > > > > > > > > > > > > > the remote store for the same offset range. This
> > > seems
> > > > > to make
> > > > > > > > it
> > > > > > > > > > > harder
> > > > > > > > > > > > > > for RLMM to determine which archived log segments
> > > > > contain the
> > > > > > > > > > correct
> > > > > > > > > > > > > data.
> > > > > > > > > > > > > > For example, an archived log segment could at one
> > > time
> > > > > be the
> > > > > > > > > > correct
> > > > > > > > > > > > > data,
> > > > > > > > > > > > > > but be changed to incorrect data after an unclean
> > > leader
> > > > > > > > > election.
> > > > > > > > > > An
> > > > > > > > > > > > > > alternative approach is to let the unclean leader
> > > use the
> > > > > > > > > archived
> > > > > > > > > > > data
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > the source of truth. So, when the new (unclean)
> > > leader
> > > > > takes
> > > > > > > > > over,
> > > > > > > > > > it
> > > > > > > > > > > > > first
> > > > > > > > > > > > > > reconciles the local data based on the archived
> data
> > > > > before
> > > > > > > > > taking
> > > > > > > > > > > new
> > > > > > > > > > > > > > messages. This makes the job of RLMM a bit easier
> > > since
> > > > > all
> > > > > > > > > > archived
> > > > > > > > > > > data
> > > > > > > > > > > > > > are considered correct. This increases
> availability a
> > > > > bit.
> > > > > > > > > However,
> > > > > > > > > > > since
> > > > > > > > > > > > > > unclean leader elections are rare, this may be
> ok.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1002. RemoteStorageManager.
> > > > > > > > > > > > > > 1002.1 There seems to be some inconsistencies in
> > > > > > > > > > > RemoteStorageManager. We
> > > > > > > > > > > > > > pass in RemoteLogSegmentId copyLogSegment(). For
> all
> > > > > other
> > > > > > > > > methods,
> > > > > > > > > > > we
> > > > > > > > > > > > > pass
> > > > > > > > > > > > > > in RemoteLogSegmentMetadata.
> > > > > > > > > > > > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata
> > > > > inclusive or
> > > > > > > > > > > exclusive?
> > > > > > > > > > > > > > 1002.3 It seems that we need an api to get the
> > > > > leaderEpoch
> > > > > > > > > history
> > > > > > > > > > > for a
> > > > > > > > > > > > > > partition.
> > > > > > > > > > > > > > 1002.4 Could you define the type of
> > > > > RemoteLogSegmentContext?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1003 RemoteLogMetadataManager
> > > > > > > > > > > > > > 1003.1 I am not sure why we need both of the
> > > following
> > > > > methods
> > > > > > > > > > > > > > in RemoteLogMetadataManager. Could we combine
> them
> > > into
> > > > > one
> > > > > > > > that
> > > > > > > > > > > takes in
> > > > > > > > > > > > > > offset and returns RemoteLogSegmentMetadata?
> > > > > > > > > > > > > >     RemoteLogSegmentId
> > > > > getRemoteLogSegmentId(TopicPartition
> > > > > > > > > > > > > topicPartition,
> > > > > > > > > > > > > > long offset) throws IOException;
> > > > > > > > > > > > > >     RemoteLogSegmentMetadata
> > > > > > > > > > > > > getRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > > > > > > > > > > remoteLogSegmentId) throws IOException;
> > > > > > > > > > > > > > 1003.2 There seems to be some inconsistencies in
> the
> > > > > methods
> > > > > > > > > below.
> > > > > > > > > > > I am
> > > > > > > > > > > > > > not sure why one takes RemoteLogSegmentMetadata
> and
> > > the
> > > > > other
> > > > > > > > > > > > > > takes RemoteLogSegmentId.
> > > > > > > > > > > > > >     void
> > > putRemoteLogSegmentData(RemoteLogSegmentMetadata
> > > > > > > > > > > > > > remoteLogSegmentMetadata) throws IOException;
> > > > > > > > > > > > > >     void
> > > > > deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > > > > > > > > > > remoteLogSegmentId) throws IOException;
> > > > > > > > > > > > > > 1003.3 In void onServerStarted(final String
> > > > > serverEndpoint),
> > > > > > > > what
> > > > > > > > > > > > > > is serverEndpoint used for?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1004. It would be useful to document how all the
> new
> > > > > APIs are
> > > > > > > > > being
> > > > > > > > > > > used.
> > > > > > > > > > > > > > For example, when is
> > > > > RemoteLogSegmentMetadata.markedForDeletion
> > > > > > > > > > > being set
> > > > > > > > > > > > > > and used? How are
> > > > > > > > > > > > > >
> > > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset
> > > > > > > > being
> > > > > > > > > > > used?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1005. Handling partition deletion: The KIP says
> "RLMM
> > > > > will
> > > > > > > > > > eventually
> > > > > > > > > > > > > > delete these segments by using
> RemoteStorageManager."
> > > > > Which
> > > > > > > > > replica
> > > > > > > > > > > does
> > > > > > > > > > > > > > this logic?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1006. "If there are any failures in removing
> remote
> > > log
> > > > > > > > segments
> > > > > > > > > > then
> > > > > > > > > > > > > those
> > > > > > > > > > > > > > are stored in a specific topic (default as
> > > > > > > > > > > > > __remote_segments_to_be_deleted)
> > > > > > > > > > > > > > and user can consume the events(which contain
> > > > > > > > > > remote-log-segment-id)
> > > > > > > > > > > from
> > > > > > > > > > > > > > that topic and clean them up from remote
> storage.  "
> > > Not
> > > > > sure
> > > > > > > > if
> > > > > > > > > > it's
> > > > > > > > > > > > > worth
> > > > > > > > > > > > > > the complexity of adding another topic. Could we
> just
> > > > > retry?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1007. RemoteFetchPurgatory: Could we just reuse
> the
> > > > > existing
> > > > > > > > > > > > > > fetchPurgatory?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1008. Configurations:
> > > > > > > > > > > > > > 1008.1 remote.log.retention.ms,
> > > > > remote.log.retention.minutes,
> > > > > > > > > > > > > > remote.log.retention.hours: It seems that we just
> > > need
> > > > > the ms
> > > > > > > > > one.
> > > > > > > > > > > Also,
> > > > > > > > > > > > > > are we changing the meaning of existing config
> > > > > > > > log.retention.ms
> > > > > > > > > to
> > > > > > > > > > > mean
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > local retention? For backward compatibility, it's
> > > better
> > > > > to not
> > > > > > > > > > > change
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > meaning of existing configurations.
> > > > > > > > > > > > > > 1008.2 Should remote.log.storage.enable be at the
> > > topic
> > > > > level?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1009. It would be useful to list all limitations
> in a
> > > > > separate
> > > > > > > > > > > section:
> > > > > > > > > > > > > > compacted topic, JBOD, etc. Also, is changing a
> topic
> > > > > from
> > > > > > > > delete
> > > > > > > > > > to
> > > > > > > > > > > > > > compact and vice versa allowed when tiering is
> > > enabled?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1010. Thanks for performance numbers. Are those
> with
> > > > > RocksDB as
> > > > > > > > > the
> > > > > > > > > > > > > cache?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 15, 2020 at 6:12 PM Harsha Ch <
> > > > > harsha...@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > >                Thats not what we said in the
> > > previous
> > > > > email.
> > > > > > > > > RLMM
> > > > > > > > > > > is
> > > > > > > > > > > > > > > pluggable storage and by running numbers even
> 1PB
> > > data
> > > > > you do
> > > > > > > > > not
> > > > > > > > > > > need
> > > > > > > > > > > > > > more
> > > > > > > > > > > > > > > than 10GB local storage.

If in future this becomes a blocker for any users we can revisit, but this
does not warrant another implementation at this point to push the data to
remote storage. We can of course implement another RLMM that is optional
for users to configure to push to remote, but that doesn't need to be
addressed in this KIP.

Thanks,
Harsha

On Wed, Jul 15, 2020 at 5:50 PM Colin McCabe <cmcc...@apache.org> wrote:

Hi Ying,

Thanks for the response.

It sounds like you agree that storing the metadata in the remote storage
would be a better design overall. Given that that's true, is there any
reason to include the worse implementation based on RocksDB?

Choosing a long-term metadata store is not something that we should do
lightly. It can take users years to migrate from one metadata store to the
other. I also don't think it's realistic or desirable for users to write
their own metadata stores. Even assuming that they could do a good job at
this, it would create huge fragmentation in the Kafka ecosystem.

best,
Colin

On Tue, Jul 14, 2020, at 09:39, Ying Zheng wrote:

Hi Jun,
Hi Colin,

Satish and I are still discussing some details about how to handle
transactions / producer ids. Satish is going to make some minor changes to
the RLMM API and other parts. Other than that, we have finished updating
the KIP.

I agree with Colin that the current design of using RocksDB is not optimal.
But this design is simple and should work for almost all the existing Kafka
users. RLMM is a plugin. Users can replace RocksDB with their own RLMM
implementation, if needed. So, I think we can keep RocksDB for now. What do
you think?
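
To illustrate what "replace RocksDB with their own RLMM implementation"
could look like, here is a minimal sketch. The interface and method names
(RemoteLogMetadataManager, putRemoteLogSegmentData, RemoteLogSegmentMetadata)
follow the current KIP text, but the API is still being revised, so treat
this as a sketch rather than the final contract:

    // Sketch of a custom RLMM plugin that stores segment metadata in an
    // external database instead of the default RocksDB-backed cache.
    public class DbBackedRemoteLogMetadataManager implements RemoteLogMetadataManager {

        @Override
        public void putRemoteLogSegmentData(RemoteLogSegmentMetadata segmentMetadata) {
            // Persist the metadata of the copied segment in the external store.
        }

        // The remaining RLMM methods (lookups, deletes, leadership
        // notifications) are omitted from this sketch.
    }

    // The broker would pick it up via the proposed RLMM class-name config, e.g.
    // remote.log.metadata.manager.class.name=com.example.DbBackedRemoteLogMetadataManager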

Thanks,
Ying

On Tue, Jul 7, 2020 at 10:35 AM Jun Rao <j...@confluent.io> wrote:

Hi, Ying,

Thanks for the update. It's good to see the progress on this. Please let us
know when you are done updating the KIP wiki.

Jun

On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng <yi...@uber.com.invalid> wrote:

Hi Jun,

Satish and I have added more design details in the KIP, including how to
keep consistency between replicas (especially when there are leadership
changes / log truncations) and new metrics. We also made some other minor
changes in the doc. We will finish the KIP changes in the next couple of
days and will let you know when we are done. Most of the changes are
already updated in the wiki KIP, so you can take a look, but it's not the
final version yet.

As for the implementation, the code is mostly done and we already have some
feature tests / system tests. I have added the performance test results to
the KIP. However, the recent design changes (e.g. leader epoch info
management / log truncation / some of the new metrics) have not been
implemented yet. It will take about 2 weeks for us to implement them after
you review and agree with those design changes.

On Tue, Jul 7, 2020 at 9:23 AM Jun Rao <j...@confluent.io> wrote:

Hi, Satish, Harsha,

Any new updates on the KIP? This feature is one of the most important and
most requested features in Apache Kafka right now. It would be helpful if
we can make sustained progress on this. Could you share how far along the
design/implementation is right now? Is there anything other people can help
with to get it across the line?

As for "transactional support" and "follower requests/replication", no
further comments from me as long as the producer state and leader epoch can
be restored properly from the object store when needed.

Thanks,

Jun

On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <satish.dugg...@gmail.com>
wrote:

We did not want to add many implementation details in the KIP. But we
decided to add them in the KIP as an appendix or sub-sections (including
the follower fetch protocol) to describe the flow for the main cases. That
will answer most of the queries. I will update this mail thread when the
respective sections are updated.

Thanks,
Satish.

On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
<alexandre.dupr...@gmail.com> wrote:

Hi Satish,

A couple of questions specific to the section "Follower
Requests/Replication", pages 16-17 in the design document [1].

900. It is mentioned that followers fetch auxiliary states from the remote
storage.

900.a Does the consistency model of the external storage impact reads of
leader epochs and other auxiliary data?

900.b What are the benefits of using a mechanism to store and access the
leader epochs which is different from the one used for other metadata
associated with tiered segments? What are the benefits of retrieving this
information on demand from the follower rather than relying on propagation
via the topic __remote_log_metadata? What are the advantages over using a
dedicated control structure (e.g. a new record type) propagated via this
topic? Since different control paths operate in the system described in the
document, how are the metadata stored in __remote_log_metadata [which also
include the epoch of the leader which offloaded a segment] and the remote
auxiliary states kept in sync?

900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is this
in response to a Fetch or OffsetForLeaderEpoch request?

900.d What happens if, after a follower encountered an
OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve leader
epochs fail (for instance, because the remote storage is temporarily
unavailable)? Does the follower fall back to a mode where it ignores tiered
segments and applies truncation using only locally available information?
What happens when access to the remote storage is restored? How is the
replica lineage inferred from the remote leader epochs reconciled with the
follower's replica lineage, which has evolved? Does the follower remember
that fetching auxiliary states failed in the past and attempt
reconciliation? Is there a plan to offer different strategies in this
scenario, selectable via configuration?

900.e Is the leader epoch cache offloaded with every segment, or when a new
checkpoint is detected? If, to avoid duplicating data, that information is
not always offloaded, how does the remote storage satisfy the request to
retrieve it?

900.f Since the leader epoch cache covers the entire replica lineage, what
happens if, after a leader epoch cache file is offloaded with a given
segment, the local epoch cache is truncated [not necessarily for a range of
offsets included in tiered segments]? How are the remote and local leader
epoch caches kept consistent?

900.g Consumers can also use leader epochs (e.g. to enable fencing to
protect against stale leaders). What differences would there be between
consumer and follower fetches? Especially, would consumers also fetch
leader epoch information from the remote storage?

900.h Assume a newly elected leader of a topic-partition detects that more
recent segments are available in the external storage, with epochs greater
than its local epoch. Does it ignore these segments and their associated
epoch-to-offset vectors, or does it try to reconstruct its local replica
lineage based on the data remotely available?
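
To make 900.c and 900.d more concrete, here is a rough sketch of the
follower-side flow I am picturing. The error code is the one proposed in
the KIP; every method name below is invented purely for illustration and
does not correspond to existing broker code:

    // Sketch only: a follower reacting to the proposed
    // OFFSET_MOVED_TO_TIERED_STORAGE error. Helper names are hypothetical.
    void handleFetchError(TopicPartition tp, Errors error) {
        if (error == Errors.OFFSET_MOVED_TO_TIERED_STORAGE) {
            // 1. Find the earliest offset still present in the leader's local log.
            long leaderLocalLogStartOffset = fetchLeaderLocalLogStartOffset(tp);
            // 2. Rebuild auxiliary state (leader epoch entries, producer
            //    snapshot) for the tiered range from the remote storage.
            rebuildAuxiliaryStateFromRemote(tp, leaderLocalLogStartOffset);
            // 3. Resume regular fetching from the leader at that offset; if
            //    step 2 fails (remote storage unavailable), 900.d asks what
            //    the fallback should be.
            resumeFetchingFrom(tp, leaderLocalLogStartOffset);
        }
    }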

Thanks,
Alexandre

[1]
https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing

On Thu, Jun 4, 2020 at 7:55 PM Satish Duggana <satish.dugg...@gmail.com>
wrote:

Hi Jun,
Please let us know if you have any comments on "transactional support" and
"follower requests/replication" mentioned in the wiki.

Thanks,
Satish.

On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana <satish.dugg...@gmail.com>
wrote:

Thanks Jun for your comments.

> 100. It would be useful to provide more details on how those apis are
> used. Otherwise, it's kind of hard to really assess whether the new apis
> are sufficient/redundant. A few examples below.

We will update the wiki and let you know.

> 100.1 deleteRecords seems to only advance the logStartOffset in Log. How
> does that trigger the deletion of remote log segments?

The RLMTask for a leader partition periodically checks whether there are
remote log segments earlier than logStartOffset, and the respective remote
log segment metadata and data are deleted by using RLMM and RSM.
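
A rough sketch of that periodic check, just to illustrate the flow; the
listing and deletion method names below are placeholders and not the final
RLMM/RSM API:

    // Sketch only: called periodically by the RLMTask for a leader partition.
    void cleanupRemoteSegmentsBefore(TopicPartition tp, long logStartOffset) {
        for (RemoteLogSegmentMetadata segment : rlmm.listRemoteLogSegments(tp)) {
            if (segment.endOffset() < logStartOffset) {
                // Delete the segment data from remote storage via the RSM,
                // then remove its metadata via the RLMM.
                rsm.deleteLogSegment(segment);
                rlmm.deleteRemoteLogSegmentMetadata(segment);
            }
        }
    }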

> 100.2 stopReplica with deletion is used in 2 cases (a) replica
> reassignment; (b) topic deletion. We only want to delete the tiered
> metadata in the second case. Also, in the second case, who