1001.

We did consider this approach. The concerns are:
1) This makes unclean leader election rely on remote storage. If the remote
storage is unavailable, Kafka will not be able to finish the unclean leader
election.
2) Since the user has set a local retention time (or local retention bytes), I
think we are expected to keep that much local data when possible (i.e. avoid
truncating all of the local data). But, as you said, unclean leader elections
are very rare, so this may not be a big problem.

The current design uses the leader broker as the source of truth. This is
consistent with the existing Kafka behavior.

By using remote storage as the source of truth, the follower logic can be a
little simpler, but the leader logic is going to be more complex. Overall, I
don't see many benefits in using remote storage as the source of truth.



On Tue, Jul 28, 2020 at 10:25 AM Jun Rao <j...@confluent.io> wrote:

> Hi, Satish,
>
> Thanks for the reply.
>
> 1001. In your example, I was thinking that you could just download the
> latest leader epoch from the object store. After that you know the leader
> should end with offset 1100. The leader will delete all its local data
> before offset 1000 and start accepting new messages at offset 1100.
> Consumer requests for messages before offset 1100 will be served from the
> object store. The benefit with this approach is that it's simpler to reason
> about who is the source of truth. The downside is slightly  increased
> unavailability window during unclean leader election. Since unclean leader
> elections are rare, I am not sure if this is a big concern.
>
> 1008. Yes, I think introducing sth like local.retention.ms seems more
> consistent.
>
> Jun
>
> On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana <satish.dugg...@gmail.com>
> wrote:
>
> > Hi Jun,
> > Thanks for your comments. We put our inline replies below.
> >
> > 1001. I was thinking that you could just use the tiered metadata to do
> the
> > reconciliation. The tiered metadata contains offset ranges and epoch
> > history. Those should be enough for reconciliation purposes.
> >
> > If we use remote storage as the source-of-truth during
> > unclean-leader-election, it's possible that after reconciliation the
> > remote storage will have more recent data than the new leader's local
> > storage. For example, the new leader's latest message is offset 1000,
> > while the remote storage has message 1100. In such a case, the new
> > leader will have to download the messages from 1001 to 1100, before
> > accepting new messages from producers. Otherwise, there would be a gap
> > in the local data between 1000 and 1101.
> >
> > Moreover, with the current design, leader epoch history is stored in
> > remote storage, rather than the metadata topic. We did consider saving
> > epoch history in remote segment metadata. But the concern is that
> > there is currently no limit for the epoch history size. Theoretically,
> > if a user has a very long remote retention time and there are very
> > frequent leadership changes, the leader epoch history can become too
> > long to fit into a regular Kafka message.
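> >
> > (For a rough sense of scale, and only as an estimate: each entry in the
> > leader epoch history is essentially an epoch number plus a start offset,
> > i.e. on the order of 12 bytes, so the history would only approach the
> > default ~1 MB message size limit after tens of thousands of leadership
> > changes. That is unlikely in practice, but the size is not bounded.)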
> >
> >
> > 1003.3 Having just a serverEndpoint string is probably not enough.
> > Connecting to a Kafka cluster may need various security credentials. We
> can
> > make RLMM configurable and pass in the properties through the configure()
> > method. Ditto for RSM.
> >
> > RLMM and RSM are already configurable, and they take properties which
> > start with "remote.log.metadata." and "remote.log.storage." respectively,
> > plus a few others. We have listener-name as the config for RLMM, and
> > other properties (like security) can be passed in as you suggested. We
> > will update the KIP with the details.
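> >
> > For example, the broker configuration could look roughly like the
> > following. The property names below are illustrative only (apart from the
> > "remote.log.metadata." / "remote.log.storage." prefixes mentioned above);
> > the exact names will be listed in the KIP:
> >
> >     # plugin implementations (hypothetical class names)
> >     remote.log.metadata.manager.class.name=org.example.MyRlmm
> >     remote.log.storage.manager.class.name=org.example.MyRsm
> >     # listener used by the default RLMM to reach the local cluster
> >     remote.log.metadata.listener.name=INTERNAL
> >     # any other "remote.log.metadata."-prefixed properties (e.g. security)
> >     # are passed to RLMM.configure(); likewise "remote.log.storage." for RSM
> >     remote.log.metadata.security.protocol=SASL_SSL
> >     remote.log.storage.s3.bucket=my-kafka-tier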
> >
> >
> > 1008.1 We started with log.retention.hours and log.retention.minutes, and
> > added log.retention.ms later. If we are adding a new configuration, ms
> > level config alone is enough and is simpler. We can build tools to make
> the
> > configuration at different granularities easier. The definition of
> > log.retention.ms is "The number of milliseconds to keep a log file
> before
> > deleting it". The deletion is independent of whether tiering is enabled
> or
> > not. If this changes to just the local portion of the data, we are
> changing
> > the meaning of an existing configuration.
> >
> > We are fine with either way. We can go with log.retention.xxxx as the
> > effective (total) log retention instead of local log retention. With this
> > convention, we need to introduce local.log.retention instead of the
> > remote.log.retention.ms that we proposed. If log.retention.ms is -1, then
> > remote retention is also considered unlimited, but the user should still
> > be able to set local.log.retention.ms.
> > So, we need to introduce local.log.retention.ms and
> > local.log.retention.bytes, which should always be <=
> > log.retention.ms/bytes respectively.
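> >
> > To illustrate the proposed convention (the values are only examples and
> > the final property names are still subject to change in the KIP):
> >
> >     # total retention, local + remote (existing semantics, unchanged)
> >     log.retention.ms=604800000            # 7 days
> >     log.retention.bytes=53687091200       # 50 GB
> >     # portion kept on the broker's local disk
> >     local.log.retention.ms=86400000       # 1 day,  <= log.retention.ms
> >     local.log.retention.bytes=5368709120  # 5 GB,   <= log.retention.bytes
> >
> > Data older than the local retention would then be served only from remote
> > storage, and anything beyond log.retention.ms/bytes is deleted everywhere.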
> >
> >
> >
> > On Fri, Jul 24, 2020 at 3:37 AM Jun Rao <j...@confluent.io> wrote:
> > >
> > > Hi, Satish,
> > >
> > > Thanks for the reply. A few quick comments below.
> > >
> > > 1001. I was thinking that you could just use the tiered metadata to do
> > the
> > > reconciliation. The tiered metadata contains offset ranges and epoch
> > > history. Those should be enough for reconciliation purposes.
> > >
> > > 1003.3 Having just a serverEndpoint string is probably not enough.
> > > Connecting to a Kafka cluster may need various security credentials. We
> > can
> > > make RLMM configurable and pass in the properties through the
> configure()
> > > method. Ditto for RSM.
> > >
> > > 1008.1 We started with log.retention.hours and log.retention.minutes,
> and
> > > added log.retention.ms later. If we are adding a new configuration, ms
> > > level config alone is enough and is simpler. We can build tools to make
> > the
> > > configuration at different granularities easier. The definition of
> > > log.retention.ms is "The number of milliseconds to keep a log file
> > before
> > > deleting it". The deletion is independent of whether tiering is enabled
> > or
> > > not. If this changes to just the local portion of the data, we are
> > changing
> > > the meaning of an existing configuration.
> > >
> > > Jun
> > >
> > >
> > > On Thu, Jul 23, 2020 at 11:04 AM Satish Duggana <
> > satish.dugg...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thank you for the comments! Ying, Harsha and I discussed and put our
> > > > comments below.
> > > >
> > > >
> > > > 1001. The KIP described a few scenarios of unclean leader elections.
> > This
> > > > is very useful, but I am wondering if this is the best approach. My
> > > > understanding of the proposed approach is to allow the new (unclean)
> > leader
> > > > to take new messages immediately. While this increases availability,
> it
> > > > creates the problem that there could be multiple conflicting segments
> > in
> > > > the remote store for the same offset range. This seems to make it
> > harder
> > > > for RLMM to determine which archived log segments contain the correct
> > data.
> > > > For example, an archived log segment could at one time be the correct
> > data,
> > > > but be changed to incorrect data after an unclean leader election. An
> > > > alternative approach is to let the unclean leader use the archived
> > data as
> > > > the source of truth. So, when the new (unclean) leader takes over, it
> > first
> > > > reconciles the local data based on the archived data before taking
> new
> > > > messages. This makes the job of RLMM a bit easier since all archived
> > data
> > > > are considered correct. This increases unavailability a bit. However,
> > since
> > > > unclean leader elections are rare, this may be ok.
> > > >
> > > > Firstly, we don't want to assume the remote storage is more reliable
> > than
> > > > Kafka. Kafka unclean leader election usually happens when there is a
> > large
> > > > scale outage that impacts multiple racks (or even multiple
> availability
> > > > zones). In such a case, the remote storage may be unavailable or
> > unstable.
> > > > Pulling a large amount of data from the remote storage to reconcile
> the
> > > > local data may also exacerbate the outage. With the current design,
> > the new
> > > > leader can start working even when the remote storage is temporarily
> > > > unavailable.
> > > >
> > > > Secondly, it is not easier to implement the reconciling logic at the
> > leader
> > > > side. It can take a long time for the new leader to download the
> remote
> > > > data and rebuild local producer id / leader epoch information. During
> > this
> > > > period, the leader cannot accept any requests from the clients and
> > > > followers. We have to introduce a new state for the leader, and a new
> > error
> > > > code to let the clients / followers know what is happening.
> > > >
> > > >
> > > >
> > > > 1002. RemoteStorageManager.
> > > > 1002.1 There seems to be some inconsistencies in
> RemoteStorageManager.
> > We
> > > > pass in RemoteLogSegmentId copyLogSegment(). For all other methods,
> we
> > pass
> > > > in RemoteLogSegmentMetadata.
> > > >
> > > > Nice catch, we can have the RemoteLogSegmentMetadata for
> copyLogSegment
> > > > too.
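> > > >
> > > > i.e. roughly something like the following (illustrative only; the
> > > > LogSegmentData type is the one already used in the KIP, and the final
> > > > signature will be in the updated KIP):
> > > >
> > > >     void copyLogSegment(RemoteLogSegmentMetadata remoteLogSegmentMetadata,
> > > > LogSegmentData logSegmentData) throws IOException;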
> > > >
> > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata inclusive or
> exclusive?
> > > >
> > > > It is inclusive.
> > > >
> > > > 1002.3 It seems that we need an api to get the leaderEpoch history
> for
> > a
> > > > partition.
> > > >
> > > > Yes, updated the KIP with the new method.
> > > >
> > > >
> > > > 1002.4 Could you define the type of RemoteLogSegmentContext?
> > > >
> > > > This is removed in the latest code and it is not needed.
> > > >
> > > >
> > > > 1003 RemoteLogMetadataManager
> > > >
> > > > 1003.1 I am not sure why we need both of the following methods
> > > > in RemoteLogMetadataManager. Could we combine them into one that
> takes
> > in
> > > > offset and returns RemoteLogSegmentMetadata?
> > > >     RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition
> > topicPartition,
> > > > long offset) throws IOException;
> > > >     RemoteLogSegmentMetadata
> > getRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > remoteLogSegmentId) throws IOException;
> > > >
> > > > Good point, these can be merged for now. I guess we needed them in an
> > > > earlier version of the implementation, but they are not needed now.
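> > > >
> > > > The merged lookup could then look roughly like this (a sketch of the
> > > > idea, not the final interface; the types and the IOException are taken
> > > > from the signatures quoted above):
> > > >
> > > >     RemoteLogSegmentMetadata getRemoteLogSegmentMetadata(TopicPartition
> > > > topicPartition, long offset) throws IOException;
> > > >
> > > > i.e. a single call that finds the segment containing the given offset and
> > > > returns its metadata, instead of first resolving a RemoteLogSegmentId.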
> > > >
> > > > 1003.2 There seems to be some inconsistencies in the methods below. I
> > am
> > > > not sure why one takes RemoteLogSegmentMetadata and the other
> > > > takes RemoteLogSegmentId.
> > > >     void putRemoteLogSegmentData(RemoteLogSegmentMetadata
> > > > remoteLogSegmentMetadata) throws IOException;
> > > >     void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > remoteLogSegmentId) throws IOException;
> > > >
> > > > RLMM stores RemoteLogSegmentMetadata which is identified by
> > > > RemoteLogsSegmentId. So, when it is added it takes
> > > > RemoteLogSegmentMetadata. `delete` operation needs only
> > RemoteLogsSegmentId
> > > > as RemoteLogSegmentMetadata can be identified with
> RemoteLogsSegmentId.
> > > >
> > > > 1003.3 In void onServerStarted(final String serverEndpoint), what
> > > > is serverEndpoint used for?
> > > >
> > > > This can be used by the RLMM implementation to connect to the local Kafka
> > > > cluster. In case of the default implementation, it is used in initializing
> > > > Kafka clients connecting to the local cluster.
> > > >
> > > > 1004. It would be useful to document how all the new APIs are being
> > used.
> > > > For example, when is RemoteLogSegmentMetadata.markedForDeletion being
> > set
> > > > and used? How are
> > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being
> used?
> > > >
> > > > RLMM APIs are going through the changes and they should be ready in a
> > few
> > > > days. I will update the KIP and the mail  thread once they are ready.
> > > >
> > > > 1005. Handling partition deletion: The KIP says "RLMM will eventually
> > > > delete these segments by using RemoteStorageManager." Which replica
> > does
> > > > this logic?
> > > >
> > > > This is a good point. When a topic is deleted, it will not have any
> > > > leader/followers to do the cleanup. We will have a cleaner agent on a
> > > > single broker in the cluster to do this cleanup; we plan to add that in
> > > > the controller broker.
> > > >
> > > > 1006. "If there are any failures in removing remote log segments then
> > those
> > > > are stored in a specific topic (default as
> > __remote_segments_to_be_deleted)
> > > > and user can consume the events(which contain remote-log-segment-id)
> > from
> > > > that topic and clean them up from remote storage.  " Not sure if it's
> > worth
> > > > the complexity of adding another topic. Could we just retry?
> > > >
> > > > Sure, we can keep this simpler for now by logging an error after retries.
> > > > We can give users a better way to process this in the future. One way can
> > > > be a dead-letter topic which can be configured by the user.
> > > >
> > > > 1007. RemoteFetchPurgatory: Could we just reuse the existing
> > > > fetchPurgatory?
> > > >
> > > > We have 2 types of delayed operations waiting for 2 different events.
> > > > DelayedFetch waits for new messages from producers.
> DelayedRemoteFetch
> > > > waits for the remote-storage-read-task to finish. When either of the
> 2
> > > > events happens, we only want to notify one type of the delayed
> > operations.
> > > > It would be inefficient to put 2 types of delayed operations in one
> > > > purgatory, as the tryComplete() methods of the delayed operations can
> > be
> > > > triggered by irrelevant events.
> > > >
> > > >
> > > > 1008. Configurations:
> > > > 1008.1 remote.log.retention.ms, remote.log.retention.minutes,
> > > > remote.log.retention.hours: It seems that we just need the ms one.
> > Also,
> > > > are we changing the meaning of existing config log.retention.ms to
> > mean
> > > > the
> > > > local retention? For backward compatibility, it's better to not
> change
> > the
> > > > meaning of existing configurations.
> > > >
> > > > We agree that we only need remote.log.retention.ms. But, the
> existing
> > > > Kafka
> > > > configuration
> > > > has 3 properties (log.retention.ms, log.retention.minutes,
> > > > log.retention.hours). We just
> > > > want to keep consistent with the existing properties.
> > > > Existing log.retention.xxxx config is about log retention in broker’s
> > > > storage which is local. It should be easy for users to configure
> > partition
> > > > storage with local retention and remote retention based on their
> usage.
> > > >
> > > > 1008.2 Should remote.log.storage.enable be at the topic level?
> > > >
> > > > We can introduce topic-level configs for the same remote.log settings. The
> > > > user can set the desired config while creating the topic. The
> > > > remote.log.storage.enable property is not allowed to be updated after the
> > > > topic is created. Other remote.log.* properties can be modified. We will
> > > > support flipping remote.log.storage.enable in future versions.
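> > > >
> > > > For example, enabling tiering at topic creation time could look roughly
> > > > like this (illustrative only; the topic-level config names follow the
> > > > broker-level names discussed above and may change in the final KIP):
> > > >
> > > >     bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
> > > >       --topic payments --partitions 12 --replication-factor 3 \
> > > >       --config remote.log.storage.enable=true \
> > > >       --config retention.ms=604800000 \
> > > >       --config local.log.retention.ms=86400000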
> > > >
> > > > 1009. It would be useful to list all limitations in a separate
> section:
> > > > compacted topic, JBOD, etc. Also, is changing a topic from delete to
> > > > compact and vice versa allowed when tiering is enabled?
> > > >
> > > > +1 to have limitations in a separate section. We will update the KIP
> > with
> > > > that.
> > > > A topic created with the effective value of remote.log.storage.enable as
> > > > true cannot change its retention policy from delete to compact.
> > > >
> > > > 1010. Thanks for performance numbers. Are those with RocksDB as the
> > cache?
> > > >
> > > > No, we have not yet added RocksDB support. This is based on an in-memory
> > > > map representation. We will add that support and update this thread after
> > > > updating the KIP with the numbers.
> > > >
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > >
> > > > On Tue, Jul 21, 2020 at 6:49 AM Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Satish, Ying, Harsha,
> > > > >
> > > > > Thanks for the updated KIP. A few more comments below.
> > > > >
> > > > > 1000. Regarding Colin's question on querying the metadata directly
> > in the
> > > > > remote block store. One issue is that not all block stores offer
> the
> > > > needed
> > > > > api to query the metadata. For example, S3 only offers an api to
> list
> > > > > objects under a prefix and this api has the eventual consistency
> > > > semantic.
> > > > >
> > > > > 1001. The KIP described a few scenarios of unclean leader
> elections.
> > This
> > > > > is very useful, but I am wondering if this is the best approach. My
> > > > > understanding of the proposed approach is to allow the new
> (unclean)
> > > > leader
> > > > > to take new messages immediately. While this increases
> availability,
> > it
> > > > > creates the problem that there could be multiple conflicting
> > segments in
> > > > > the remote store for the same offset range. This seems to make it
> > harder
> > > > > for RLMM to determine which archived log segments contain the
> correct
> > > > data.
> > > > > For example, an archived log segment could at one time be the
> correct
> > > > data,
> > > > > but be changed to incorrect data after an unclean leader election.
> An
> > > > > alternative approach is to let the unclean leader use the archived
> > data
> > > > as
> > > > > the source of truth. So, when the new (unclean) leader takes over,
> it
> > > > first
> > > > > reconciles the local data based on the archived data before taking
> > new
> > > > > messages. This makes the job of RLMM a bit easier since all
> archived
> > data
> > > > > are considered correct. This increases unavailability a bit. However,
> > since
> > > > > unclean leader elections are rare, this may be ok.
> > > > >
> > > > > 1002. RemoteStorageManager.
> > > > > 1002.1 There seems to be some inconsistencies in
> > RemoteStorageManager. We
> > > > > pass in RemoteLogSegmentId copyLogSegment(). For all other methods,
> > we
> > > > pass
> > > > > in RemoteLogSegmentMetadata.
> > > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata inclusive or
> > exclusive?
> > > > > 1002.3 It seems that we need an api to get the leaderEpoch history
> > for a
> > > > > partition.
> > > > > 1002.4 Could you define the type of RemoteLogSegmentContext?
> > > > >
> > > > > 1003 RemoteLogMetadataManager
> > > > > 1003.1 I am not sure why we need both of the following methods
> > > > > in RemoteLogMetadataManager. Could we combine them into one that
> > takes in
> > > > > offset and returns RemoteLogSegmentMetadata?
> > > > >     RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition
> > > > topicPartition,
> > > > > long offset) throws IOException;
> > > > >     RemoteLogSegmentMetadata
> > > > getRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > remoteLogSegmentId) throws IOException;
> > > > > 1003.2 There seems to be some inconsistencies in the methods below.
> > I am
> > > > > not sure why one takes RemoteLogSegmentMetadata and the other
> > > > > takes RemoteLogSegmentId.
> > > > >     void putRemoteLogSegmentData(RemoteLogSegmentMetadata
> > > > > remoteLogSegmentMetadata) throws IOException;
> > > > >     void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
> > > > > remoteLogSegmentId) throws IOException;
> > > > > 1003.3 In void onServerStarted(final String serverEndpoint), what
> > > > > is serverEndpoint used for?
> > > > >
> > > > > 1004. It would be useful to document how all the new APIs are being
> > used.
> > > > > For example, when is RemoteLogSegmentMetadata.markedForDeletion
> > being set
> > > > > and used? How are
> > > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being
> > used?
> > > > >
> > > > > 1005. Handling partition deletion: The KIP says "RLMM will
> eventually
> > > > > delete these segments by using RemoteStorageManager." Which replica
> > does
> > > > > this logic?
> > > > >
> > > > > 1006. "If there are any failures in removing remote log segments
> then
> > > > those
> > > > > are stored in a specific topic (default as
> > > > __remote_segments_to_be_deleted)
> > > > > and user can consume the events(which contain
> remote-log-segment-id)
> > from
> > > > > that topic and clean them up from remote storage.  " Not sure if
> it's
> > > > worth
> > > > > the complexity of adding another topic. Could we just retry?
> > > > >
> > > > > 1007. RemoteFetchPurgatory: Could we just reuse the existing
> > > > > fetchPurgatory?
> > > > >
> > > > > 1008. Configurations:
> > > > > 1008.1 remote.log.retention.ms, remote.log.retention.minutes,
> > > > > remote.log.retention.hours: It seems that we just need the ms one.
> > Also,
> > > > > are we changing the meaning of existing config log.retention.ms to
> > mean
> > > > > the
> > > > > local retention? For backward compatibility, it's better to not
> > change
> > > > the
> > > > > meaning of existing configurations.
> > > > > 1008.2 Should remote.log.storage.enable be at the topic level?
> > > > >
> > > > > 1009. It would be useful to list all limitations in a separate
> > section:
> > > > > compacted topic, JBOD, etc. Also, is changing a topic from delete
> to
> > > > > compact and vice versa allowed when tiering is enabled?
> > > > >
> > > > > 1010. Thanks for performance numbers. Are those with RocksDB as the
> > > > cache?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Wed, Jul 15, 2020 at 6:12 PM Harsha Ch <harsha...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Colin,
> > > > > >                That's not what we said in the previous email. RLMM is
> > > > > > pluggable storage, and by running the numbers, even with 1PB of data
> > > > > > you do not need more than 10GB of local storage.
> > > > > > If in future this becomes a blocker for any users we can revisit
> > but
> > > > this
> > > > > > does not warrant another implementation at this point to push the
> > data
> > > > to
> > > > > > remote storage.
> > > > > > We can of course implement another RLMM that is optional for users
> > to
> > > > > > configure to push to remote. But that doesn't need to be
> addressed
> > in
> > > > > this
> > > > > > KIP.
> > > > > >
> > > > > > Thanks,
> > > > > > Harsha
> > > > > >
> > > > > > On Wed, Jul 15, 2020 at 5:50 PM Colin McCabe <cmcc...@apache.org
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Ying,
> > > > > > >
> > > > > > > Thanks for the response.
> > > > > > >
> > > > > > > It sounds like you agree that storing the metadata in the
> remote
> > > > > storage
> > > > > > > would be a better design overall.  Given that that's true, is
> > there
> > > > any
> > > > > > > reason to include the worse implementation based on RocksDB?
> > > > > > >
> > > > > > > Choosing a long-term metadata store is not something that we
> > should
> > > > do
> > > > > > > lightly.  It can take users years to migrate from metadata
> store
> > to
> > > > the
> > > > > > > other.  I also don't think it's realistic or desirable for
> users
> > to
> > > > > write
> > > > > > > their own metadata stores.  Even assuming that they could do a
> > good
> > > > job
> > > > > > at
> > > > > > > this, it would create huge fragmentation in the Kafka
> ecosystem.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jul 14, 2020, at 09:39, Ying Zheng wrote:
> > > > > > > > Hi Jun,
> > > > > > > > Hi Colin,
> > > > > > > >
> > > > > > > > Satish and I are still discussing some details about how to
> > handle
> > > > > > > > transactions / producer ids. Satish is going to make some
> minor
> > > > > changes
> > > > > > > to
> > > > > > > > RLMM API and other parts. Other than that, we have finished
> > > > updating
> > > > > > the
> > > > > > > KIP
> > > > > > > >
> > > > > > > > I agree with Colin that the current design of using rocksDB
> is
> > not
> > > > > > > > optimal. But this design is simple and should work for almost
> > all
> > > > the
> > > > > > > > existing Kafka users. RLMM is a plugin. Users can replace
> > rocksDB
> > > > > with
> > > > > > > > their own RLMM implementation, if needed. So, I think we can
> > keep
> > > > > > rocksDB
> > > > > > > > for now. What do you think?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Ying
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Jul 7, 2020 at 10:35 AM Jun Rao <j...@confluent.io>
> > wrote:
> > > > > > > >
> > > > > > > > > Hi, Ying,
> > > > > > > > >
> > > > > > > > > Thanks for the update. It's good to see the progress on
> this.
> > > > > Please
> > > > > > > let us
> > > > > > > > > know when you are done updating the KIP wiki.
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng
> > > > <yi...@uber.com.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jun,
> > > > > > > > > >
> > > > > > > > > > Satish and I have added more design details in the KIP,
> > > > including
> > > > > > > how to
> > > > > > > > > > keep consistency between replicas (especially when there
> is
> > > > > > > leadership
> > > > > > > > > > changes / log truncations) and new metrics. We also made
> > some
> > > > > other
> > > > > > > minor
> > > > > > > > > > changes in the doc. We will finish the KIP changes in the
> > next
> > > > > > > couple of
> > > > > > > > > > days. We will let you know when we are done. Most of the
> > > > changes
> > > > > > are
> > > > > > > > > > already updated to the wiki KIP. You can take a look. But
> > it's
> > > > > not
> > > > > > > the
> > > > > > > > > > final version yet.
> > > > > > > > > >
> > > > > > > > > > As for the implementation, the code is mostly done and we
> > > > already
> > > > > > had
> > > > > > > > > some
> > > > > > > > > > feature tests / system tests. I have added the
> performance
> > test
> > > > > > > results
> > > > > > > > > in
> > > > > > > > > > the KIP. However the recent design changes (e.g. leader
> > epoch
> > > > > info
> > > > > > > > > > management / log truncation / some of the new metrics)
> > have not
> > > > > > been
> > > > > > > > > > implemented yet. It will take about 2 weeks for us to
> > implement
> > > > > > > after you
> > > > > > > > > > review and agree with those design changes.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 7, 2020 at 9:23 AM Jun Rao <j...@confluent.io
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Satish, Harsha,
> > > > > > > > > > >
> > > > > > > > > > > Any new updates on the KIP? This feature is one of the
> > most
> > > > > > > important
> > > > > > > > > and
> > > > > > > > > > > most requested features in Apache Kafka right now. It
> > would
> > > > be
> > > > > > > helpful
> > > > > > > > > if
> > > > > > > > > > > we can make sustained progress on this. Could you share
> > how
> > > > far
> > > > > > > along
> > > > > > > > > is
> > > > > > > > > > > the design/implementation right now? Is there anything
> > that
> > > > > other
> > > > > > > > > people
> > > > > > > > > > > can help to get it across the line?
> > > > > > > > > > >
> > > > > > > > > > > As for "transactional support" and "follower
> > > > > > > requests/replication", no
> > > > > > > > > > > further comments from me as long as the producer state
> > and
> > > > > leader
> > > > > > > epoch
> > > > > > > > > > can
> > > > > > > > > > > be restored properly from the object store when needed.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <
> > > > > > > > > satish.dugg...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > We did not want to add many implementation details in
> > the
> > > > > KIP.
> > > > > > > But we
> > > > > > > > > > > > decided to add them in the KIP as appendix or
> > > > > > > sub-sections(including
> > > > > > > > > > > > follower fetch protocol) to describe the flow with
> the
> > main
> > > > > > > cases.
> > > > > > > > > > > > That will answer most of the queries. I will update
> on
> > this
> > > > > > mail
> > > > > > > > > > > > thread when the respective sections are updated.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Satish.
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> > > > > > > > > > > > <alexandre.dupr...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Satish,
> > > > > > > > > > > > >
> > > > > > > > > > > > > A couple of questions specific to the section
> > "Follower
> > > > > > > > > > > > > Requests/Replication", pages 16:17 in the design
> > document
> > > > > > [1].
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900. It is mentioned that followers fetch auxiliary
> > > > states
> > > > > > > from the
> > > > > > > > > > > > > remote storage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.a Does the consistency model of the external
> > storage
> > > > > > > impacts
> > > > > > > > > > reads
> > > > > > > > > > > > > of leader epochs and other auxiliary data?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.b What are the benefits of using a mechanism to
> > store
> > > > > and
> > > > > > > > > access
> > > > > > > > > > > > > the leader epochs which is different from other
> > metadata
> > > > > > > associated
> > > > > > > > > > to
> > > > > > > > > > > > > tiered segments? What are the benefits of
> retrieving
> > this
> > > > > > > > > information
> > > > > > > > > > > > > on-demand from the follower rather than relying on
> > > > > > propagation
> > > > > > > via
> > > > > > > > > > the
> > > > > > > > > > > > > topic __remote_log_metadata? What are the
> advantages
> > over
> > > > > > > using a
> > > > > > > > > > > > > dedicated control structure (e.g. a new record
> type)
> > > > > > > propagated via
> > > > > > > > > > > > > this topic? Since in the document, different
> control
> > > > paths
> > > > > > are
> > > > > > > > > > > > > operating in the system, how are the metadata
> stored
> > in
> > > > > > > > > > > > > __remote_log_metadata [which also include the epoch
> > of
> > > > the
> > > > > > > leader
> > > > > > > > > > > > > which offloaded a segment] and the remote auxiliary
> > > > states,
> > > > > > > kept in
> > > > > > > > > > > > > sync?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.c A follower can encounter an
> > > > > > > OFFSET_MOVED_TO_TIERED_STORAGE.
> > > > > > > > > Is
> > > > > > > > > > > > > this in response to a Fetch or OffsetForLeaderEpoch
> > > > > request?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.d What happens if, after a follower encountered
> > an
> > > > > > > > > > > > > OFFSET_MOVED_TO_TIERED_STORAGE response, its
> > attempts to
> > > > > > > retrieve
> > > > > > > > > > > > > leader epochs fail (for instance, because the
> remote
> > > > > storage
> > > > > > is
> > > > > > > > > > > > > temporarily unavailable)? Does the follower
> > fallbacks to
> > > > a
> > > > > > mode
> > > > > > > > > where
> > > > > > > > > > > > > it ignores tiered segments, and applies truncation
> > using
> > > > > only
> > > > > > > > > locally
> > > > > > > > > > > > > available information? What happens when access to
> > the
> > > > > remote
> > > > > > > > > storage
> > > > > > > > > > > > > is restored? How is the replica lineage inferred by
> > the
> > > > > > remote
> > > > > > > > > leader
> > > > > > > > > > > > > epochs reconciled with the follower's replica
> > lineage,
> > > > > which
> > > > > > > has
> > > > > > > > > > > > > evolved? Does the follower remember fetching
> > auxiliary
> > > > > states
> > > > > > > > > failed
> > > > > > > > > > > > > in the past and attempt reconciliation? Is there a
> > plan
> > > > to
> > > > > > > offer
> > > > > > > > > > > > > different strategies in this scenario, configurable
> > via
> > > > > > > > > > configuration?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.e Is the leader epoch cache offloaded with
> every
> > > > > segment?
> > > > > > > Or
> > > > > > > > > when
> > > > > > > > > > > > > a new checkpoint is detected? If that information
> is
> > not
> > > > > > always
> > > > > > > > > > > > > offloaded to avoid duplicating data, how does the
> > remote
> > > > > > > storage
> > > > > > > > > > > > > satisfy the request to retrieve it?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.f Since the leader epoch cache covers the
> entire
> > > > > replica
> > > > > > > > > lineage,
> > > > > > > > > > > > > what happens if, after a leader epoch cache file is
> > > > > offloaded
> > > > > > > with
> > > > > > > > > a
> > > > > > > > > > > > > given segment, the local epoch cache is truncated
> > [not
> > > > > > > necessarily
> > > > > > > > > > for
> > > > > > > > > > > > > a range of offset included in tiered segments]? How
> > are
> > > > > > remote
> > > > > > > and
> > > > > > > > > > > > > local leader epoch caches kept consistent?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.g Consumer can also use leader epochs (e.g. to
> > enable
> > > > > > > fencing
> > > > > > > > > to
> > > > > > > > > > > > > protect against stale leaders). What differences
> > would
> > > > > there
> > > > > > be
> > > > > > > > > > > > > between consumer and follower fetches? Especially,
> > would
> > > > > > > consumers
> > > > > > > > > > > > > also fetch leader epoch information from the remote
> > > > > storage?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 900.h Assume a newly elected leader of a
> > topic-partition
> > > > > > > detects
> > > > > > > > > more
> > > > > > > > > > > > > recent segments are available in the external storage,
> > > > > > > > > > > > > with epochs larger than its local epoch. Does it ignore these segments and
> > their
> > > > > > > associated
> > > > > > > > > > > > > epoch-to-offset vectors? Or try to reconstruct its
> > local
> > > > > > > replica
> > > > > > > > > > > > > lineage based on the data remotely available?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Alexandre
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing
> > > > > > > > > > > > >
> > > > > > > > > > > > > Le jeu. 4 juin 2020 à 19:55, Satish Duggana <
> > > > > > > > > > satish.dugg...@gmail.com>
> > > > > > > > > > > > a écrit :
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > > Please let us know if you have any comments on
> > > > > > "transactional
> > > > > > > > > > > support"
> > > > > > > > > > > > > > and "follower requests/replication" mentioned in
> > the
> > > > > wiki.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Satish.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana <
> > > > > > > > > > > > satish.dugg...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks Jun for your comments.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100. It would be useful to provide more
> details
> > on
> > > > how
> > > > > > > those
> > > > > > > > > > apis
> > > > > > > > > > > > are used. Otherwise, it's kind of hard to really
> assess
> > > > > whether
> > > > > > > the
> > > > > > > > > new
> > > > > > > > > > > > apis are sufficient/redundant. A few examples below.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We will update the wiki and let you know.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.1 deleteRecords seems to only advance the
> > > > > > > logStartOffset
> > > > > > > > > in
> > > > > > > > > > > > Log. How does that trigger the deletion of remote log
> > > > > segments?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > RLMTask for leader partition periodically
> checks
> > > > > whether
> > > > > > > there
> > > > > > > > > > are
> > > > > > > > > > > > > > > remote log segments earlier to logStartOffset
> > and the
> > > > > > > > > respective
> > > > > > > > > > > > > > > remote log segment metadata and data are
> deleted
> > by
> > > > > using
> > > > > > > RLMM
> > > > > > > > > > and
> > > > > > > > > > > > > > > RSM.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.2 stopReplica with deletion is used in 2
> > cases
> > > > (a)
> > > > > > > replica
> > > > > > > > > > > > reassignment; (b) topic deletion. We only want to
> > delete
> > > > the
> > > > > > > tiered
> > > > > > > > > > > > metadata in the second case. Also, in the second
> case,
> > who
> > > > > > > initiates
> > > > > > > > > > the
> > > > > > > > > > > > deletion of the remote segment since the leader may
> not
> > > > > exist?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Right, it is deleted only incase of topic
> > deletion
> > > > > only.
> > > > > > We
> > > > > > > > > will
> > > > > > > > > > > > cover
> > > > > > > > > > > > > > > the details in the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.3 "LogStartOffset of a topic can be either
> > in
> > > > > local
> > > > > > > or in
> > > > > > > > > > > > remote storage." If LogStartOffset exists in both
> > places,
> > > > > which
> > > > > > > one
> > > > > > > > > is
> > > > > > > > > > > the
> > > > > > > > > > > > source of truth?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I meant the logStartOffset can point to either
> of
> > > > local
> > > > > > > segment
> > > > > > > > > > or
> > > > > > > > > > > > > > > remote segment but it is initialised and
> > maintained
> > > > in
> > > > > > the
> > > > > > > Log
> > > > > > > > > > > class
> > > > > > > > > > > > > > > like now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.4 List<RemoteLogSegmentMetadata>
> > > > > > > > > > > > listRemoteLogSegments(TopicPartition topicPartition,
> > long
> > > > > > > minOffset):
> > > > > > > > > > How
> > > > > > > > > > > > is minOffset supposed to be used?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Returns list of remote segments, sorted by
> > baseOffset
> > > > > in
> > > > > > > > > > ascending
> > > > > > > > > > > > > > > order that have baseOffset >= the given min
> > Offset.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.5 When copying a segment to remote
> storage,
> > it
> > > > > seems
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > > > calling the same RLMM.putRemoteLogSegmentData() twice
> > > > before
> > > > > > and
> > > > > > > > > after
> > > > > > > > > > > > copyLogSegment(). Could you explain why?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is more about prepare/commit/rollback as
> you
> > > > > > > suggested.
> > > > > > > > > We
> > > > > > > > > > > will
> > > > > > > > > > > > > > > update the wiki with the new APIs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >100.6 LogSegmentData includes
> leaderEpochCache,
> > but
> > > > > > there
> > > > > > > is
> > > > > > > > > no
> > > > > > > > > > > api
> > > > > > > > > > > > in RemoteStorageManager to retrieve it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nice catch, copy/paste issue. There is an API
> to
> > > > > retrieve
> > > > > > > it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >101. If the __remote_log_metadata is for
> > production
> > > > > > usage,
> > > > > > > > > could
> > > > > > > > > > > > you provide more details? For example, what is the
> > schema
> > > > of
> > > > > > the
> > > > > > > data
> > > > > > > > > > > (both
> > > > > > > > > > > > key and value)? How is the topic maintained,delete or
> > > > > compact?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It is with delete config and it’s retention
> > period is
> > > > > > > suggested
> > > > > > > > > > to
> > > > > > > > > > > be
> > > > > > > > > > > > > > > more than the remote retention period.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >110. Is the cache implementation in
> > > > > > > RemoteLogMetadataManager
> > > > > > > > > > meant
> > > > > > > > > > > > for production usage? If so, could you provide more
> > details
> > > > > on
> > > > > > > the
> > > > > > > > > > schema
> > > > > > > > > > > > and how/where the data is stored?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The proposal is to have a cache (with default
> > > > > > > implementation
> > > > > > > > > > backed
> > > > > > > > > > > > by
> > > > > > > > > > > > > > > rocksdb) but it will be added in later
> versions.
> > We
> > > > > will
> > > > > > > add
> > > > > > > > > this
> > > > > > > > > > > to
> > > > > > > > > > > > > > > future work items.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >111. "Committed offsets can be stored in a
> local
> > > > > file".
> > > > > > > Could
> > > > > > > > > > you
> > > > > > > > > > > > describe the format of the file and where it's
> stored?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We will cover this in the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >112. Truncation of remote segments under
> unclean
> > > > > leader
> > > > > > > > > > election:
> > > > > > > > > > > I
> > > > > > > > > > > > am not sure who figures out the truncated remote
> > segments
> > > > and
> > > > > > how
> > > > > > > > > that
> > > > > > > > > > > > information is propagated to all replicas?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We will add this in detail in the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >113. "If there are any failures in removing
> > remote
> > > > log
> > > > > > > > > segments
> > > > > > > > > > > > then those are stored in a specific topic (default as
> > > > > > > > > > > > __remote_segments_to_be_deleted)". Is it necessary to
> > add
> > > > yet
> > > > > > > another
> > > > > > > > > > > > internal topic? Could we just keep retrying?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is not really an internal topic, it will
> be
> > > > > exposed
> > > > > > > as a
> > > > > > > > > > user
> > > > > > > > > > > > > > > configurable topic. After a few retries, we
> want
> > user
> > > > > to
> > > > > > > know
> > > > > > > > > > about
> > > > > > > > > > > > > > > the failure so that they can take an action
> > later by
> > > > > > > consuming
> > > > > > > > > > from
> > > > > > > > > > > > > > > this topic. We want to keep this simple instead
> > of
> > > > > > retrying
> > > > > > > > > > > > > > > continuously and maintaining the deletion state
> > etc.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >114. "We may not need to copy
> > producer-id-snapshot
> > > > as
> > > > > we
> > > > > > > are
> > > > > > > > > > > > copying only segments earlier to last-stable-offset."
> > Hmm,
> > > > > not
> > > > > > > sure
> > > > > > > > > > about
> > > > > > > > > > > > that. The producer snapshot includes things like the
> > last
> > > > > > > timestamp
> > > > > > > > > of
> > > > > > > > > > > each
> > > > > > > > > > > > open producer id and can affect when those producer
> > ids are
> > > > > > > expired.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Sure, this will be added as part of the
> > > > LogSegmentData.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Satish.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, May 29, 2020 at 6:39 AM Jun Rao <
> > > > > > j...@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi, Satish,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Made another pass on the wiki. A few more
> > comments
> > > > > > below.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 100. It would be useful to provide more
> > details on
> > > > > how
> > > > > > > those
> > > > > > > > > > apis
> > > > > > > > > > > > are used. Otherwise, it's kind of hard to really
> assess
> > > > > whether
> > > > > > > the
> > > > > > > > > new
> > > > > > > > > > > > apis are sufficient/redundant. A few examples below.
> > > > > > > > > > > > > > > > 100.1 deleteRecords seems to only advance the
> > > > > > > logStartOffset
> > > > > > > > > in
> > > > > > > > > > > > Log. How does that trigger the deletion of remote log
> > > > > segments?
> > > > > > > > > > > > > > > > 100.2 stopReplica with deletion is used in 2
> > cases
> > > > > (a)
> > > > > > > > > replica
> > > > > > > > > > > > reassignment; (b) topic deletion. We only want to
> > delete
> > > > the
> > > > > > > tiered
> > > > > > > > > > > > metadata in the second case. Also, in the second
> case,
> > who
> > > > > > > initiates
> > > > > > > > > > the
> > > > > > > > > > > > deletion of the remote segment since the leader may
> not
> > > > > exist?
> > > > > > > > > > > > > > > > 100.3 "LogStartOffset of a topic can be
> either
> > in
> > > > > local
> > > > > > > or in
> > > > > > > > > > > > remote storage." If LogStartOffset exists in both
> > places,
> > > > > which
> > > > > > > one
> > > > > > > > > is
> > > > > > > > > > > the
> > > > > > > > > > > > source of truth?
> > > > > > > > > > > > > > > > 100.4 List<RemoteLogSegmentMetadata>
> > > > > > > > > > > > listRemoteLogSegments(TopicPartition topicPartition,
> > long
> > > > > > > minOffset):
> > > > > > > > > > How
> > > > > > > > > > > > is minOffset supposed to be used?
> > > > > > > > > > > > > > > > 100.5 When copying a segment to remote
> > storage, it
> > > > > > seems
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > > > calling the same RLMM.putRemoteLogSegmentData() twice
> > > > before
> > > > > > and
> > > > > > > > > after
> > > > > > > > > > > > copyLogSegment(). Could you explain why?
> > > > > > > > > > > > > > > > 100.6 LogSegmentData includes
> > leaderEpochCache, but
> > > > > > > there is
> > > > > > > > > no
> > > > > > > > > > > > api in RemoteStorageManager to retrieve it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 101. If the __remote_log_metadata is for
> > production
> > > > > > > usage,
> > > > > > > > > > could
> > > > > > > > > > > > you provide more details? For example, what is the
> > schema
> > > > of
> > > > > > the
> > > > > > > data
> > > > > > > > > > > (both
> > > > > > > > > > > > key and value)? How is the topic maintained,delete or
> > > > > compact?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 110. Is the cache implementation in
> > > > > > > RemoteLogMetadataManager
> > > > > > > > > > > meant
> > > > > > > > > > > > for production usage? If so, could you provide more
> > details
> > > > > on
> > > > > > > the
> > > > > > > > > > schema
> > > > > > > > > > > > and how/where the data is stored?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 111. "Committed offsets can be stored in a
> > local
> > > > > file".
> > > > > > > Could
> > > > > > > > > > you
> > > > > > > > > > > > describe the format of the file and where it's
> stored?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 112. Truncation of remote segments under
> > unclean
> > > > > leader
> > > > > > > > > > election:
> > > > > > > > > > > > I am not sure who figures out the truncated remote
> > segments
> > > > > and
> > > > > > > how
> > > > > > > > > > that
> > > > > > > > > > > > information is propagated to all replicas?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 113. "If there are any failures in removing
> > remote
> > > > > log
> > > > > > > > > segments
> > > > > > > > > > > > then those are stored in a specific topic (default as
> > > > > > > > > > > > __remote_segments_to_be_deleted)". Is it necessary to
> > add
> > > > yet
> > > > > > > another
> > > > > > > > > > > > internal topic? Could we just keep retrying?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 114. "We may not need to copy
> > producer-id-snapshot
> > > > as
> > > > > > we
> > > > > > > are
> > > > > > > > > > > > copying only segments earlier to last-stable-offset."
> > Hmm,
> > > > > not
> > > > > > > sure
> > > > > > > > > > about
> > > > > > > > > > > > that. The producer snapshot includes things like the
> > last
> > > > > > > timestamp
> > > > > > > > > of
> > > > > > > > > > > each
> > > > > > > > > > > > open producer id and can affect when those producer
> > ids are
> > > > > > > expired.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, May 28, 2020 at 5:38 AM Satish
> Duggana
> > <
> > > > > > > > > > > > satish.dugg...@gmail.com> wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Hi Jun,
> > > > > > > > > > > > > > > >> Gentle reminder. Please go through the
> updated
> > > > wiki
> > > > > > and
> > > > > > > let
> > > > > > > > > us
> > > > > > > > > > > > know your comments.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Thanks,
> > > > > > > > > > > > > > > >> Satish.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On Tue, May 19, 2020 at 3:50 PM Satish
> > Duggana <
> > > > > > > > > > > > satish.dugg...@gmail.com> wrote:
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Hi Jun,
> > > > > > > > > > > > > > > >>> Please go through the wiki which has the
> > latest
> > > > > > > updates.
> > > > > > > > > > Google
> > > > > > > > > > > > doc is updated frequently to be in sync with wiki.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Thanks,
> > > > > > > > > > > > > > > >>> Satish.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> On Tue, May 19, 2020 at 12:30 AM Jun Rao <
> > > > > > > j...@confluent.io
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Hi, Satish,
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Thanks for the update. Just to clarify.
> > Which
> > > > doc
> > > > > > has
> > > > > > > the
> > > > > > > > > > > > latest updates, the wiki or the google doc?
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Jun
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> On Thu, May 14, 2020 at 10:38 AM Satish
> > Duggana
> > > > <
> > > > > > > > > > > > satish.dugg...@gmail.com> wrote:
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Hi Jun,
> > > > > > > > > > > > > > > >>>>> Thanks for your comments.  We updated the
> > KIP
> > > > > with
> > > > > > > more
> > > > > > > > > > > > details.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> >100. For each of the operations related
> to
> > > > > > tiering,
> > > > > > > it
> > > > > > > > > > would
> > > > > > > > > > > > be useful to provide a description on how it works
> > with the
> > > > > new
> > > > > > > API.
> > > > > > > > > > > These
> > > > > > > > > > > > include things like consumer fetch, replica fetch,
> > > > > > > > > offsetForTimestamp,
> > > > > > > > > > > > retention (remote and local) by size, time and
> > > > > logStartOffset,
> > > > > > > topic
> > > > > > > > > > > > deletion, etc. This will tell us if the proposed APIs
> > are
> > > > > > > sufficient.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> We addressed most of these APIs in the
> > KIP. We
> > > > > can
> > > > > > > add
> > > > > > > > > more
> > > > > > > > > > > > details if needed.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> >101. For the default implementation
> based
> > on
> > > > > > > internal
> > > > > > > > > > topic,
> > > > > > > > > > > > is it meant as a proof of concept or for production
> > usage?
> > > > I
> > > > > > > assume
> > > > > > > > > > that
> > > > > > > > > > > > it's the former. However, if it's the latter, then
> the
> > KIP
> > > > > > needs
> > > > > > > to
> > > > > > > > > > > > describe the design in more detail.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> It is production usage as was mentioned
> in
> > an
> > > > > > earlier
> > > > > > > > > mail.
> > > > > > > > > > > We
> > > > > > > > > > > > plan to update this section in the next few days.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> >102. When tiering a segment, the segment
> > is
> > > > > first
> > > > > > > > > written
> > > > > > > > > > to
> > > > > > > > > > > > the object store and then its metadata is written to
> > RLMM
> > > > > using
> > > > > > > the
> > > > > > > > > api
> > > > > > > > > > > > "void putRemoteLogSegmentData()". One potential issue
> > with
> > > > > this
> > > > > > > > > > approach
> > > > > > > > > > > is
> > > > > > > > > > > > that if the system fails after the first operation,
> it
> > > > > leaves a
> > > > > > > > > garbage
> > > > > > > > > > > in
> > > > > > > > > > > > the object store that's never reclaimed. One way to
> > improve
> > > > > > this
> > > > > > > is
> > > > > > > > > to
> > > > > > > > > > > have
> > > > > > > > > > > > two separate APIs, sth like
> > > > preparePutRemoteLogSegmentData()
> > > > > > and
> > > > > > > > > > > > commitPutRemoteLogSegmentData().
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> That is a good point. We currently have a
> > > > > different
> > > > > > > way
> > > > > > > > > > using
> > > > > > > > > > > > markers in the segment but your suggestion is much
> > better.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> >103. It seems that the transactional
> > support
> > > > and
> > > > > > the
> > > > > > > > > > ability
> > > > > > > > > > > > to read from follower are missing.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> KIP is updated with transactional
> support,
> > > > > follower
> > > > > > > fetch
> > > > > > > > > > > > semantics, and reading from a follower.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> >104. It would be useful to provide a
> > testing
> > > > > plan
> > > > > > > for
> > > > > > > > > this
> > > > > > > > > > > > KIP.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> We added a few tests by introducing test
> > util
> > > > for
> > > > > > > tiered
> > > > > > > > > > > > storage in the PR. We will provide the testing plan
> in
> > the
> > > > > next
> > > > > > > few
> > > > > > > > > > days.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > > >>>>> Satish.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>>
On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io> wrote:

On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Satish,
>
> Thanks for the updated doc. The new API seems to be an improvement overall.
> A few more comments below.
>
> 100. For each of the operations related to tiering, it would be useful to
> provide a description on how it works with the new API. These include things
> like consumer fetch, replica fetch, offsetForTimestamp, retention (remote
> and local) by size, time and logStartOffset, topic deletion, etc. This will
> tell us if the proposed APIs are sufficient.

Thanks for the feedback, Jun. We will add more details around this.

> 101. For the default implementation based on an internal topic, is it meant
> as a proof of concept or for production usage? I assume that it's the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.

Yes, it is meant to be for production use. Ideally it would be good to merge
this in as the default implementation for the metadata service. We can add
more details around design and testing.

> 102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the api
> "void putRemoteLogSegmentData()". One potential issue with this approach is
> that if the system fails after the first operation, it leaves garbage in the
> object store that's never reclaimed. One way to improve this is to have two
> separate APIs, sth like preparePutRemoteLogSegmentData() and
> commitPutRemoteLogSegmentData().
>
> 103. It seems that the transactional support and the ability to read from
> follower are missing.
>
> 104. It would be useful to provide a testing plan for this KIP.

We are working on adding more details around transactional support and coming
up with a test plan. We will add system tests and integration tests.

> Thanks,
>
> Jun

On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <satish.dugg...@gmail.com>
wrote:

Hi Jun,
Please look at the earlier reply and let us know your comments.

Thanks,
Satish.

On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <satish.dugg...@gmail.com>
wrote:

Hi Jun,
Thanks for your comments on the separation of remote log metadata storage and
remote log storage.

We had a few discussions since early Jan on how to support eventually
consistent stores like S3 by uncoupling remote log segment metadata and remote
log storage. It is written up with details in the doc here [1]. Below is a
brief summary of the discussion from that doc.

The current approach consists of pulling the remote log segment metadata from
remote log storage APIs. It worked fine for storages like HDFS. But one of the
problems of relying on the remote storage to maintain metadata is that tiered
storage needs it to be strongly consistent, with an impact not only on the
metadata (e.g. LIST in S3) but also on the segment data (e.g. GET after a
DELETE in S3). The cost of maintaining metadata in remote storage also needs
to be factored in. This is true in the case of S3, where LIST APIs incur huge
costs, as you raised earlier.

So, it is good to separate the remote storage from the remote log metadata
store. We refactored the existing RemoteStorageManager and introduced
RemoteLogMetadataManager. The remote log metadata store should give strong
consistency semantics, but the remote log storage can be eventually
consistent.

We can have a default implementation for RemoteLogMetadataManager which uses
an internal topic (as mentioned in one of our earlier emails) as storage. But
users can always plug in their own RemoteLogMetadataManager implementation
based on their environment.

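To make the split concrete, a strongly consistent metadata plugin could look
roughly like the sketch below. The method names and types are illustrative
placeholders, not the exact KIP interface; the segment bytes themselves can
still live in an eventually consistent store.

// Illustrative sketch only; names and signatures are placeholders, not the
// exact KIP-405 RemoteLogMetadataManager interface.
public interface RemoteLogMetadataManagerSketch {

    // Record metadata for a segment that has been copied to remote storage.
    // The backing store must be strongly consistent (e.g. an internal topic).
    void putRemoteLogSegmentMetadata(String topicPartition, long baseOffset,
                                     long endOffset, byte[] serializedMetadata);

    // Look up the metadata of the segment containing the given offset, if any.
    byte[] remoteLogSegmentMetadata(String topicPartition, long offset);

    // Plugins receive their configuration (e.g. "remote.log.metadata.*"
    // properties) through configure(), like other pluggable Kafka components.
    void configure(java.util.Map<String, ?> configs);
}
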
Please go through the updated KIP and let us know your comments. We have
started refactoring for the changes mentioned in the KIP and there may be a
few more updates to the APIs.

[1]
https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#

On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com>
wrote:

Hi all,

Jun:

> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> have 100,000 partitions and want to pull the metadata for each partition at
> the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.

I want to note here that no reasonably durable storage will be cheap at 100k
RPS. For example, DynamoDB might give the same ballpark figures.

If we want to keep the pull-based approach, we can try to reduce this number
in several ways: doing listings less frequently (as Satish mentioned, with the
current defaults it's ~3.33k RPS for your example), or batching listing
operations in some way (depending on the storage; it might require a change of
RSM's interface).

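For reference, the arithmetic behind those figures is simple to reproduce. The
per-request price and the intervals are the numbers quoted in this thread; the
rest is plain multiplication, not a pricing statement.

// Back-of-the-envelope cost of per-partition S3 LIST polling, using the
// numbers quoted in this thread.
public class S3ListCostEstimate {
    public static void main(String[] args) {
        double pricePerRequest = 0.005 / 1000;  // $0.005 per 1000 LIST requests
        int partitions = 100_000;

        // Polling every partition once per second: 100k requests/sec.
        double perSecond = partitions * pricePerRequest;          // $0.5/sec
        System.out.printf("1/sec polling: $%.2f/sec, ~$%.0f/day%n",
                perSecond, perSecond * 86_400);                   // ~$43,200/day

        // With the default 30s interval it is ~3.33k requests/sec.
        double perSecond30 = partitions / 30.0 * pricePerRequest;
        System.out.printf("30s polling: $%.4f/sec, ~$%.0f/day%n",
                perSecond30, perSecond30 * 86_400);               // ~$1,440/day
    }
}
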
> There are different ways for doing push based metadata propagation. Some
> object stores may support that already. For example, S3 supports event
> notifications.

This sounds interesting. However, I see a couple of issues using it:
1. As I understand the documentation, notification delivery is not guaranteed
and it's recommended to periodically do LIST to fill the gaps, which brings us
back to the same LIST consistency guarantees issue.
2. The same goes for the broker start: to get the current state, we need to
LIST.
3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't
designed for such a case.

Alexandre:

> A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM
> cannot rely solely on S3 APIs to guarantee the expected strong consistency.
> The proposed implementation [3] would need to be updated to take this into
> account. Let’s talk more about this.

Thank you for the feedback. I clearly see the need for changing the S3
implementation to provide stronger consistency guarantees. As I see from this
thread, there are several possible approaches to this. Let's discuss
RemoteLogManager's contract and behavior (like pull vs push model) further
before picking one (or several?) of them.
I'm going to do some evaluation of DynamoDB for the pull-based approach, to
see if it's possible to apply it paying a reasonable bill, and also of the
push-based approach with a Kafka topic as the medium.

> A.2.3 Atomicity – what does an implementation of RSM need to provide with
> respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
> deleteTopicPartition? If a partial failure happens in any of those (e.g. in
> the S3 implementation, if one of the multiple uploads fails [4]),

The S3 implementation is going to change, but it's worth clarifying anyway.

The segment log file is uploaded only after S3 has acked the upload of all
other files associated with the segment, and only after this does the whole
segment file set become visible remotely for operations like
listRemoteSegments [1].
In case of upload failure, the files that have been successfully uploaded stay
as invisible garbage that is collected by cleanupLogUntil (or overwritten
successfully later).
And the opposite happens during the deletion: log files are deleted first.

This approach should generally work when we solve the consistency issues by
adding a strongly consistent storage: a segment's uploaded files remain
invisible garbage until some metadata about them is written.

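A rough sketch of that upload ordering, purely to illustrate the idea; the
class and method names are made up and are not the proposed RSM API.

import java.io.IOException;
import java.nio.file.Path;

// Illustrative sketch only; names are placeholders, not the actual RSM API.
public class SegmentUploadOrdering {

    // Stand-in for whatever object-store client an RSM implementation uses.
    interface ObjectStoreClient {
        void upload(String key, Path file) throws IOException;
    }

    // Auxiliary files go first; the log file is uploaded last and acts as the
    // "commit marker": the segment only becomes visible once the log file
    // (or, with a strongly consistent metadata store, its metadata entry)
    // exists.
    static void copySegmentToRemote(ObjectStoreClient store, String segmentPrefix,
                                    Path offsetIndex, Path timeIndex, Path logFile)
            throws IOException {
        store.upload(segmentPrefix + "/offset-index", offsetIndex);
        store.upload(segmentPrefix + "/time-index", timeIndex);
        // A crash before this point leaves the index files as invisible
        // garbage, to be collected by cleanupLogUntil or overwritten later.
        store.upload(segmentPrefix + "/log", logFile);
    }
}
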
> A.3 Caching – storing locally the segments retrieved from the remote
> storage is excluded as it does not align with the original intent and even
> defeats some of its purposes (save disk space etc.). That said, could there
> be other types of use cases where the pattern of access to the remotely
> stored segments would benefit from local caching (and potentially
> read-ahead)? Consider the use case of a large pool of consumers which start
> a backfill at the same time for one day's worth of data from one year ago
> stored remotely. Caching the segments locally would allow uncoupling the
> load on the remote storage from the load on the Kafka cluster. Maybe the RLM
> could expose a configuration parameter to switch that feature on/off?

I tend to agree here: caching remote segments locally and making this
configurable sounds pretty practical to me. We should implement this, maybe
not in the first iteration.

Br,
Ivan

[1]
https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152

On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com>
wrote:

Hi Jun,

Thank you for the feedback. I am trying to understand how a push-based
approach would work. In order for the metadata to be propagated (under the
assumption you stated), would you plan to add a new API in Kafka to allow the
metadata store to send them directly to the brokers?

Thanks,
Alexandre

On Wed, Dec 18, 2019 at 20:14, Jun Rao <j...@confluent.io> wrote:

Hi, Satish and Ivan,

There are different ways for doing push based metadata propagation. Some
object stores may support that already. For example, S3 supports event
notifications
(https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
Otherwise one could use a separate metadata store that supports push-based
change propagation. Other people have mentioned using a Kafka topic. The best
approach may depend on the object store and the operational environment (e.g.
whether an external metadata store is already available).

The above discussion is based on the assumption that we need to cache the
object metadata locally in every broker. I mentioned earlier that an
alternative is to just store/retrieve those metadata in an external metadata
store. That may simplify the implementation in some cases.

Thanks,

Jun

On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com>
wrote:

Hi Jun,
Thanks for your reply.

Currently, `listRemoteSegments` is called at the configured interval (not
every second; it defaults to 30 secs). Storing remote log metadata in a
strongly consistent store for the S3 RSM is raised in a PR comment [1].
RLM invokes RSM at regular intervals and RSM can give remote segment metadata
if it is available. RSM is responsible for maintaining and fetching those
entries. It should be based on whatever mechanism is consistent and efficient
with the respective remote storage.

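Put differently, the pull model above boils down to something like the
following loop; the scheduling code and names here are only illustrative, not
the actual RemoteLogManager implementation.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative pull loop only.
public class RemoteSegmentPoller {

    // Stand-in for the part of the RSM interface being discussed here.
    interface RemoteStorageManager {
        List<String> listRemoteSegments(String topicPartition);
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // RLM asks RSM for the current remote segment metadata at a fixed
    // interval (the default discussed in this thread is 30 seconds).
    void start(RemoteStorageManager rsm, String topicPartition, long intervalSec) {
        scheduler.scheduleAtFixedRate(
                () -> rsm.listRemoteSegments(topicPartition)
                        .forEach(segment -> { /* reconcile local metadata */ }),
                0, intervalSec, TimeUnit.SECONDS);
    }
}
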
Can you give more details about the push based mechanism from RSM?

1. https://github.com/apache/kafka/pull/7561#discussion_r344576223

Thanks,
Satish.

On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:

Hi, Harsha,

Thanks for the reply.

40/41. I am curious which block storages you have tested. S3 seems to be one
of the popular block stores. The concerns that I have with the pull based
approach are the following.

(a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have
100,000 partitions and want to pull the metadata for each partition at the
rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.

(b) Semantics: S3 list objects are eventually consistent. So, when you do a
list object request, there is no guarantee that you can see all uploaded
objects. This could impact the correctness of subsequent logic.

(c) Efficiency: Blindly pulling metadata when there is no change adds
unnecessary overhead in the broker as well as in the block store.

So, have you guys tested S3? If so, could you share your experience in terms
of cost, semantics and efficiency?

Jun

On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io> wrote:

Hi Jun,
Thanks for the reply.

On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Satish and Ying,
>
> Thanks for the reply.
>
> 40/41. There are two different ways that we can approach this. One is what
> you said. We can have an opinionated way of storing and populating the tier
> metadata that we think is good enough for everyone. I am not sure if this is
> the case based on what's currently proposed in the KIP. For example, I am
> not sure that (1) everyone always needs local metadata; (2) the current
> local storage format is general enough and (3) everyone wants to use the
> pull based approach to propagate the metadata. Another approach is to make
> this pluggable and let the implementor implement the best approach for a
> particular block storage. I haven't seen any comments from Slack/AirBnb in
> the mailing list on this topic. It would be great if they can provide
> feedback directly here.

The current interfaces are designed with the most popular block storages
available today, and we did 2 implementations with these interfaces; both are
yielding good results as we go through the testing of it.
If there is ever a need for a pull based approach we can definitely evolve the
interface.
In the past we did mark interfaces as evolving to make room for unknowns in
the future.
If you have any suggestions around the current interfaces, please propose
them; we are happy to see if we can work them into it.

> 43. To offer tier storage as a general feature, ideally all existing
> capabilities should still be supported. It's fine if the uber implementation
> doesn't support all capabilities for internal usage. However, the framework
> should be general enough.

We agree on that as a principle. But all of these major features are mostly
coming right now, and to have a new big feature such as tiered storage support
all the new features will be a big ask. We can document how we approach
solving these in future iterations.
Our goal is to make this tiered storage feature work for everyone.

> 43.3 This is more than just serving the tiered data from block storage.
> With KIP-392, the consumer now can resolve conflicts with the replica based
> on leader epoch. So, we need to make sure that the leader epoch can be
> recovered properly from tier storage.

We are working on testing our approach and we will update the KIP with design
details.

> 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
> support moving such metadata across disk directories since JBOD supports
> moving data across disks.

KIP is updated with JBOD details. Having said that, JBOD tooling needs to
evolve to support production loads. Most of the users will be interested in
using tiered storage without JBOD support on day 1.

Thanks,
Harsha

> As for the meeting, we could have a KIP e-meeting on this if needed, but it
> will be open to everyone and will be recorded and shared. Often, the details
> are still resolved through the mailing list.
>
> Jun

On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid> wrote:

Please ignore my previous email.
I didn't know Apache requires all the discussions to be "open".

On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:

Hi Jun,

Thank you very much for your feedback!

Can we schedule a meeting in your Palo Alto office in December? I think a face
to face discussion is much more efficient than emails. Both Harsha and I can
visit you. Satish may be able to join us remotely.

On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:

Hi, Satish and Harsha,

The following is more detailed, high-level feedback for the KIP. Overall, the
KIP seems useful. The challenge is how to design it such that it's general
enough to support different ways of implementing this feature and to support
existing features.

40. Local segment metadata storage: The KIP makes the assumption that the
metadata for the archived log segments is cached locally in every broker, and
it provides a specific implementation for the local storage in the framework.
We probably should discuss this more. For example, some tier storage providers
may not want to cache the metadata locally and would just rely upon a remote
key/value store if such a store is already present. If a local store is used,
there could be different ways of implementing it (e.g., based on customized
local files, an embedded local store like RocksDB, etc). An alternative way of
designing this is to just provide an interface for retrieving the tier segment
metadata and leave the details of how to get the metadata outside of the
framework.

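Sketching the alternative described above (the names are hypothetical, not a
proposal for the actual interface): the framework would only depend on a read
API, and where the metadata lives, whether local files, RocksDB, or a remote
key/value store, would be the plugin's concern.

import java.util.Optional;

// Hypothetical "retrieval-only" view of tier segment metadata; the storage
// medium is hidden behind the interface and left entirely to the plugin.
public interface TierSegmentMetadataView {

    // Minimal placeholder for whatever the framework needs to locate a segment.
    record TierSegmentMetadata(String remoteObjectKey, long baseOffset, long endOffset) { }

    // Return the remote segment (if any) that contains the given offset.
    Optional<TierSegmentMetadata> segmentForOffset(String topic, int partition, long offset);

    // Earliest offset still available in remote storage for this partition.
    Optional<Long> earliestRemoteOffset(String topic, int partition);
}
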
41. RemoteStorageManager interface and the usage of the interface in the
framework: I am not sure if the interface is general enough. For example, it
seems that RemoteLogIndexEntry is tied to a specific way of storing the
metadata in remote storage. The framework uses the listRemoteSegments() api in
a pull based approach. However, in some other implementations, a push based
approach may be preferred. I don't have a concrete proposal yet. But it would
be useful to give this area some more thought and see if we can make the
interface more general.

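One way to picture the push based alternative (names are purely hypothetical,
not from the KIP): instead of the framework polling listRemoteSegments(), the
plugin would notify a listener when remote segment metadata changes.

// Hypothetical push-based shape, for contrast with pull-based polling of
// listRemoteSegments(); none of these names are from the KIP.
public interface RemoteSegmentChangeListener {

    // Called by the plugin when a new remote segment becomes visible.
    void onSegmentAdded(String topic, int partition, long baseOffset, long endOffset);

    // Called when a remote segment is deleted, e.g. by retention.
    void onSegmentDeleted(String topic, int partition, long baseOffset);
}

// The plugin would accept a listener registration instead of being polled.
interface PushBasedRemoteStorageManager {
    void register(RemoteSegmentChangeListener listener);
}
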
42. In the diagram, the RemoteLogManager is side by side with LogManager.
This KIP only discussed how the fetch request is handled between the two
layers. However, we should also consider how other requests that touch the
log can be handled, e.g. list offsets by timestamp, delete records, etc.
Also, in this model, it's not clear which component is responsible for
managing the log start offset. It seems that the log start offset could be
changed by both RemoteLogManager and LogManager.
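
As one way to picture the concern, here is a hypothetical Java sketch (all
of these class and method names are made up for illustration, none come
from the KIP) of a list-offsets-by-timestamp lookup that has to consult
both layers, and of a log start offset that is effectively influenced by
both of them:

import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch; the names are illustrative, not from the KIP.
interface LocalLogView {
    long earliestLocalTimestamp(TopicPartition tp);
    long findOffsetByTimestamp(TopicPartition tp, long timestampMs);
    long localLogStartOffset(TopicPartition tp);
}

interface RemoteLogView {
    long findOffsetByTimestamp(TopicPartition tp, long timestampMs);
    // Long.MAX_VALUE when nothing has been tiered yet.
    long remoteLogStartOffset(TopicPartition tp);
}

class TieredReadPathSketch {
    private final LocalLogView local;
    private final RemoteLogView remote;

    TieredReadPathSketch(LocalLogView local, RemoteLogView remote) {
        this.local = local;
        this.remote = remote;
    }

    // A ListOffsets(timestamp) style request must go to the remote layer
    // when the timestamp predates all locally retained data.
    long offsetForTimestamp(TopicPartition tp, long timestampMs) {
        if (timestampMs < local.earliestLocalTimestamp(tp)) {
            return remote.findOffsetByTimestamp(tp, timestampMs);
        }
        return local.findOffsetByTimestamp(tp, timestampMs);
    }

    // The effective log start offset is the smaller of the two layers'
    // start offsets, which is why a single owner for this value matters
    // (e.g. for delete-records and retention-driven deletion).
    long logStartOffset(TopicPartition tp) {
        return Math.min(remote.remoteLogStartOffset(tp),
                        local.localLogStartOffset(tp));
    }
}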
43. There are quite a few existing features not covered by the KIP. It
would be useful to discuss each of those.
43.1 I won't say that compacted topics are rarely used and always small.
For example, KStreams uses compacted topics for storing its state, and
sometimes the size of the topic could be large. While it might be ok to
not support compacted topics initially, it would be useful to have a high
level idea on how this might be supported down the road so that we don't
have to make incompatible API changes in the future.
43.2 We need to discuss how EOS is supported. In particular, how is the
producer state integrated with the remote storage?
43.3 Now that KIP-392 (allow consumers to fetch from the closest replica)
is implemented, we need to discuss how reading from a follower replica is
supported with tier storage.
43.4 We need to discuss how JBOD is supported with tier storage.
Thanks,

Jun
On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
Thanks for those insights Ying.
On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:
> Thanks, I missed that point. However, there's still a point at which the
> consumer fetches start getting served from remote storage (even if that
> point isn't as soon as the local log retention time/size). This
> represents a kind of performance cliff edge, and what I'm really
> interested in is how easy it is for a consumer which falls off that
> cliff to catch up so that its fetches again come from local storage.
> Obviously this can depend on all sorts of factors (like production rate,
> consumption rate), so it's not guaranteed (just like it's not guaranteed
> for Kafka today), but this would represent a new failure mode.
As I have explained in the last mail, it's a very rare case that a
consumer needs to read remote data. With our experience at Uber, this only
happens when the consumer service had an outage for several hours.

There is not a "performance cliff" as you assume. The remote storage is
even faster than local disks in terms of bandwidth. Reading from remote
storage is going to have higher latency than local disk, but since the
consumer is catching up on several hours of data, it's not sensitive to
sub-second level latency, and each remote read request will read a large
amount of data, making the overall performance better than reading from
local disks.
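
A rough back-of-the-envelope sketch of that amortization argument (all
numbers below are assumptions chosen for illustration, not measurements
from Uber or from the KIP):

// Illustrative arithmetic only; latency, request size, and bandwidth are
// assumed values, not figures from the KIP.
public class RemoteReadThroughputSketch {
    public static void main(String[] args) {
        double remoteLatencySec = 0.2;       // assumed per-request latency
        double requestSizeMB = 64.0;         // assumed large read per remote fetch
        double remoteBandwidthMBps = 1000.0; // assumed sustained object-store bandwidth

        // Time per request = fixed latency + transfer time.
        double timePerRequestSec = remoteLatencySec + requestSizeMB / remoteBandwidthMBps;
        double effectiveMBps = requestSizeMB / timePerRequestSec;

        // With these assumptions this prints roughly 240 MB/s per in-flight
        // request: the fixed latency is largely amortized away by the large
        // read, which is the point being made for a consumer that is
        // catching up on hours of data rather than doing tail reads.
        System.out.printf("Effective remote read throughput: ~%.0f MB/s%n", effectiveMBps);
    }
}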
> Another aspect I'd like to understand better is the effect that serving
> fetch requests from remote storage has on the broker's network
> utilization. If we're just trimming the amount of data held locally
> (without increasing the overall local+remote retention), then we're
> effectively trading disk bandwidth for network bandwidth when serving
> fetch requests from remote storage (which I understand to be a good
> thing, since brokers are often/usually disk bound). But if we're
> increasing the overall local+remote retention, then it's more likely
> that the network itself becomes the bottleneck.
>
> I appreciate this is all rather hand wavy; I'm just trying to understand
> how this would affect broker performance, so I'd be grateful for any
> insights you can offer.
Network bandwidth is a function of produce speed; it has nothing to do
with remote retention. As long as the data is shipped to remote storage,
you can keep the data there for 1 day, 1 year, or 100 years, and it
doesn't consume any network resources.
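
To spell that out with a toy calculation (the rates and fan-out below are
assumed values for illustration only):

// Illustrative arithmetic only; produce rate and fan-out are assumed values.
public class BrokerNetworkSketch {
    public static void main(String[] args) {
        double produceMBps = 100.0; // assumed aggregate produce rate into the broker
        int replicationFanout = 2;  // assumed followers replicating from this leader
        int consumerFanout = 1;     // assumed consumers reading each message once
        int tierUploadCopies = 1;   // one copy shipped to remote storage

        // Steady-state outbound network is proportional to the produce rate
        // times the fan-out of readers (followers, consumers, tier upload).
        double outboundMBps = produceMBps
                * (replicationFanout + consumerFanout + tierUploadCopies);
        System.out.printf("Outbound network: ~%.0f MB/s%n", outboundMBps);

        // Note what does not appear in the formula: how long the data is
        // retained, locally or remotely. Retention changes the storage
        // footprint, not the steady-state bandwidth.
    }
}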