Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Divij Vaidya Fri, 14 Jul 2023 03:08:35 -0700

Thank you for spotting that Luke. I have fixed the snippet now.

*Satish*, I am waiting for your review on this one since you provided some
comments earlier in this discussion. Please check the KIP once again when
you get a chance and vote at
https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk if you
don't have any further concerns. Thank you!!


--
Divij Vaidya



On Thu, Jul 13, 2023 at 2:39 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks Divij.
>
> I was confusing with the metric tags used by clients that are based on
> topic and partition. Ideally partition label could be at a DEBUG recording
> level, but that's outside the scope of this KIP.
>
> Looks good to me, thanks again!
>
> Jorge.
>
> On Wed, 12 Jul 2023 at 15:55, Divij Vaidya <divijvaidy...@gmail.com>
> wrote:
>
> > Jorge,
> > About API name: Good point. I have changed it to remoteLogSize instead of
> > getRemoteLogSize
> >
> > About partition tag in the metric: We don't use partition tag across any
> of
> > the RemoteStorage metrics and I would like to keep this metric aligned
> with
> > the rest. I will change the metric though to type=BrokerTopicMetrics
> > instead of type=RemoteLogManager, since this is topic level information
> and
> > not specific to RemoteLogManager.
> >
> >
> > Satish,
> > Ah yes! Updated from "This would increase the broker start-up time." to
> > "This would increase the bootstrap time for the remote storage thread
> pool
> > before the first eligible segment is archived."
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Mon, Jul 3, 2023 at 2:07 PM Satish Duggana <satish.dugg...@gmail.com>
> > wrote:
> >
> > > Thanks Divij for taking the feedback and updating the motivation
> > > section in the KIP.
> > >
> > > One more comment on Alternative solution-3, The con is not valid as
> > > that will not affect the broker restart times as discussed in the
> > > earlier email in this thread. You may want to update that.
> > >
> > > ~Satish.
> > >
> > > On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <divijvaidy...@gmail.com>
> > wrote:
> > > >
> > > > Thank you folks for reviewing this KIP.
> > > >
> > > > Satish, I have modified the motivation to make it more clear. Now it
> > > says,
> > > > "Since the main feature of tiered storage is storing a large amount
> of
> > > > data, we expect num_remote_segments to be large. A frequent linear
> scan
> > > > (i.e. listing all segment metadata) could be expensive/slower because
> > of
> > > > the underlying storage used by RemoteLogMetadataManager. This
> slowness
> > to
> > > > list all segment metadata could result in the loss of
> availability...."
> > > >
> > > > Jun, Kamal, Satish, if you don't have any further concerns, I would
> > > > appreciate a vote for this KIP in the voting thread -
> > > > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> > > > kamal.chandraprak...@gmail.com> wrote:
> > > >
> > > > > Hi Divij,
> > > > >
> > > > > Thanks for the explanation. LGTM.
> > > > >
> > > > > --
> > > > > Kamal
> > > > >
> > > > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <
> > > satish.dugg...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Divij,
> > > > > > I am fine with having an API to compute the size as I mentioned
> in
> > my
> > > > > > earlier reply in this mail thread. But I have the below comment
> for
> > > > > > the motivation for this KIP.
> > > > > >
> > > > > > As you discussed offline, the main issue here is listing calls
> for
> > > > > > remote log segment metadata is slower because of the storage used
> > for
> > > > > > RLMM. These can be avoided with this new API.
> > > > > >
> > > > > > Please add this in the motivation section as it is one of the
> main
> > > > > > motivations for the KIP.
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <j...@confluent.io.invalid>
> > > wrote:
> > > > > > >
> > > > > > > Hi, Divij,
> > > > > > >
> > > > > > > Sorry for the late reply.
> > > > > > >
> > > > > > > Given your explanation, the new API sounds reasonable to me. Is
> > > that
> > > > > > enough
> > > > > > > to build the external metadata layer for the remote segments or
> > do
> > > you
> > > > > > need
> > > > > > > some additional API changes?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <
> > > divijvaidy...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thank you for looking into this Kamal.
> > > > > > > >
> > > > > > > > You are right in saying that a cold start (i.e. leadership
> > > failover
> > > > > or
> > > > > > > > broker startup) does not impact the broker startup duration.
> > But
> > > it
> > > > > > does
> > > > > > > > have the following impact:
> > > > > > > > 1. It leads to a burst of full-scan requests to RLMM in case
> > > multiple
> > > > > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > > > > implementation has the capability to serve the total size
> from
> > an
> > > > > index
> > > > > > > > (and hence handle this burst), we wouldn't be able to use it
> > > since
> > > > > the
> > > > > > > > current API necessarily calls for a full scan.
> > > > > > > > 2. The archival (copying of data to tiered storage) process
> > will
> > > > > have a
> > > > > > > > delayed start. The delayed start of archival could lead to
> > local
> > > > > build
> > > > > > up
> > > > > > > > of data which may lead to disk full.
> > > > > > > >
> > > > > > > > The disadvantage of adding this new API is that every
> provider
> > > will
> > > > > > have to
> > > > > > > > implement it, agreed. But I believe that this tradeoff is
> > > worthwhile
> > > > > > since
> > > > > > > > the default implementation could be the same as you
> mentioned,
> > > i.e.
> > > > > > keeping
> > > > > > > > cumulative in-memory count.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > > > > kamal.chandraprak...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Divij,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > > > > >
> > > > > > > > > Can you explain the rejected alternative-3?
> > > > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > > RemoteLogManager
> > > > > > > > > "*Cons*: Every time a broker starts-up, it will scan
> through
> > > all
> > > > > the
> > > > > > > > > segments in the remote tier to initialise the in-memory
> > value.
> > > This
> > > > > > would
> > > > > > > > > increase the broker start-up time."
> > > > > > > > >
> > > > > > > > > Keeping the source of truth to determine the
> remote-log-size
> > > in the
> > > > > > > > leader
> > > > > > > > > would be consistent across different implementations of the
> > > plugin.
> > > > > > The
> > > > > > > > > concern posted in the KIP is that we are calculating the
> > > > > > remote-log-size
> > > > > > > > on
> > > > > > > > > each iteration of the cleaner thread (say 5 mins). If we
> > > calculate
> > > > > > only
> > > > > > > > > once during broker startup or during the leadership
> > > reassignment,
> > > > > do
> > > > > > we
> > > > > > > > > still need the cache?
> > > > > > > > >
> > > > > > > > > The broker startup-time won't be affected by the remote log
> > > manager
> > > > > > > > > initialisation. The broker continue to start accepting the
> > new
> > > > > > > > > produce/fetch requests, while the RLM thread in the
> > background
> > > can
> > > > > > > > > determine the remote-log-size once and start
> copying/deleting
> > > the
> > > > > > > > segments.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Kamal
> > > > > > > > >
> > > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <
> > > > > divijvaidy...@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Satish / Jun
> > > > > > > > > >
> > > > > > > > > > Do you have any thoughts on this?
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Divij Vaidya
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Jun
> > > > > > > > > > >
> > > > > > > > > > > It has been a while since this KIP got some attention.
> > > While we
> > > > > > wait
> > > > > > > > > for
> > > > > > > > > > > Satish to chime in here, perhaps I can answer your
> > > question.
> > > > > > > > > > >
> > > > > > > > > > > > Could you explain how you exposed the log size in
> your
> > > > > KIP-405
> > > > > > > > > > > implementation?
> > > > > > > > > > >
> > > > > > > > > > > The APIs available in RLMM as per KIP405
> > > > > > > > > > > are, addRemoteLogSegmentMetadata(),
> > > > > > updateRemoteLogSegmentMetadata(),
> > > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > > putRemotePartitionDeleteMetadata(),
> > listRemoteLogSegments(),
> > > > > > > > > > onPartitionLeadershipChanges()
> > > > > > > > > > > and onStopPartitions(). None of these APIs allow us to
> > > expose
> > > > > > the log
> > > > > > > > > > size,
> > > > > > > > > > > hence, the only option that remains is to list all
> > segments
> > > > > using
> > > > > > > > > > > listRemoteLogSegments() and aggregate them every time
> we
> > > > > require
> > > > > > to
> > > > > > > > > > > calculate the size. Based on our prior discussion, this
> > > > > requires
> > > > > > > > > reading
> > > > > > > > > > > all segment metadata which won't work for non-local
> RLMM
> > > > > > > > > implementations.
> > > > > > > > > > > Satish's implementation also performs a full scan and
> > > > > calculates
> > > > > > the
> > > > > > > > > > > aggregate. see:
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Does this answer your question?
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao
> > > > > <j...@confluent.io.invalid
> > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Hi, Divij,
> > > > > > > > > > >>
> > > > > > > > > > >> Thanks for the explanation.
> > > > > > > > > > >>
> > > > > > > > > > >> Good question.
> > > > > > > > > > >>
> > > > > > > > > > >> Hi, Satish,
> > > > > > > > > > >>
> > > > > > > > > > >> Could you explain how you exposed the log size in your
> > > KIP-405
> > > > > > > > > > >> implementation?
> > > > > > > > > > >>
> > > > > > > > > > >> Thanks,
> > > > > > > > > > >>
> > > > > > > > > > >> Jun
> > > > > > > > > > >>
> > > > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > > > > > > divijvaidy...@gmail.com
> > > > > > > > > >
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hey Jun
> > > > > > > > > > >> >
> > > > > > > > > > >> > Yes, it is possible to maintain the log size in the
> > > cache
> > > > > (see
> > > > > > > > > > rejected
> > > > > > > > > > >> > alternative#3 in the KIP) but I did not understand
> how
> > > it is
> > > > > > > > > possible
> > > > > > > > > > to
> > > > > > > > > > >> > retrieve it without the new API. The log size could
> be
> > > > > > calculated
> > > > > > > > on
> > > > > > > > > > >> > startup by scanning through the segments (though I
> > would
> > > > > > disagree
> > > > > > > > > that
> > > > > > > > > > >> this
> > > > > > > > > > >> > is the right approach since scanning itself takes
> > order
> > > of
> > > > > > minutes
> > > > > > > > > and
> > > > > > > > > > >> > hence delay the start of archive process), and
> > > incrementally
> > > > > > > > > > maintained
> > > > > > > > > > >> > afterwards, even then, we would need an API in
> > > > > > > > > > RemoteLogMetadataManager
> > > > > > > > > > >> so
> > > > > > > > > > >> > that RLM could fetch the cached size!
> > > > > > > > > > >> >
> > > > > > > > > > >> > If we wish to cache the size without adding a new
> API,
> > > then
> > > > > we
> > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > >> > cache the size in RLM itself (instead of RLMM
> > > > > implementation)
> > > > > > and
> > > > > > > > > > >> > incrementally manage it. The downside of longer
> > archive
> > > time
> > > > > > at
> > > > > > > > > > startup
> > > > > > > > > > >> > (due to initial scale) still remains valid in this
> > > > > situation.
> > > > > > > > > > >> >
> > > > > > > > > > >> > --
> > > > > > > > > > >> > Divij Vaidya
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> > > > > > <j...@confluent.io.invalid
> > > > > > > > >
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > Hi, Divij,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks for the explanation.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > If there is in-memory cache, could we maintain the
> > log
> > > > > size
> > > > > > in
> > > > > > > > the
> > > > > > > > > > >> cache
> > > > > > > > > > >> > > with the existing API? For example, a replica
> could
> > > make a
> > > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition
> > > topicIdPartition)
> > > > > > call on
> > > > > > > > > > >> startup
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > get the remote segment size before the current
> > > > > leaderEpoch.
> > > > > > The
> > > > > > > > > > leader
> > > > > > > > > > >> > > could then maintain the size incrementally
> > > afterwards. On
> > > > > > leader
> > > > > > > > > > >> change,
> > > > > > > > > > >> > > other replicas can make a
> > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the
> > > size of
> > > > > > newly
> > > > > > > > > > >> > generated
> > > > > > > > > > >> > > segments.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Jun
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > > > > > > divijvaidy...@gmail.com
> > > > > > > > > > >> >
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > > Is the new method enough for doing size-based
> > > > > retention?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Yes. You are right in assuming that this API
> only
> > > > > > provides the
> > > > > > > > > > >> Remote
> > > > > > > > > > >> > > > storage size (for current epoch chain). We would
> > use
> > > > > this
> > > > > > API
> > > > > > > > > for
> > > > > > > > > > >> size
> > > > > > > > > > >> > > > based retention along with a value of
> > > > > > localOnlyLogSegmentSize
> > > > > > > > > > which
> > > > > > > > > > >> is
> > > > > > > > > > >> > > > computed as
> > > > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence,
> > > (total_log_size =
> > > > > > > > > > >> > > > remoteLogSizeBytes +
> > log.localOnlyLogSegmentSize). I
> > > > > have
> > > > > > > > > updated
> > > > > > > > > > >> the
> > > > > > > > > > >> > KIP
> > > > > > > > > > >> > > > with this information. You can also check an
> > example
> > > > > > > > > > implementation
> > > > > > > > > > >> at
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata
> > > will be
> > > > > > > > across
> > > > > > > > > > the
> > > > > > > > > > >> > > network
> > > > > > > > > > >> > > > or will there be some local in-memory cache?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > I would expect a disk-less implementation to
> > > maintain a
> > > > > > finite
> > > > > > > > > > >> > in-memory
> > > > > > > > > > >> > > > cache for segment metadata to optimize the
> number
> > of
> > > > > > network
> > > > > > > > > calls
> > > > > > > > > > >> made
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > fetch the data. In future, we can think about
> > > bringing
> > > > > > this
> > > > > > > > > finite
> > > > > > > > > > >> size
> > > > > > > > > > >> > > > cache into RLM itself but that's probably a
> > > conversation
> > > > > > for a
> > > > > > > > > > >> > different
> > > > > > > > > > >> > > > KIP. There are many other things we would like
> to
> > > do to
> > > > > > > > optimize
> > > > > > > > > > the
> > > > > > > > > > >> > > Tiered
> > > > > > > > > > >> > > > storage interface such as introducing a circular
> > > buffer
> > > > > /
> > > > > > > > > > streaming
> > > > > > > > > > >> > > > interface from RSM (so that we don't have to
> wait
> > to
> > > > > > fetch the
> > > > > > > > > > >> entire
> > > > > > > > > > >> > > > segment before starting to send records to the
> > > > > consumer),
> > > > > > > > > caching
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > segments fetched from RSM locally (I would
> assume
> > > all
> > > > > RSM
> > > > > > > > plugin
> > > > > > > > > > >> > > > implementations to do this, might as well add it
> > to
> > > RLM)
> > > > > > etc.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > --
> > > > > > > > > > >> > > > Divij Vaidya
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > > > > > > <j...@confluent.io.invalid
> > > > > > > > > > >
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Hi, Divij,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks for the reply.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Is the new method enough for doing size-based
> > > > > > retention? It
> > > > > > > > > > gives
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > total
> > > > > > > > > > >> > > > > size of the remote segments, but it seems that
> > we
> > > > > still
> > > > > > > > don't
> > > > > > > > > > know
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > exact total size for a log since there could
> be
> > > > > > overlapping
> > > > > > > > > > >> segments
> > > > > > > > > > >> > > > > between the remote and the local segments.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do
> you
> > > > > > imagine all
> > > > > > > > > > >> accesses
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > remote metadata will be across the network or
> > will
> > > > > > there be
> > > > > > > > > some
> > > > > > > > > > >> > local
> > > > > > > > > > >> > > > > in-memory cache?
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Jun
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > > > > > > >> divijvaidy...@gmail.com
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > > The method is needed for RLMM
> implementations
> > > which
> > > > > > fetch
> > > > > > > > > the
> > > > > > > > > > >> > > > information
> > > > > > > > > > >> > > > > > over the network and not for the disk based
> > > > > > > > implementations
> > > > > > > > > > >> (such
> > > > > > > > > > >> > as
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > default topic based RLMM).
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > I would argue that adding this API makes the
> > > > > interface
> > > > > > > > more
> > > > > > > > > > >> generic
> > > > > > > > > > >> > > > than
> > > > > > > > > > >> > > > > > what it is today. This is because, with the
> > > current
> > > > > > APIs
> > > > > > > > an
> > > > > > > > > > >> > > implementor
> > > > > > > > > > >> > > > > is
> > > > > > > > > > >> > > > > > restricted to use disk based RLMM solutions
> > only
> > > > > > (i.e. the
> > > > > > > > > > >> default
> > > > > > > > > > >> > > > > > solution) whereas if we add this new API, we
> > > unblock
> > > > > > usage
> > > > > > > > > of
> > > > > > > > > > >> > network
> > > > > > > > > > >> > > > > based
> > > > > > > > > > >> > > > > > RLMM implementations such as databases.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > > > > > > <j...@confluent.io.invalid
> > > > > > > > > > >> >
> > > > > > > > > > >> > > > wrote:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Point#2. My high level question is that is
> > > the new
> > > > > > > > method
> > > > > > > > > > >> needed
> > > > > > > > > > >> > > for
> > > > > > > > > > >> > > > > > every
> > > > > > > > > > >> > > > > > > implementation of remote storage or just
> > for a
> > > > > > specific
> > > > > > > > > > >> > > > implementation.
> > > > > > > > > > >> > > > > > The
> > > > > > > > > > >> > > > > > > issues that you pointed out exist for the
> > > default
> > > > > > > > > > >> implementation
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > > RLMM
> > > > > > > > > > >> > > > > > as
> > > > > > > > > > >> > > > > > > well and so far, the default
> implementation
> > > hasn't
> > > > > > > > found a
> > > > > > > > > > >> need
> > > > > > > > > > >> > > for a
> > > > > > > > > > >> > > > > > > similar new method. For public interface,
> > > ideally
> > > > > we
> > > > > > > > want
> > > > > > > > > to
> > > > > > > > > > >> make
> > > > > > > > > > >> > > it
> > > > > > > > > > >> > > > > more
> > > > > > > > > > >> > > > > > > general.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Jun
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij
> > Vaidya <
> > > > > > > > > > >> > > > divijvaidy...@gmail.com>
> > > > > > > > > > >> > > > > > > wrote:
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > Thank you Jun and Alex for your
> comments.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex
> > > mentioned,
> > > > > the
> > > > > > > > > > "derived
> > > > > > > > > > >> > > > metadata"
> > > > > > > > > > >> > > > > > can
> > > > > > > > > > >> > > > > > > > increase the size of cached metadata by
> a
> > > factor
> > > > > > of 10
> > > > > > > > > but
> > > > > > > > > > >> it
> > > > > > > > > > >> > > > should
> > > > > > > > > > >> > > > > be
> > > > > > > > > > >> > > > > > > ok
> > > > > > > > > > >> > > > > > > > to cache just the actual metadata. My
> > point
> > > > > about
> > > > > > size
> > > > > > > > > > >> being a
> > > > > > > > > > >> > > > > > limitation
> > > > > > > > > > >> > > > > > > > for using cache is not valid anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would
> still
> > > have
> > > > > to
> > > > > > > > fetch
> > > > > > > > > > the
> > > > > > > > > > >> > > > metadata
> > > > > > > > > > >> > > > > > > over
> > > > > > > > > > >> > > > > > > > the network to initiate the warm up of
> the
> > > cache
> > > > > > and
> > > > > > > > > > hence,
> > > > > > > > > > >> > > > increase
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > start time of the archival process.
> Please
> > > also
> > > > > > note
> > > > > > > > the
> > > > > > > > > > >> > > > > repercussions
> > > > > > > > > > >> > > > > > of
> > > > > > > > > > >> > > > > > > > the warm up scan that Alex mentioned in
> > this
> > > > > > thread as
> > > > > > > > > > part
> > > > > > > > > > >> of
> > > > > > > > > > >> > > > > #102.2.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying
> > > that.
> > > > > My
> > > > > > > > point
> > > > > > > > > > >> about
> > > > > > > > > > >> > > size
> > > > > > > > > > >> > > > > > being
> > > > > > > > > > >> > > > > > > a
> > > > > > > > > > >> > > > > > > > limitation for using cache is not valid
> > > anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly,
> you
> > > are
> > > > > > > > > suggesting
> > > > > > > > > > to
> > > > > > > > > > >> > > cache
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > total size at the leader and update it
> on
> > > > > > archival.
> > > > > > > > This
> > > > > > > > > > >> > wouldn't
> > > > > > > > > > >> > > > > work
> > > > > > > > > > >> > > > > > > for
> > > > > > > > > > >> > > > > > > > cases when the leader restarts where we
> > > would
> > > > > > have to
> > > > > > > > > > make a
> > > > > > > > > > >> > full
> > > > > > > > > > >> > > > > scan
> > > > > > > > > > >> > > > > > > > to update the total size entry on
> startup.
> > > We
> > > > > > expect
> > > > > > > > > users
> > > > > > > > > > >> to
> > > > > > > > > > >> > > store
> > > > > > > > > > >> > > > > > data
> > > > > > > > > > >> > > > > > > > over longer duration in remote storage
> > which
> > > > > > increases
> > > > > > > > > the
> > > > > > > > > > >> > > > likelihood
> > > > > > > > > > >> > > > > > of
> > > > > > > > > > >> > > > > > > > leader restarts / failovers.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current
> > > design
> > > > > > > > > accommodates
> > > > > > > > > > >> the
> > > > > > > > > > >> > > fact
> > > > > > > > > > >> > > > > > that
> > > > > > > > > > >> > > > > > > > data corruption could happen at the RLMM
> > > plugin
> > > > > > (we
> > > > > > > > > don't
> > > > > > > > > > >> have
> > > > > > > > > > >> > > > > checksum
> > > > > > > > > > >> > > > > > > as
> > > > > > > > > > >> > > > > > > > a field in metadata as part of KIP405).
> If
> > > data
> > > > > > > > > corruption
> > > > > > > > > > >> > > occurs,
> > > > > > > > > > >> > > > w/
> > > > > > > > > > >> > > > > > or
> > > > > > > > > > >> > > > > > > > w/o the cache, it would be a different
> > > problem
> > > > > to
> > > > > > > > > solve. I
> > > > > > > > > > >> > would
> > > > > > > > > > >> > > > like
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main
> > > concern
> > > > > > for
> > > > > > > > > using
> > > > > > > > > > >> the
> > > > > > > > > > >> > > cache
> > > > > > > > > > >> > > > > to
> > > > > > > > > > >> > > > > > > > fetch total size.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM
> Alexandre
> > > > > > Dupriez <
> > > > > > > > > > >> > > > > > > > alexandre.dupr...@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some
> > > comments
> > > > > > based
> > > > > > > > on
> > > > > > > > > > >> what I
> > > > > > > > > > >> > > > read
> > > > > > > > > > >> > > > > on
> > > > > > > > > > >> > > > > > > > > this thread so far - apologies for the
> > > repeats
> > > > > > and
> > > > > > > > the
> > > > > > > > > > >> late
> > > > > > > > > > >> > > > reply.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > If I understand correctly, one of the
> > main
> > > > > > elements
> > > > > > > > of
> > > > > > > > > > >> > > discussion
> > > > > > > > > > >> > > > > is
> > > > > > > > > > >> > > > > > > > > about caching in Kafka versus
> delegation
> > > of
> > > > > > > > providing
> > > > > > > > > > the
> > > > > > > > > > >> > > remote
> > > > > > > > > > >> > > > > size
> > > > > > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > A few comments:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 100. The size of the “derived
> metadata”
> > > which
> > > > > is
> > > > > > > > > managed
> > > > > > > > > > >> by
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > > plugin
> > > > > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed
> > be
> > > > > close
> > > > > > to 1
> > > > > > > > > kB
> > > > > > > > > > on
> > > > > > > > > > >> > > > average
> > > > > > > > > > >> > > > > > > > > depending on its own internal
> structure,
> > > e.g.
> > > > > > the
> > > > > > > > > > >> redundancy
> > > > > > > > > > >> > it
> > > > > > > > > > >> > > > > > > > > enforces (unfortunately resulting to
> > > > > > duplication),
> > > > > > > > > > >> additional
> > > > > > > > > > >> > > > > > > > > information such as checksums and
> > primary
> > > and
> > > > > > > > > secondary
> > > > > > > > > > >> > > indexable
> > > > > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is
> > > itself a
> > > > > > > > lighter
> > > > > > > > > > data
> > > > > > > > > > >> > > > > structure
> > > > > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead
> > of
> > > > > > caching
> > > > > > > > the
> > > > > > > > > > >> > “derived
> > > > > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could
> > be,
> > > > > which
> > > > > > > > should
> > > > > > > > > > >> > address
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > > concern regarding the memory occupancy
> > of
> > > the
> > > > > > cache.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand
> > why
> > > we
> > > > > > would
> > > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > >> > > cache
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > list of rlmMetadata to retain the
> remote
> > > size
> > > > > > of a
> > > > > > > > > > >> > > > topic-partition.
> > > > > > > > > > >> > > > > > > > > Since the leader of a topic-partition
> > is,
> > > in
> > > > > > > > > > >> non-degenerated
> > > > > > > > > > >> > > > cases,
> > > > > > > > > > >> > > > > > > > > the only actor which can mutate the
> > remote
> > > > > part
> > > > > > of
> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > topic-partition, hence its size, it
> > could
> > > in
> > > > > > theory
> > > > > > > > > only
> > > > > > > > > > >> > cache
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > > size of the remote log once it has
> > > calculated
> > > > > > it? In
> > > > > > > > > > which
> > > > > > > > > > >> > case
> > > > > > > > > > >> > > > > there
> > > > > > > > > > >> > > > > > > > > would not be any problem regarding the
> > > size of
> > > > > > the
> > > > > > > > > > caching
> > > > > > > > > > >> > > > > strategy.
> > > > > > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to
> > > consider
> > > > > > with
> > > > > > > > > > >> caching:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching
> > > > > strategy
> > > > > > > > > assumes
> > > > > > > > > > no
> > > > > > > > > > >> > > > mutation
> > > > > > > > > > >> > > > > > > > > outside the lifetime of a leader.
> While
> > > this
> > > > > is
> > > > > > true
> > > > > > > > > in
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > normal
> > > > > > > > > > >> > > > > > > > > course of operation, there could be
> > > accidental
> > > > > > > > > mutation
> > > > > > > > > > >> > outside
> > > > > > > > > > >> > > > of
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > > leader and a loss of consistency
> between
> > > the
> > > > > > cached
> > > > > > > > > > state
> > > > > > > > > > >> and
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > > > > actual remote representation of the
> log.
> > > E.g.
> > > > > > > > > > split-brain
> > > > > > > > > > >> > > > > scenarios,
> > > > > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external
> > > systems
> > > > > > with
> > > > > > > > > > >> mutating
> > > > > > > > > > >> > > > access
> > > > > > > > > > >> > > > > on
> > > > > > > > > > >> > > > > > > > > the derived metadata. In the worst
> > case, a
> > > > > drift
> > > > > > > > > between
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > cached
> > > > > > > > > > >> > > > > > > > > size and the actual size could lead to
> > > > > > over-deleting
> > > > > > > > > > >> remote
> > > > > > > > > > >> > > data
> > > > > > > > > > >> > > > > > which
> > > > > > > > > > >> > > > > > > > > is a durability risk.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > The alternative you propose, by making
> > the
> > > > > > plugin
> > > > > > > > the
> > > > > > > > > > >> source
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > > truth
> > > > > > > > > > >> > > > > > > > > w.r.t. to the size of the remote log,
> > can
> > > make
> > > > > > it
> > > > > > > > > easier
> > > > > > > > > > >> to
> > > > > > > > > > >> > > avoid
> > > > > > > > > > >> > > > > > > > > inconsistencies between plugin-managed
> > > > > metadata
> > > > > > and
> > > > > > > > > the
> > > > > > > > > > >> > remote
> > > > > > > > > > >> > > > log
> > > > > > > > > > >> > > > > > > > > from the perspective of Kafka. On the
> > > other
> > > > > > hand,
> > > > > > > > > plugin
> > > > > > > > > > >> > > vendors
> > > > > > > > > > >> > > > > > would
> > > > > > > > > > >> > > > > > > > > have to implement it with the expected
> > > > > > efficiency to
> > > > > > > > > > have
> > > > > > > > > > >> it
> > > > > > > > > > >> > > > yield
> > > > > > > > > > >> > > > > > > > > benefits.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching
> > > strategy
> > > > > in
> > > > > > > > Kafka
> > > > > > > > > > >> would
> > > > > > > > > > >> > > > still
> > > > > > > > > > >> > > > > > > > > require one iteration over the list of
> > > > > > rlmMetadata
> > > > > > > > > when
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > leadership
> > > > > > > > > > >> > > > > > > > > of a topic-partition is assigned to a
> > > broker,
> > > > > > while
> > > > > > > > > the
> > > > > > > > > > >> > plugin
> > > > > > > > > > >> > > > can
> > > > > > > > > > >> > > > > > > > > offer alternative constant-time
> > > approaches.
> > > > > This
> > > > > > > > > > >> calculation
> > > > > > > > > > >> > > > cannot
> > > > > > > > > > >> > > > > > be
> > > > > > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would
> > be
> > > > > > performed
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > >> > > > > > background.
> > > > > > > > > > >> > > > > > > > > In case of bulk leadership migration,
> > > listing
> > > > > > the
> > > > > > > > > > >> rlmMetadata
> > > > > > > > > > >> > > > could
> > > > > > > > > > >> > > > > > a)
> > > > > > > > > > >> > > > > > > > > result in request bursts to any
> backend
> > > system
> > > > > > the
> > > > > > > > > > plugin
> > > > > > > > > > >> may
> > > > > > > > > > >> > > use
> > > > > > > > > > >> > > > > > > > > [which shouldn’t be a problem for
> > > > > > high-throughput
> > > > > > > > data
> > > > > > > > > > >> stores
> > > > > > > > > > >> > > but
> > > > > > > > > > >> > > > > > > > > could have cost implications] b)
> > increase
> > > > > > > > utilisation
> > > > > > > > > > >> > timespan
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > RLM threads for these calculations
> > > potentially
> > > > > > > > leading
> > > > > > > > > > to
> > > > > > > > > > >> > > > transient
> > > > > > > > > > >> > > > > > > > > starvation of tasks queued for,
> > typically,
> > > > > > > > offloading
> > > > > > > > > > >> > > operations
> > > > > > > > > > >> > > > c)
> > > > > > > > > > >> > > > > > > > > could have a non-marginal CPU
> footprint
> > on
> > > > > > hardware
> > > > > > > > > with
> > > > > > > > > > >> > strict
> > > > > > > > > > >> > > > > > > > > resource constraints. All these
> elements
> > > could
> > > > > > have
> > > > > > > > an
> > > > > > > > > > >> impact
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > some
> > > > > > > > > > >> > > > > > > > > degree depending on the operational
> > > > > environment.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > From a design perspective, one
> question
> > is
> > > > > > where we
> > > > > > > > > want
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > source
> > > > > > > > > > >> > > > > > of
> > > > > > > > > > >> > > > > > > > > truth w.r.t. remote log size to be
> > during
> > > the
> > > > > > > > lifetime
> > > > > > > > > > of
> > > > > > > > > > >> a
> > > > > > > > > > >> > > > leader.
> > > > > > > > > > >> > > > > > > > > The responsibility of maintaining a
> > > consistent
> > > > > > > > > > >> representation
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > remote log is shared by Kafka and the
> > > plugin.
> > > > > > Which
> > > > > > > > > > >> system is
> > > > > > > > > > >> > > > best
> > > > > > > > > > >> > > > > > > > > placed to maintain such a state while
> > > > > providing
> > > > > > the
> > > > > > > > > > >> highest
> > > > > > > > > > >> > > > > > > > > consistency guarantees is something
> both
> > > Kafka
> > > > > > and
> > > > > > > > > > plugin
> > > > > > > > > > >> > > > designers
> > > > > > > > > > >> > > > > > > > > could help understand better.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > > > > >> > > > > > > > > Alexandre
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > > > > > > > > > >> > <j...@confluent.io.invalid
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > a
> > > > > > > > > > >> > > > > > > > écrit :
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote
> > segment
> > > > > > metadata
> > > > > > > > > > really
> > > > > > > > > > >> > 1KB?
> > > > > > > > > > >> > > > > What's
> > > > > > > > > > >> > > > > > > > > listed
> > > > > > > > > > >> > > > > > > > > > in the public interface is probably
> > well
> > > > > > below 100
> > > > > > > > > > >> bytes.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming
> > that
> > > each
> > > > > > > > broker
> > > > > > > > > > only
> > > > > > > > > > >> > > caches
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > remote
> > > > > > > > > > >> > > > > > > > > > segment metadata in memory. An
> > > alternative
> > > > > > > > approach
> > > > > > > > > is
> > > > > > > > > > >> to
> > > > > > > > > > >> > > cache
> > > > > > > > > > >> > > > > > them
> > > > > > > > > > >> > > > > > > in
> > > > > > > > > > >> > > > > > > > > > both memory and local disk. That
> way,
> > on
> > > > > > broker
> > > > > > > > > > restart,
> > > > > > > > > > >> > you
> > > > > > > > > > >> > > > just
> > > > > > > > > > >> > > > > > > need
> > > > > > > > > > >> > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > fetch the new remote segments'
> > metadata
> > > > > using
> > > > > > the
> > > > > > > > > > >> > > > > > > > > >
> listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > topicIdPartition,
> > > > > > > > > > >> > int
> > > > > > > > > > >> > > > > > > > leaderEpoch)
> > > > > > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation
> > > and it
> > > > > > sounds
> > > > > > > > > > good.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM
> Divij
> > > > > Vaidya <
> > > > > > > > > > >> > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > >> > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > There are three points that I
> would
> > > like
> > > > > to
> > > > > > > > > present
> > > > > > > > > > >> here:
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > 1. We would require a large cache
> > > size to
> > > > > > > > > > efficiently
> > > > > > > > > > >> > cache
> > > > > > > > > > >> > > > all
> > > > > > > > > > >> > > > > > > > segment
> > > > > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at
> > > broker
> > > > > > startup
> > > > > > > > > to
> > > > > > > > > > >> > > populate
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > cache
> > > > > > > > > > >> > > > > > > > > will
> > > > > > > > > > >> > > > > > > > > > > be slow and will impact the
> archival
> > > > > > process.
> > > > > > > > > > >> > > > > > > > > > > 3. There is no other use case
> where
> > a
> > > full
> > > > > > scan
> > > > > > > > of
> > > > > > > > > > >> > segment
> > > > > > > > > > >> > > > > > metadata
> > > > > > > > > > >> > > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > required.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1.
> Here's
> > > my
> > > > > > estimate
> > > > > > > > > for
> > > > > > > > > > >> the
> > > > > > > > > > >> > > size
> > > > > > > > > > >> > > > > of
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > > >> > > > > > > > > > > Average size of segment metadata =
> > > 1KB.
> > > > > This
> > > > > > > > could
> > > > > > > > > > be
> > > > > > > > > > >> > more
> > > > > > > > > > >> > > if
> > > > > > > > > > >> > > > > we
> > > > > > > > > > >> > > > > > > have
> > > > > > > > > > >> > > > > > > > > > > frequent leader failover with a
> > large
> > > > > > number of
> > > > > > > > > > leader
> > > > > > > > > > >> > > epochs
> > > > > > > > > > >> > > > > > being
> > > > > > > > > > >> > > > > > > > > stored
> > > > > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will
> > > prefer to
> > > > > > > > reduce
> > > > > > > > > > the
> > > > > > > > > > >> > > segment
> > > > > > > > > > >> > > > > > size
> > > > > > > > > > >> > > > > > > > > from the
> > > > > > > > > > >> > > > > > > > > > > default value of 1GB to ensure
> > timely
> > > > > > archival
> > > > > > > > of
> > > > > > > > > > data
> > > > > > > > > > >> > > since
> > > > > > > > > > >> > > > > data
> > > > > > > > > > >> > > > > > > > from
> > > > > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg.
> > > segment
> > > > > > > > metadata
> > > > > > > > > > >> size =
> > > > > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound
> > > like a
> > > > > > large
> > > > > > > > > > number
> > > > > > > > > > >> for
> > > > > > > > > > >> > > > > larger
> > > > > > > > > > >> > > > > > > > > machines,
> > > > > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > > > > additional
> > > > > > > > cache
> > > > > > > > > > and
> > > > > > > > > > >> > > makes
> > > > > > > > > > >> > > > > use
> > > > > > > > > > >> > > > > > > > cases
> > > > > > > > > > >> > > > > > > > > with
> > > > > > > > > > >> > > > > > > > > > > large data retention with low
> > > throughout
> > > > > > > > expensive
> > > > > > > > > > >> (where
> > > > > > > > > > >> > > > such
> > > > > > > > > > >> > > > > > use
> > > > > > > > > > >> > > > > > > > case
> > > > > > > > > > >> > > > > > > > > > > would could use smaller machines).
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > > > > >> > > > > > > > > > > Even if we say that all segment
> > > metadata
> > > > > > can fit
> > > > > > > > > > into
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > cache,
> > > > > > > > > > >> > > > > > we
> > > > > > > > > > >> > > > > > > > > will
> > > > > > > > > > >> > > > > > > > > > > need to populate the cache on
> broker
> > > > > > startup. It
> > > > > > > > > > would
> > > > > > > > > > >> > not
> > > > > > > > > > >> > > be
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > critical patch of broker startup
> and
> > > hence
> > > > > > won't
> > > > > > > > > > >> impact
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > > startup
> > > > > > > > > > >> > > > > > > > > time.
> > > > > > > > > > >> > > > > > > > > > > But it will impact the time when
> we
> > > could
> > > > > > start
> > > > > > > > > the
> > > > > > > > > > >> > > archival
> > > > > > > > > > >> > > > > > > process
> > > > > > > > > > >> > > > > > > > > since
> > > > > > > > > > >> > > > > > > > > > > the RLM thread pool will be
> blocked
> > > on the
> > > > > > first
> > > > > > > > > > call
> > > > > > > > > > >> to
> > > > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> > > metadata
> > > > > > for
> > > > > > > > 1MM
> > > > > > > > > > >> > segments
> > > > > > > > > > >> > > > > > > (computed
> > > > > > > > > > >> > > > > > > > > above)
> > > > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the
> > network
> > > > > from
> > > > > > a
> > > > > > > > RLMM
> > > > > > > > > > >> such
> > > > > > > > > > >> > as
> > > > > > > > > > >> > > a
> > > > > > > > > > >> > > > > > remote
> > > > > > > > > > >> > > > > > > > > > > database would be in the order of
> > > minutes
> > > > > > > > > (depending
> > > > > > > > > > >> on
> > > > > > > > > > >> > how
> > > > > > > > > > >> > > > > > > efficient
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > scan is with the RLMM
> > implementation).
> > > > > > > > Although, I
> > > > > > > > > > >> would
> > > > > > > > > > >> > > > > concede
> > > > > > > > > > >> > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a
> few
> > > > > > minutes is
> > > > > > > > > > >> perhaps
> > > > > > > > > > >> > OK
> > > > > > > > > > >> > > > but
> > > > > > > > > > >> > > > > if
> > > > > > > > > > >> > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > introduce the new API proposed in
> > the
> > > KIP,
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > >> have
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > > > > > > > deterministic startup time for
> RLM.
> > > Adding
> > > > > > the
> > > > > > > > API
> > > > > > > > > > >> comes
> > > > > > > > > > >> > > at a
> > > > > > > > > > >> > > > > low
> > > > > > > > > > >> > > > > > > > cost
> > > > > > > > > > >> > > > > > > > > and
> > > > > > > > > > >> > > > > > > > > > > I believe the trade off is worth
> it.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > > > > >> > > > > > > > > > > We can use
> > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > > > topicIdPartition,
> > > > > > > > > > >> > > > > > > > int
> > > > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the
> > segments
> > > > > > eligible
> > > > > > > > > for
> > > > > > > > > > >> > > deletion
> > > > > > > > > > >> > > > > > (based
> > > > > > > > > > >> > > > > > > > on
> > > > > > > > > > >> > > > > > > > > size
> > > > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s)
> > > belong to
> > > > > > the
> > > > > > > > > > current
> > > > > > > > > > >> > > leader
> > > > > > > > > > >> > > > > > epoch
> > > > > > > > > > >> > > > > > > > > chain.
> > > > > > > > > > >> > > > > > > > > > > I understand that it may lead to
> > > segments
> > > > > > > > > belonging
> > > > > > > > > > to
> > > > > > > > > > >> > > other
> > > > > > > > > > >> > > > > > epoch
> > > > > > > > > > >> > > > > > > > > lineage
> > > > > > > > > > >> > > > > > > > > > > not getting deleted and would
> > require
> > > a
> > > > > > separate
> > > > > > > > > > >> > mechanism
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > > delete
> > > > > > > > > > >> > > > > > > > > them.
> > > > > > > > > > >> > > > > > > > > > > The separate mechanism would
> anyways
> > > be
> > > > > > required
> > > > > > > > > to
> > > > > > > > > > >> > delete
> > > > > > > > > > >> > > > > these
> > > > > > > > > > >> > > > > > > > > "leaked"
> > > > > > > > > > >> > > > > > > > > > > segments as there are other cases
> > > which
> > > > > > could
> > > > > > > > lead
> > > > > > > > > > to
> > > > > > > > > > >> > leaks
> > > > > > > > > > >> > > > > such
> > > > > > > > > > >> > > > > > as
> > > > > > > > > > >> > > > > > > > > network
> > > > > > > > > > >> > > > > > > > > > > problems with RSM mid way writing
> > > through.
> > > > > > > > segment
> > > > > > > > > > >> etc.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Thank you for the replies so far.
> > They
> > > > > have
> > > > > > made
> > > > > > > > > me
> > > > > > > > > > >> > > re-think
> > > > > > > > > > >> > > > my
> > > > > > > > > > >> > > > > > > > > assumptions
> > > > > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > > > > constructive for
> > > > > > > > > me.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM
> Jun
> > > Rao
> > > > > > > > > > >> > > > > > <j...@confluent.io.invalid
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka
> > > could
> > > > > be
> > > > > > kept
> > > > > > > > > > >> longer
> > > > > > > > > > >> > > with
> > > > > > > > > > >> > > > > > > KIP-405.
> > > > > > > > > > >> > > > > > > > > How
> > > > > > > > > > >> > > > > > > > > > > > much data do you envision to
> have
> > > per
> > > > > > broker?
> > > > > > > > > For
> > > > > > > > > > >> 100TB
> > > > > > > > > > >> > > > data
> > > > > > > > > > >> > > > > > per
> > > > > > > > > > >> > > > > > > > > broker,
> > > > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment
> > > metadata of
> > > > > > 100
> > > > > > > > > > bytes,
> > > > > > > > > > >> it
> > > > > > > > > > >> > > > > requires
> > > > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which
> should
> > > fit
> > > > > in
> > > > > > > > > memory.
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > > > > >> > listRemoteLogSegments()
> > > > > > > > > > >> > > > > > methods.
> > > > > > > > > > >> > > > > > > > > The one
> > > > > > > > > > >> > > > > > > > > > > > you listed
> > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > > > > >> > > > > > > > > int
> > > > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> > > offset
> > > > > > order.
> > > > > > > > > > >> However,
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > other
> > > > > > > > > > >> > > > > > > > > > > > one
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > topicIdPartition)
> > > > > > > > > > >> > > > > > > > doesn't
> > > > > > > > > > >> > > > > > > > > > > > specify the return order. I
> assume
> > > that
> > > > > > you
> > > > > > > > need
> > > > > > > > > > the
> > > > > > > > > > >> > > latter
> > > > > > > > > > >> > > > > to
> > > > > > > > > > >> > > > > > > > > calculate
> > > > > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM
> > > Divij
> > > > > > Vaidya
> > > > > > > > <
> > > > > > > > > > >> > > > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *"the default implementation
> of
> > > RLMM
> > > > > > does
> > > > > > > > > local
> > > > > > > > > > >> > > caching,
> > > > > > > > > > >> > > > > > > right?"*
> > > > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default
> > > implementation
> > > > > of
> > > > > > RLMM
> > > > > > > > > > does
> > > > > > > > > > >> > > indeed
> > > > > > > > > > >> > > > > > cache
> > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > segment
> > > > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it
> won't
> > > work
> > > > > > for use
> > > > > > > > > > cases
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > number
> > > > > > > > > > >> > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > segments in remote storage is
> > > large
> > > > > > enough
> > > > > > > > to
> > > > > > > > > > >> exceed
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > size
> > > > > > > > > > >> > > > > > > of
> > > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > > >> > > > > > > > > > > > As
> > > > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will
> > > implement the
> > > > > > new
> > > > > > > > > > >> proposed
> > > > > > > > > > >> > API
> > > > > > > > > > >> > > > in
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > > default
> > > > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > > > > underlying
> > > > > > > > > > >> > > implementation
> > > > > > > > > > >> > > > > will
> > > > > > > > > > >> > > > > > > > > still be
> > > > > > > > > > >> > > > > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > > > scan. I will pick up
> optimizing
> > > that
> > > > > in
> > > > > > a
> > > > > > > > > > separate
> > > > > > > > > > >> > PR.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment
> > > metadata
> > > > > in
> > > > > > the
> > > > > > > > > > >> brokers
> > > > > > > > > > >> > > > without
> > > > > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > > > > >> > > > > > > > > > > > you
> > > > > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > > > >> > > > > > > > > > > > > Please correct me if I am
> wrong
> > > here
> > > > > > but we
> > > > > > > > > > cache
> > > > > > > > > > >> > > > metadata
> > > > > > > > > > >> > > > > > for
> > > > > > > > > > >> > > > > > > > > segments
> > > > > > > > > > >> > > > > > > > > > > > > "residing in local storage".
> The
> > > size
> > > > > > of the
> > > > > > > > > > >> current
> > > > > > > > > > >> > > > cache
> > > > > > > > > > >> > > > > > > works
> > > > > > > > > > >> > > > > > > > > fine
> > > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > the scale of the number of
> > > segments
> > > > > > that we
> > > > > > > > > > >> expect to
> > > > > > > > > > >> > > > store
> > > > > > > > > > >> > > > > > in
> > > > > > > > > > >> > > > > > > > > local
> > > > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that
> > cache
> > > > > will
> > > > > > > > > continue
> > > > > > > > > > >> to
> > > > > > > > > > >> > > store
> > > > > > > > > > >> > > > > > > > metadata
> > > > > > > > > > >> > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > segments which are residing in
> > > local
> > > > > > storage
> > > > > > > > > and
> > > > > > > > > > >> > hence,
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > > > don't
> > > > > > > > > > >> > > > > > > > > need
> > > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > change that. For segments
> which
> > > have
> > > > > > been
> > > > > > > > > > >> offloaded
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > > > remote
> > > > > > > > > > >> > > > > > > > > storage,
> > > > > > > > > > >> > > > > > > > > > > it
> > > > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that
> > the
> > > > > scale
> > > > > > of
> > > > > > > > > data
> > > > > > > > > > >> > stored
> > > > > > > > > > >> > > in
> > > > > > > > > > >> > > > > > RLMM
> > > > > > > > > > >> > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > different
> > > > > > > > > > >> > > > > > > > > > > > > from local cache because the
> > > number of
> > > > > > > > > segments
> > > > > > > > > > is
> > > > > > > > > > >> > > > expected
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > > much
> > > > > > > > > > >> > > > > > > > > > > > > larger than what current
> > > > > implementation
> > > > > > > > stores
> > > > > > > > > > in
> > > > > > > > > > >> > local
> > > > > > > > > > >> > > > > > > storage.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > does
> > > > > > > > > > >> > > > > > > > > specify
> > > > > > > > > > >> > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the
> > segments
> > > > > > sorted by
> > > > > > > > > > first
> > > > > > > > > > >> > > offset
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > > > ascending
> > > > > > > > > > >> > > > > > > > > > > > > order. I am copying the API
> docs
> > > for
> > > > > > KIP-405
> > > > > > > > > > here
> > > > > > > > > > >> for
> > > > > > > > > > >> > > > your
> > > > > > > > > > >> > > > > > > > > reference
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote
> log
> > > > > segment
> > > > > > > > > > metadata,
> > > > > > > > > > >> > > sorted
> > > > > > > > > > >> > > > by
> > > > > > > > > > >> > > > > > > > {@link
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > > > > >> inascending
> > > > > > > > > > >> > > order
> > > > > > > > > > >> > > > > > which
> > > > > > > > > > >> > > > > > > > > > > contains
> > > > > > > > > > >> > > > > > > > > > > > > the given leader epoch. This
> is
> > > used
> > > > > by
> > > > > > > > remote
> > > > > > > > > > log
> > > > > > > > > > >> > > > > retention
> > > > > > > > > > >> > > > > > > > > management
> > > > > > > > > > >> > > > > > > > > > > > > subsystemto fetch the segment
> > > metadata
> > > > > > for a
> > > > > > > > > > given
> > > > > > > > > > >> > > leader
> > > > > > > > > > >> > > > > > > > > epoch.@param
> > > > > > > > > > >> > > > > > > > > > > > > topicIdPartition topic
> > > partition@param
> > > > > > > > > > >> leaderEpoch
> > > > > > > > > > >> > > > > > leader
> > > > > > > > > > >> > > > > > > > > > > > > epoch@return
> > > > > > > > > > >> > > > > > > > > > > > > Iterator of remote segments,
> > > sorted by
> > > > > > start
> > > > > > > > > > >> offset
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > > > > ascending
> > > > > > > > > > >> > > > > > > > > > > order. *
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> > > optimize
> > > > > > the
> > > > > > > > > > >> efficiency
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > size
> > > > > > > > > > >> > > > > > > > based
> > > > > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> > > KIP-405
> > > > > > does
> > > > > > > > not
> > > > > > > > > > >> > > introduce
> > > > > > > > > > >> > > > a
> > > > > > > > > > >> > > > > > new
> > > > > > > > > > >> > > > > > > > > config
> > > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > periodically checking remote
> > > similar
> > > > > to
> > > > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > > > > storage.
> > > > > > > > Hence,
> > > > > > > > > > the
> > > > > > > > > > >> > > metric
> > > > > > > > > > >> > > > > > will
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > > > > updated
> > > > > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> > > retention
> > > > > > check
> > > > > > > > > for
> > > > > > > > > > >> > remote
> > > > > > > > > > >> > > > tier
> > > > > > > > > > >> > > > > > > which
> > > > > > > > > > >> > > > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > > pending implementation today.
> We
> > > can
> > > > > > perhaps
> > > > > > > > > > come
> > > > > > > > > > >> > back
> > > > > > > > > > >> > > > and
> > > > > > > > > > >> > > > > > > update
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > metric description after the
> > > > > > implementation
> > > > > > > > of
> > > > > > > > > > log
> > > > > > > > > > >> > > > > retention
> > > > > > > > > > >> > > > > > > > check
> > > > > > > > > > >> > > > > > > > > in
> > > > > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16
> AM
> > > Luke
> > > > > > Chen <
> > > > > > > > > > >> > > > > show...@gmail.com
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > One more question about the
> > > metric:
> > > > > > > > > > >> > > > > > > > > > > > > > I think the metric will be
> > > updated
> > > > > > when
> > > > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > > > > retention
> > > > > > > > check
> > > > > > > > > > >> (that
> > > > > > > > > > >> > > is,
> > > > > > > > > > >> > > > > > > > > > > > > >
> > log.retention.check.interval.ms
> > > )
> > > > > > > > > > >> > > > > > > > > > > > > > (2) When user explicitly
> call
> > > > > > > > > getRemoteLogSize
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note
> in
> > > metric
> > > > > > > > > > >> description,
> > > > > > > > > > >> > > > > > otherwise,
> > > > > > > > > > >> > > > > > > > when
> > > > > > > > > > >> > > > > > > > > > > user
> > > > > > > > > > >> > > > > > > > > > > > > got,
> > > > > > > > > > >> > > > > > > > > > > > > > let's say 0 of
> > > RemoteLogSizeBytes,
> > > > > > will be
> > > > > > > > > > >> > surprised.
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55
> > AM
> > > Jun
> > > > > > Rao
> > > > > > > > > > >> > > > > > > > <j...@confluent.io.invalid
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Thanks for the
> explanation.
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> > > implementation
> > > > > > of
> > > > > > > > RLMM
> > > > > > > > > > >> does
> > > > > > > > > > >> > > local
> > > > > > > > > > >> > > > > > > > caching,
> > > > > > > > > > >> > > > > > > > > > > right?
> > > > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache
> all
> > > > > segment
> > > > > > > > > > metadata
> > > > > > > > > > >> in
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > > brokers
> > > > > > > > > > >> > > > > > > > > > > without
> > > > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need
> > to
> > > > > change
> > > > > > > > that?
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your
> explanation
> > > makes
> > > > > > > > sense.
> > > > > > > > > > >> > However,
> > > > > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > > > > >> > > > > >
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > doesn't
> > > > > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > > > > iterator.
> > > > > > Do
> > > > > > > > you
> > > > > > > > > > >> intend
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > change
> > > > > > > > > > >> > > > > > > > > that?
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at
> 3:31
> > AM
> > > > > Divij
> > > > > > > > > Vaidya
> > > > > > > > > > <
> > > > > > > > > > >> > > > > > > > > > > divijvaidy...@gmail.com
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your
> > comments.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor
> > could
> > > > > ensure
> > > > > > > > that
> > > > > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > > > > pragmatically,
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > >> > > > > difficult
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > > ensure
> > > > > > > > > > >> > > > > > > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments()
> is
> > > fast.
> > > > > > This
> > > > > > > > is
> > > > > > > > > > >> > because
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > > > possibility
> > > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > > > > > > large number of segments
> > > (much
> > > > > > larger
> > > > > > > > > than
> > > > > > > > > > >> what
> > > > > > > > > > >> > > > Kafka
> > > > > > > > > > >> > > > > > > > > currently
> > > > > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > > > > >> > > > > > > > > > > > > > > > with local storage
> today)
> > > would
> > > > > > make
> > > > > > > > it
> > > > > > > > > > >> > > infeasible
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > > > adopt
> > > > > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > > > > >> > > > > > > > > > > > > > > > as local caching to
> > improve
> > > the
> > > > > > > > > > performance
> > > > > > > > > > >> of
> > > > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > > > > >> > > > > > > > > > > > > > > > from caching (which
> won't
> > > work
> > > > > > due to
> > > > > > > > > size
> > > > > > > > > > >> > > > > > limitations) I
> > > > > > > > > > >> > > > > > > > > can't
> > > > > > > > > > >> > > > > > > > > > > > think
> > > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > > > other strategies which
> may
> > > > > > eliminate
> > > > > > > > the
> > > > > > > > > > >> need
> > > > > > > > > > >> > for
> > > > > > > > > > >> > > > IO
> > > > > > > > > > >> > > > > > > > > > > > > > > > operations proportional
> to
> > > the
> > > > > > number
> > > > > > > > of
> > > > > > > > > > >> total
> > > > > > > > > > >> > > > > > segments.
> > > > > > > > > > >> > > > > > > > > Please
> > > > > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > > > > >> > > > > > > > > > > > > > > > you have something in
> > mind.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size
> exceeds
> > > the
> > > > > > > > retention
> > > > > > > > > > >> size,
> > > > > > > > > > >> > we
> > > > > > > > > > >> > > > need
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > > > > determine
> > > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to
> > > delete to
> > > > > > bring
> > > > > > > > > the
> > > > > > > > > > >> size
> > > > > > > > > > >> > > > within
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > > > > retention
> > > > > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > > > > >> listRemoteLogSegments() to
> > > > > > > > > > >> > > > > > determine
> > > > > > > > > > >> > > > > > > > > which
> > > > > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But
> > > there is
> > > > > a
> > > > > > > > > > difference
> > > > > > > > > > >> > with
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > use
> > > > > > > > > > >> > > > > > > > > case we
> > > > > > > > > > >> > > > > > > > > > > > are
> > > > > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with
> > this
> > > > > KIP.
> > > > > > To
> > > > > > > > > > >> determine
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > subset
> > > > > > > > > > >> > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > segments
> > > > > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we
> only
> > > read
> > > > > > > > metadata
> > > > > > > > > > for
> > > > > > > > > > >> > > > segments
> > > > > > > > > > >> > > > > > > which
> > > > > > > > > > >> > > > > > > > > would
> > > > > > > > > > >> > > > > > > > > > > be
> > > > > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > > > > >> > > > > > > > > > > > > > > > via the
> > > listRemoteLogSegments().
> > > > > > But
> > > > > > > > to
> > > > > > > > > > >> > determine
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > > > > >> > > > > > > > > > > > > > > > is required every time
> > > retention
> > > > > > logic
> > > > > > > > > > >> based on
> > > > > > > > > > >> > > > size
> > > > > > > > > > >> > > > > > > > > executes, we
> > > > > > > > > > >> > > > > > > > > > > > > read
> > > > > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the
> > > segments
> > > > > in
> > > > > > > > remote
> > > > > > > > > > >> > storage.
> > > > > > > > > > >> > > > > > Hence,
> > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > number
> > > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > > > > calculating
> > > > > > > > > > >> totalLogSize
> > > > > > > > > > >> > > vs.
> > > > > > > > > > >> > > > > when
> > > > > > > > > > >> > > > > > > we
> > > > > > > > > > >> > > > > > > > > are
> > > > > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > > > > >> > > > > > > > > > > > > > > > the subset of segments
> to
> > > > > delete.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about
> > > time-based
> > > > > > > > retention?
> > > > > > > > > > To
> > > > > > > > > > >> > make
> > > > > > > > > > >> > > > that
> > > > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > > > >> > > > > > > > > > > do
> > > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > > > > interface
> > > > > > > > > > >> changes?"*No.
> > > > > > > > > > >> > > > Note
> > > > > > > > > > >> > > > > > that
> > > > > > > > > > >> > > > > > > > > time
> > > > > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > > > > >> > > > > > > > > > > > > > > > to determine the
> segments
> > > for
> > > > > > > > retention
> > > > > > > > > is
> > > > > > > > > > >> > > > different
> > > > > > > > > > >> > > > > > for
> > > > > > > > > > >> > > > > > > > time
> > > > > > > > > > >> > > > > > > > > > > based
> > > > > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > > > > >> > > > > > > > > > > > > > > > size based. For time
> > based,
> > > the
> > > > > > time
> > > > > > > > > > >> complexity
> > > > > > > > > > >> > > is
> > > > > > > > > > >> > > > a
> > > > > > > > > > >> > > > > > > > > function of
> > > > > > > > > > >> > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > > > > >> > > > > > > > > > > > > > > > of segments which are
> > > "eligible
> > > > > > for
> > > > > > > > > > >> deletion"
> > > > > > > > > > >> > > > (since
> > > > > > > > > > >> > > > > we
> > > > > > > > > > >> > > > > > > > only
> > > > > > > > > > >> > > > > > > > > read
> > > > > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > > > > >> > > > > > > > > > > > > > > > for segments which would
> > be
> > > > > > deleted)
> > > > > > > > > > >> whereas in
> > > > > > > > > > >> > > > size
> > > > > > > > > > >> > > > > > > based
> > > > > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > > time complexity is a
> > > function of
> > > > > > "all
> > > > > > > > > > >> segments"
> > > > > > > > > > >> > > > > > available
> > > > > > > > > > >> > > > > > > > in
> > > > > > > > > > >> > > > > > > > > > > remote
> > > > > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > > > > >> > > > > > > > > > > > > > > > (metadata of all
> segments
> > > needs
> > > > > > to be
> > > > > > > > > read
> > > > > > > > > > >> to
> > > > > > > > > > >> > > > > calculate
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > total
> > > > > > > > > > >> > > > > > > > > > > > > > size).
> > > > > > > > > > >> > > > > > > > > > > > > > > As
> > > > > > > > > > >> > > > > > > > > > > > > > > > you may observe, this
> KIP
> > > will
> > > > > > bring
> > > > > > > > the
> > > > > > > > > > >> time
> > > > > > > > > > >> > > > > > complexity
> > > > > > > > > > >> > > > > > > > for
> > > > > > > > > > >> > > > > > > > > both
> > > > > > > > > > >> > > > > > > > > > > > > time
> > > > > > > > > > >> > > > > > > > > > > > > > > > based retention & size
> > based
> > > > > > retention
> > > > > > > > > to
> > > > > > > > > > >> the
> > > > > > > > > > >> > > same
> > > > > > > > > > >> > > > > > > > function.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note
> that
> > > this
> > > > > > new API
> > > > > > > > > > >> > introduced
> > > > > > > > > > >> > > > in
> > > > > > > > > > >> > > > > > this
> > > > > > > > > > >> > > > > > > > KIP
> > > > > > > > > > >> > > > > > > > > > > also
> > > > > > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric
> for
> > > total
> > > > > > size
> > > > > > > > of
> > > > > > > > > > >> data
> > > > > > > > > > >> > > > stored
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > > > remote
> > > > > > > > > > >> > > > > > > > > > > > > storage.
> > > > > > > > > > >> > > > > > > > > > > > > > > > Without the API,
> > > calculation of
> > > > > > this
> > > > > > > > > > metric
> > > > > > > > > > >> > will
> > > > > > > > > > >> > > > > become
> > > > > > > > > > >> > > > > > > > very
> > > > > > > > > > >> > > > > > > > > > > > > expensive
> > > > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> *listRemoteLogSegments().*
> > > > > > > > > > >> > > > > > > > > > > > > > > > I understand that your
> > > > > motivation
> > > > > > here
> > > > > > > > > is
> > > > > > > > > > to
> > > > > > > > > > >> > > avoid
> > > > > > > > > > >> > > > > > > > polluting
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > > > > > >> > > > > > > > > > > > > > > > with optimization
> specific
> > > APIs
> > > > > > and I
> > > > > > > > > will
> > > > > > > > > > >> > agree
> > > > > > > > > > >> > > > with
> > > > > > > > > > >> > > > > > > that
> > > > > > > > > > >> > > > > > > > > goal.
> > > > > > > > > > >> > > > > > > > > > > > But
> > > > > > > > > > >> > > > > > > > > > > > > I
> > > > > > > > > > >> > > > > > > > > > > > > > > > believe that this new
> API
> > > > > > proposed in
> > > > > > > > > the
> > > > > > > > > > >> KIP
> > > > > > > > > > >> > > > brings
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > > > > > significant
> > > > > > > > > > >> > > > > > > > > > > > > > > > improvement and there is
> > no
> > > > > other
> > > > > > work
> > > > > > > > > > >> around
> > > > > > > > > > >> > > > > available
> > > > > > > > > > >> > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > achieve
> > > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > same
> > > > > > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at
> > > 12:12 AM
> > > > > > Jun
> > > > > > > > Rao
> > > > > > > > > > >> > > > > > > > > <j...@confluent.io.invalid
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP.
> > Sorry
> > > for
> > > > > > the
> > > > > > > > late
> > > > > > > > > > >> reply.
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the
> > KIP
> > > is
> > > > > to
> > > > > > > > > improve
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > efficiency
> > > > > > > > > > >> > > > > > > of
> > > > > > > > > > >> > > > > > > > > size
> > > > > > > > > > >> > > > > > > > > > > > > based
> > > > > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not
> sure
> > > the
> > > > > > > > proposed
> > > > > > > > > > >> changes
> > > > > > > > > > >> > > are
> > > > > > > > > > >> > > > > > > enough.
> > > > > > > > > > >> > > > > > > > > For
> > > > > > > > > > >> > > > > > > > > > > > > > example,
> > > > > > > > > > >> > > > > > > > > > > > > > > if
> > > > > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the
> > > retention
> > > > > > size,
> > > > > > > > > we
> > > > > > > > > > >> need
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > > > > > determine
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > subset
> > > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to
> > > bring
> > > > > the
> > > > > > size
> > > > > > > > > > >> within
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > > > retention
> > > > > > > > > > >> > > > > > > > > > > limit.
> > > > > > > > > > >> > > > > > > > > > > > Do
> > > > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > > > > > >> > > > > > >
> > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > determine
> > > > > > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Also, what about
> > > time-based
> > > > > > > > retention?
> > > > > > > > > > To
> > > > > > > > > > >> > make
> > > > > > > > > > >> > > > that
> > > > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > > > >> > > > > > > > > > > do
> > > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > > > >> > > > > > > > > > > > > > > > > to make some
> additional
> > > > > > interface
> > > > > > > > > > changes?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > An alternative
> approach
> > > is for
> > > > > > the
> > > > > > > > > RLMM
> > > > > > > > > > >> > > > implementor
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > make
> > > > > > > > > > >> > > > > > > > > > > sure
> > > > > > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > > > > > >> > > > >
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > is
> > > > > > > > > > >> > > > > > > > > fast
> > > > > > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > > > >> > > > > > > > > > > > > > > > > local caching). This
> > way,
> > > we
> > > > > > could
> > > > > > > > > keep
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > interface
> > > > > > > > > > >> > > > > > > > > simple.
> > > > > > > > > > >> > > > > > > > > > > > Have
> > > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022
> at
> > > 6:28
> > > > > AM
> > > > > > > > Divij
> > > > > > > > > > >> Vaidya
> > > > > > > > > > >> > <
> > > > > > > > > > >> > > > > > > > > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else
> have
> > > any
> > > > > > thoughts
> > > > > > > > > on
> > > > > > > > > > >> this
> > > > > > > > > > >> > > > > before I
> > > > > > > > > > >> > > > > > > > > propose
> > > > > > > > > > >> > > > > > > > > > > > this
> > > > > > > > > > >> > > > > > > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022
> at
> > > 12:57
> > > > > > PM
> > > > > > > > > Satish
> > > > > > > > > > >> > > Duggana
> > > > > > > > > > >> > > > <
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> satish.dugg...@gmail.com
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP
> > > Divij!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice
> > > improvement
> > > > > > to
> > > > > > > > > avoid
> > > > > > > > > > >> > > > > recalculation
> > > > > > > > > > >> > > > > > > of
> > > > > > > > > > >> > > > > > > > > size.
> > > > > > > > > > >> > > > > > > > > > > > > > > Customized
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > can implement the
> > best
> > > > > > possible
> > > > > > > > > > >> approach
> > > > > > > > > > >> > by
> > > > > > > > > > >> > > > > > caching
> > > > > > > > > > >> > > > > > > > or
> > > > > > > > > > >> > > > > > > > > > > > > > maintaining
> > > > > > > > > > >> > > > > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient
> way.
> > > But
> > > > > > this is
> > > > > > > > > > not a
> > > > > > > > > > >> > big
> > > > > > > > > > >> > > > > > concern
> > > > > > > > > > >> > > > > > > > for
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > default
> > > > > > > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as
> > > mentioned in
> > > > > > the
> > > > > > > > > KIP.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul
> 2022
> > at
> > > > > > 18:48,
> > > > > > > > > Divij
> > > > > > > > > > >> > Vaidya
> > > > > > > > > > >> > > <
> > > > > > > > > > >> > > > > > > > > > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for
> your
> > > > > review
> > > > > > > > Luke.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that
> > > would the
> > > > > > new
> > > > > > > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > > > > > > >> > > > > > > > > metric
> > > > > > > > > > >> > > > > > > > > > > > be a
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > overhead?
> Although
> > > we
> > > > > > move the
> > > > > > > > > > >> > > calculation
> > > > > > > > > > >> > > > > to a
> > > > > > > > > > >> > > > > > > > > seperate
> > > > > > > > > > >> > > > > > > > > > > > API,
> > > > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > > > > > > > still
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume
> users
> > > will
> > > > > > > > > implement
> > > > > > > > > > a
> > > > > > > > > > >> > > > > > light-weight
> > > > > > > > > > >> > > > > > > > > method,
> > > > > > > > > > >> > > > > > > > > > > > > right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric
> would
> > be
> > > > > > logged
> > > > > > > > > using
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > information
> > > > > > > > > > >> > > > > > > > > that is
> > > > > > > > > > >> > > > > > > > > > > > > > already
> > > > > > > > > > >> > > > > > > > > > > > > > > > > being
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for
> > > handling
> > > > > > remote
> > > > > > > > > > >> > retention
> > > > > > > > > > >> > > > > logic,
> > > > > > > > > > >> > > > > > > > > hence, no
> > > > > > > > > > >> > > > > > > > > > > > > > > > additional
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > is required to
> > > calculate
> > > > > > this
> > > > > > > > > > >> metric.
> > > > > > > > > > >> > > More
> > > > > > > > > > >> > > > > > > > > specifically,
> > > > > > > > > > >> > > > > > > > > > > > > > whenever
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager
> > > calls
> > > > > > > > > > >> getRemoteLogSize
> > > > > > > > > > >> > > > API,
> > > > > > > > > > >> > > > > > this
> > > > > > > > > > >> > > > > > > > > metric
> > > > > > > > > > >> > > > > > > > > > > > > would
> > > > > > > > > > >> > > > > > > > > > > > > > be
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is
> > > made
> > > > > > every
> > > > > > > > time
> > > > > > > > > > >> > > > > > RemoteLogManager
> > > > > > > > > > >> > > > > > > > > wants
> > > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > > handle
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > remote log
> > segments
> > > > > (which
> > > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > >> > > > > periodic).
> > > > > > > > > > >> > > > > > > > Does
> > > > > > > > > > >> > > > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > > > > address
> > > > > > > > > > >> > > > > > > > > > > > > > > > > your
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12,
> > > 2022 at
> > > > > > 11:01
> > > > > > > > AM
> > > > > > > > > > >> Luke
> > > > > > > > > > >> > > Chen
> > > > > > > > > > >> > > > <
> > > > > > > > > > >> > > > > > > > > > > > > show...@gmail.com>
> > > > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the
> > > KIP!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it
> makes
> > > sense
> > > > > > to
> > > > > > > > > > delegate
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > > > > > responsibility
> > > > > > > > > > >> > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > > > > > > RemoteLogMetadataManager
> > > > > > > > > > >> > > > > > > implementation.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing
> > I'm
> > > not
> > > > > > quite
> > > > > > > > > > sure,
> > > > > > > > > > >> is
> > > > > > > > > > >> > > that
> > > > > > > > > > >> > > > > > would
> > > > > > > > > > >> > > > > > > > > the new
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > `RemoteLogSizeBytes`
> > > > > > metric
> > > > > > > > > be a
> > > > > > > > > > >> > > > > performance
> > > > > > > > > > >> > > > > > > > > overhead?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we
> move
> > > the
> > > > > > > > > calculation
> > > > > > > > > > >> to a
> > > > > > > > > > >> > > > > > seperate
> > > > > > > > > > >> > > > > > > > > API, we
> > > > > > > > > > >> > > > > > > > > > > > > still
> > > > > > > > > > >> > > > > > > > > > > > > > > > can't
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > users will
> > > implement a
> > > > > > > > > > >> light-weight
> > > > > > > > > > >> > > > method,
> > > > > > > > > > >> > > > > > > > right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1,
> > > 2022 at
> > > > > > 5:47
> > > > > > > > PM
> > > > > > > > > > >> Divij
> > > > > > > > > > >> > > > > Vaidya <
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> divijvaidy...@gmail.com
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take
> a
> > > look
> > > > > at
> > > > > > this
> > > > > > > > > KIP
> > > > > > > > > > >> > which
> > > > > > > > > > >> > > > > > proposes
> > > > > > > > > > >> > > > > > > > an
> > > > > > > > > > >> > > > > > > > > > > > > extension
> > > > > > > > > > >> > > > > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first
> > KIP
> > > with
> > > > > > > > Apache
> > > > > > > > > > >> Kafka
> > > > > > > > > > >> > > > > community
> > > > > > > > > > >> > > > > > > so
> > > > > > > > > > >> > > > > > > > > any
> > > > > > > > > > >> > > > > > > > > > > > > feedback
> > > > > > > > > > >> > > > > > > > > > > > > > > > would
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software
> > > > > Engineer
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Reply via email to