Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2021-01-09 Thread Satish Duggana
Hi Jun,
Thanks for your comments. Please find the inline replies below.

6022. For packages used for server plugins, the convention is to
use org.apache.kafka.server. See java-based Authorizer as an example.

Sure, 'org.apache.kafka.common.log.remote.storage' is renamed to
'org.apache.kafka.server.log.remote.storage'. Updated in the KIP.
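For example, a plugin implementation would then import the interface from the
renamed package (a purely illustrative skeleton; the plugin package and class
names below are placeholders, and the full method set is defined in the KIP):

// Illustrative skeleton only; a real plugin implements every method of the
// KIP's RemoteStorageManager. Declared abstract here just to keep it short.
package com.example.tieredstorage;

import org.apache.kafka.server.log.remote.storage.RemoteStorageManager;

public abstract class ExampleRemoteStorageManager implements RemoteStorageManager {
    // copy/fetch/delete methods from the KIP's RemoteStorageManager go here.
}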

9100. Do we need DeletePartitionStateRecord in flat_file_format? The flat
file captures the state of the remote segments. After a partition is
deleted, it seems that we just need to remove the partition's remote
segments from the flat file.

DeletePartitionState might not yet have been fully processed by the RPM when
the flat file is written. Once a partition reaches the
DELETE_PARTITION_FINISHED state, it will no longer appear in the flat-file
format.

9101. Upgrade: It will be useful to allow direct upgrade from an old
version. It seems that's doable. One can just do the normal upgrade first
and wait enough time (for producer snapshots to be built), and then enable
remote storage.

Upgrade notes are updated in the KIP.

9102. RemotePartitionRemover(RPM) process: Is it true that RPM starts
tracking the remote segments when RLMM.onPartitionLeadershipChanges() is
called with the broker being the leader for __remote_log_metadata
partition? If so, could we document it?

The current plan is to have that as part of RLMM, and the RPM uses it to
get the list of remote segments. I will add this detail in the respective
sections.
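To make the flow concrete, it will roughly look like the sketch below (method
and variable names are illustrative and follow the interfaces discussed in this
thread, not necessarily the final API):

// Sketch: once RLMM.onPartitionLeadershipChanges() marks this broker as the
// leader for a __remote_log_metadata partition, RLMM materializes the remote
// segment metadata for the user partitions mapped to it. The RPM then asks
// RLMM for the segments of partitions marked for deletion and removes them
// through the RemoteStorageManager (delete method name is illustrative).
void deleteRemoteSegmentsOfMarkedPartitions(RemoteLogMetadataManager rlmm,
                                            RemoteStorageManager rsm,
                                            Set<TopicPartition> partitionsMarkedForDeletion)
        throws RemoteStorageException {
    for (TopicPartition partition : partitionsMarkedForDeletion) {
        Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(partition);
        while (segments.hasNext()) {
            rsm.deleteLogSegmentData(segments.next());
        }
    }
}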

Satish.

On Wed, 16 Dec 2020 at 23:55, Jun Rao  wrote:
>
> Hi, Satish,
>
> Thanks for the reply. A few more followup comments.
>
> 6022. For packages used for server plugins, the convention is to
> use org.apache.kafka.server. See java-based Authorizer as an example.
>
> 9100. Do we need DeletePartitionStateRecord in flat_file_format? The flat
> file captures the state of the remote segments. After a partition is
> deleted, it seems that we just need to remove the partition's remote
> segments from the flat file.
>
> 9101. Upgrade: It will be useful to allow direct upgrade from an old
> version. It seems that's doable. One can just do the normal upgrade first
> and wait enough time (for producer snapshots to be built), and then enable
> remote storage.
>
> 9102. RemotePartitionRemover(RPM) process: Is it true that RPM starts
> tracking the remote segments when RLMM.onPartitionLeadershipChanges() is
> called with the broker being the leader for __remote_log_metadata
> partition? If so, could we document it?
>
> Jun
>
> On Tue, Dec 15, 2020 at 8:47 AM Kowshik Prakasam 
> wrote:
>
> > Hi Satish,
> >
> > Thanks for the updates! A few more comments below.
> >
> > 9001. Under the "Upgrade" section, there is a line mentioning: "Upgrade the
> > existing Kafka cluster to 2.7 version and allow this to run for the log
> > retention of user topics that you want to enable tiered storage. This will
> > allow all the topics to have the producer snapshots generated for each log
> > segment." -- Which associated change in AK were you referring to here? Is
> > it: https://github.com/apache/kafka/pull/7929 ? It seems like I don't see
> > it in the 2.7 release branch yet, here is the link:
> > https://github.com/apache/kafka/commits/2.7.
> >
> > 9002. Under the "Upgrade" section, the configuration mentioned is
> > 'remote.log.storage.system.enable'. However, under "Public Interfaces"
> > section the corresponding configuration is 'remote.storage.system.enable'.
> > Could we use the same one in both, maybe
> > 'remote.log.storage.system.enable'?
> >
> > 9003. Under "Per Topic Configuration", the KIP recommends setting
> > 'remote.log.storage.enable' to true at a per-topic level. It will be useful
> > to add a line that if the user wants to enable it for all topics, then they
> > should be able to set the cluster-wide default to true. Also, it will be
> > useful to mention that the KIP currently does not support setting it to
> > false (after it is set to true), and add that to the future work section.
> >
> > 9004. Under "Committed offsets file format", the sample provided shows
> > partition number and offset. Is the topic name required for identifying
> > which topic the partitions belong to?
> >
> > 9005. Under "Internal flat-file store format of remote log metadata", it
> > seems useful to specify both topic name and topic ID for debugging
> > purposes.
> >
> > 9006. Under "Internal flat-file store format of remote log metadata", the
> > description of "metadata-topic-offset" currently says "offset of the remote
> > log metadata topic from which this topic partition's remote log metadata is
> > fetched." Just for the wording, perhaps you meant to refer to the offset
> > up to which the file has been committed? i.e. "offset of the remote log
> > metadata topic up to which this topic partition's remote log metadata has
> > been committed into this file."
> >
> > 9007. Under "Internal flat-file store format of remote log metadata", the
> > schema of the payload (i.e. beyond the header) seems to contain the events
> > from 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2021-01-09 Thread Satish Duggana
Hi Kowshik,
Thanks for your comments. Please find the inline replies below.

9002. Under the "Upgrade" section, the configuration mentioned is
'remote.log.storage.system.enable'. However, under "Public Interfaces"
section the corresponding configuration is 'remote.storage.system.enable'.
Could we use the same one in both, maybe 'remote.log.storage.system.enable'?

Nice catch, updated the KIP.

9003. Under "Per Topic Configuration", the KIP recommends setting
'remote.log.storage.enable' to true at a per-topic level. It will be useful
to add a line that if the user wants to enable it for all topics, then they
should be able to set the cluster-wide default to true. Also, it will be
useful to mention that the KIP currently does not support setting it to
false (after it is set to true), and add that to the future work section.

We do not want to expose a cluster-level config in the initial version; we
will add that in the future. Both limitations have been added to the future
work section.
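For illustration, enabling it on an existing topic would then look roughly like
the snippet below (the topic name and bootstrap address are placeholders; the
config names are the ones discussed above, and the broker itself needs
remote.log.storage.system.enable=true):

// Uses org.apache.kafka.clients.admin.* and org.apache.kafka.common.config.ConfigResource.
static void enableTieredStorage(String bootstrapServers, String topicName) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    try (Admin admin = Admin.create(props)) {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
        AlterConfigOp enableTiering = new AlterConfigOp(
                new ConfigEntry("remote.log.storage.enable", "true"), AlterConfigOp.OpType.SET);
        admin.incrementalAlterConfigs(
                Collections.singletonMap(topic, Collections.singletonList(enableTiering))).all().get();
    }
}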


9004. Under "Committed offsets file format", the sample provided shows
partition number and offset. Is the topic name required for identifying
which topic the partitions belong to?
The file name is already mentioned as '_rlmm_committed_offsets', and we
already know the internal remote log metadata topic name, which is never
going to change.
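For completeness, a reader of that file could load it roughly as below (a
sketch assuming one "<partition> <offset>" pair per line as in the KIP sample;
the exact delimiter and file location are defined by the KIP, not by this
sketch):

// Uses java.nio.file.{Path, Files} and java.util.{Map, HashMap}.
static Map<Integer, Long> readCommittedOffsets(Path committedOffsetsFile) throws IOException {
    Map<Integer, Long> offsets = new HashMap<>();
    for (String line : Files.readAllLines(committedOffsetsFile)) {
        if (line.isEmpty()) continue;                 // skip blank lines
        String[] parts = line.trim().split("\\s+");   // "<partition> <offset>"
        offsets.put(Integer.parseInt(parts[0]), Long.parseLong(parts[1]));
    }
    return offsets;
}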

9005. Under "Internal flat-file store format of remote log metadata", it
seems useful to specify both topic name and topic ID for debugging
purposes.

That makes sense, updated.

9006. Under "Internal flat-file store format of remote log metadata", the
description of "metadata-topic-offset" currently says "offset of the remote
log metadata topic from which this topic partition's remote log metadata is
fetched." Just for the wording, perhaps you meant to refer to the offset
up to which the file has been committed? i.e. "offset of the remote log
metadata topic up to which this topic partition's remote log metadata has
been committed into this file."
Updated

9007. Under "Internal flat-file store format of remote log metadata", the
schema of the payload (i.e. beyond the header) seems to contain the events
from the metadata topic. It seems useful to instead persist the
representation of the materialized state of the events, so that for the
same segment only the latest state is stored. Besides reducing storage
footprint, this also is likely to relate directly with the in-memory
representation of the RLMM cache (which probably is some kind of a Map with
key being segment ID and value being the segment state), so recovery from
disk will be straightforward.

This is what we already do, and it was clarified in the earlier meeting.
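In other words, the flat file stores the materialized view rather than the raw
events, roughly along these lines (the class shape is illustrative only, not
the KIP's internal implementation):

// Only the latest state per segment is kept, so the flat-file snapshot and the
// in-memory RLMM cache have the same shape and recovery is a straight replay.
class RemoteLogMetadataCache {
    private final Map<RemoteLogSegmentId, RemoteLogSegmentMetadata> segments =
            new ConcurrentHashMap<>();

    // Applying an event from the metadata topic overwrites any earlier state
    // for the same segment id.
    void update(RemoteLogSegmentMetadata metadata) {
        segments.put(metadata.remoteLogSegmentId(), metadata);
    }

    Collection<RemoteLogSegmentMetadata> snapshotForFlatFile() {
        return new ArrayList<>(segments.values());
    }
}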

9008. Under "Topic deletion lifecycle", step (1), it will be useful to
mention when in the deletion flow does the controller publish the
delete_partition_marked event to say that the partition is marked for
deletion?
Updated.

9009. There are ~4 TODOs in the KIP. Could you please address these or
remove them?
Updated.

9010. There is a reference to a Google doc on the KIP which was used
earlier for discussions. Please could you remove the reference, since the
KIP is the source of the truth?

Which doc reference are you referring to?

9011. This feedback is from an earlier comment. In the RemoteStorageManager
interface, there is an API defined for each file type. For example,
fetchOffsetIndex, fetchTimestampIndex etc. To avoid the duplication, I'd
suggest we can instead have a FileType enum and a common get API based on
the FileType. What do you think?

Sure, updated in the KIP.
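For reference, the consolidated shape is now roughly as below (the enum
constants and method name here are illustrative; the KIP text is
authoritative):

enum IndexType { OFFSET, TIMESTAMP, PRODUCER_SNAPSHOT, LEADER_EPOCH, TRANSACTION }

// Single fetch API instead of fetchOffsetIndex/fetchTimestampIndex/... per file type.
InputStream fetchIndex(RemoteLogSegmentMetadata remoteLogSegmentMetadata,
                       IndexType indexType) throws RemoteStorageException;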



On Tue, 15 Dec 2020 at 22:17, Kowshik Prakasam  wrote:
>
> Hi Satish,
>
> Thanks for the updates! A few more comments below.
>
> 9001. Under the "Upgrade" section, there is a line mentioning: "Upgrade the
> existing Kafka cluster to 2.7 version and allow this to run for the log
> retention of user topics that you want to enable tiered storage. This will
> allow all the topics to have the producer snapshots generated for each log
> segment." -- Which associated change in AK were you referring to here? Is
> it: https://github.com/apache/kafka/pull/7929 ? It seems like I don't see
> it in the 2.7 release branch yet, here is the link:
> https://github.com/apache/kafka/commits/2.7.
>
> 9002. Under the "Upgrade" section, the configuration mentioned is
> 'remote.log.storage.system.enable'. However, under "Public Interfaces"
> section the corresponding configuration is 'remote.storage.system.enable'.
> Could we use the same one in both, maybe 'remote.log.storage.system.enable'?
>
> 9003. Under "Per Topic Configuration", the KIP recommends setting
> 'remote.log.storage.enable' to true at a per-topic level. It will be useful
> to add a line that if the user wants to enable it for all topics, then they
> should be able to set the cluster-wide default to true. Also, it will be
> useful to mention that the KIP currently does not support setting it to
> false (after it is set to true), and add that to the future work section.
>
> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-16 Thread Jun Rao
Hi, Satish,

Thanks for the reply. A few more followup comments.

6022. For packages used for server plugins, the convention is to
use org.apache.kafka.server. See java-based Authorizer as an example.

9100. Do we need DeletePartitionStateRecord in flat_file_format? The flat
file captures the state of the remote segments. After a partition is
deleted, it seems that we just need to remove the partition's remote
segments from the flat file.

9101. Upgrade: It will be useful to allow direct upgrade from an old
version. It seems that's doable. One can just do the normal upgrade first
and wait enough time (for producer snapshots to be built), and then enable
remote storage.

9102. RemotePartitionRemover(RPM) process: Is it true that RPM starts
tracking the remote segments when RLMM.onPartitionLeadershipChanges() is
called with the broker being the leader for __remote_log_metadata
partition? If so, could we document it?

Jun

On Tue, Dec 15, 2020 at 8:47 AM Kowshik Prakasam 
wrote:

> Hi Satish,
>
> Thanks for the updates! A few more comments below.
>
> 9001. Under the "Upgrade" section, there is a line mentioning: "Upgrade the
> existing Kafka cluster to 2.7 version and allow this to run for the log
> retention of user topics that you want to enable tiered storage. This will
> allow all the topics to have the producer snapshots generated for each log
> segment." -- Which associated change in AK were you referring to here? Is
> it: https://github.com/apache/kafka/pull/7929 ? It seems like I don't see
> it in the 2.7 release branch yet, here is the link:
> https://github.com/apache/kafka/commits/2.7.
>
> 9002. Under the "Upgrade" section, the configuration mentioned is
> 'remote.log.storage.system.enable'. However, under "Public Interfaces"
> section the corresponding configuration is 'remote.storage.system.enable'.
> Could we use the same one in both, maybe
> 'remote.log.storage.system.enable'?
>
> 9003. Under "Per Topic Configuration", the KIP recommends setting
> 'remote.log.storage.enable' to true at a per-topic level. It will be useful
> to add a line that if the user wants to enable it for all topics, then they
> should be able to set the cluster-wide default to true. Also, it will be
> useful to mention that the KIP currently does not support setting it to
> false (after it is set to true), and add that to the future work section.
>
> 9004. Under "Committed offsets file format", the sample provided shows
> partition number and offset. Is the topic name required for identifying
> which topic the partitions belong to?
>
> 9005. Under "Internal flat-file store format of remote log metadata", it
> seems useful to specify both topic name and topic ID for debugging
> purposes.
>
> 9006. Under "Internal flat-file store format of remote log metadata", the
> description of "metadata-topic-offset" currently says "offset of the remote
> log metadata topic from which this topic partition's remote log metadata is
> fetched." Just for the wording, perhaps you meant to refer to the offset
> up to which the file has been committed? i.e. "offset of the remote log
> metadata topic up to which this topic partition's remote log metadata has
> been committed into this file."
>
> 9007. Under "Internal flat-file store format of remote log metadata", the
> schema of the payload (i.e. beyond the header) seems to contain the events
> from the metadata topic. It seems useful to instead persist the
> representation of the materialized state of the events, so that for the
> same segment only the latest state is stored. Besides reducing storage
> footprint, this also is likely to relate directly with the in-memory
> representation of the RLMM cache (which probably is some kind of a Map with
> key being segment ID and value being the segment state), so recovery from
> disk will be straightforward.
>
> 9008. Under "Topic deletion lifecycle", step (1), it will be useful to
> mention when in the deletion flow does the controller publish the
> delete_partition_marked event to say that the partition is marked for
> deletion?
>
> 9009. There are ~4 TODOs in the KIP. Could you please address these or
> remove them?
>
> 9010. There is a reference to a Google doc on the KIP which was used
> earlier for discussions. Please could you remove the reference, since the
> KIP is the source of the truth?
>
> 9011. This feedback is from an earlier comment. In the RemoteStorageManager
> interface, there is an API defined for each file type. For example,
> fetchOffsetIndex, fetchTimestampIndex etc. To avoid the duplication, I'd
> suggest we can instead have a FileType enum and a common get API based on
> the FileType. What do you think?
>
>
> Cheers,
> Kowshik
>
>
> On Mon, Dec 14, 2020 at 11:07 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> > Thanks for your comments. Please go through the inline replies.
> >
> >
> > 5102.2: It seems that both positions can just be int. Another option is
> to
> > have two methods. Would it be clearer?
> >
> > InputStream 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-15 Thread Kowshik Prakasam
Hi Satish,

Thanks for the updates! A few more comments below.

9001. Under the "Upgrade" section, there is a line mentioning: "Upgrade the
existing Kafka cluster to 2.7 version and allow this to run for the log
retention of user topics that you want to enable tiered storage. This will
allow all the topics to have the producer snapshots generated for each log
segment." -- Which associated change in AK were you referring to here? Is
it: https://github.com/apache/kafka/pull/7929 ? It seems like I don't see
it in the 2.7 release branch yet, here is the link:
https://github.com/apache/kafka/commits/2.7.

9002. Under the "Upgrade" section, the configuration mentioned is
'remote.log.storage.system.enable'. However, under "Public Interfaces"
section the corresponding configuration is 'remote.storage.system.enable'.
Could we use the same one in both, maybe 'remote.log.storage.system.enable'?

9003. Under "Per Topic Configuration", the KIP recommends setting
'remote.log.storage.enable' to true at a per-topic level. It will be useful
to add a line that if the user wants to enable it for all topics, then they
should be able to set the cluster-wide default to true. Also, it will be
useful to mention that the KIP currently does not support setting it to
false (after it is set to true), and add that to the future work section.

9004. Under "Committed offsets file format", the sample provided shows
partition number and offset. Is the topic name required for identifying
which topic the partitions belong to?

9005. Under "Internal flat-file store format of remote log metadata", it
seems useful to specify both topic name and topic ID for debugging
purposes.

9006. Under "Internal flat-file store format of remote log metadata", the
description of "metadata-topic-offset" currently says "offset of the remote
log metadata topic from which this topic partition's remote log metadata is
fetched." Just for the wording, perhaps you meant to refer to the offset
up to which the file has been committed? i.e. "offset of the remote log
metadata topic up to which this topic partition's remote log metadata has
been committed into this file."

9007. Under "Internal flat-file store format of remote log metadata", the
schema of the payload (i.e. beyond the header) seems to contain the events
from the metadata topic. It seems useful to instead persist the
representation of the materialized state of the events, so that for the
same segment only the latest state is stored. Besides reducing storage
footprint, this also is likely to relate directly with the in-memory
representation of the RLMM cache (which probably is some kind of a Map with
key being segment ID and value being the segment state), so recovery from
disk will be straightforward.

9008. Under "Topic deletion lifecycle", step (1), it will be useful to
mention when in the deletion flow does the controller publish the
delete_partition_marked event to say that the partition is marked for
deletion?

9009. There are ~4 TODOs in the KIP. Could you please address these or
remove them?

9010. There is a reference to a Google doc on the KIP which was used
earlier for discussions. Please could you remove the reference, since the
KIP is the source of the truth?

9011. This feedback is from an earlier comment. In the RemoteStorageManager
interface, there is an API defined for each file type. For example,
fetchOffsetIndex, fetchTimestampIndex etc. To avoid the duplication, I'd
suggest we can instead have a FileType enum and a common get API based on
the FileType. What do you think?


Cheers,
Kowshik


On Mon, Dec 14, 2020 at 11:07 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for your comments. Please go through the inline replies.
>
>
> 5102.2: It seems that both positions can just be int. Another option is to
> have two methods. Would it be clearer?
>
> InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
> remoteLogSegmentMetadata, int startPosition) throws RemoteStorageException;
>
> InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
> remoteLogSegmentMetadata, int startPosition, int endPosition) throws
> RemoteStorageException;
>
> That makes sense to me, updated the KIP.
>
> 6003: Could you also update the javadoc for the return value?
>
> Updated.
>
> 6020: local.log.retention.bytes: Should it default to log.retention.bytes
> to be consistent with local.log.retention.ms?
>
> Yes, it can be defaulted to log.retention.bytes.
>
> 6021: Could you define TopicIdPartition?
>
> Added TopicIdPartition in the KIP.
>
> 6022: For all public facing classes, could you specify the package name?
>
> Updated.
>
>
> Thanks,
> Satish.
>
> On Tue, Dec 8, 2020 at 12:59 AM Jun Rao  wrote:
> >
> > Hi, Satish,
> >
> > Thanks for the reply. A few more comments below.
> >
> > 5102.2: It seems that both positions can just be int. Another option is
> to
> > have two methods. Would it be clearer?
> >
> > InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
> > remoteLogSegmentMetadata,
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-14 Thread Satish Duggana
Hi Jun,
Thanks for your comments. Please go through the inline replies.


5102.2: It seems that both positions can just be int. Another option is to
have two methods. Would it be clearer?

InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata, int startPosition) throws RemoteStorageException;

InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata, int startPosition, int endPosition) throws
RemoteStorageException;

That makes sense to me, updated the KIP.
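As a side note, an RSM implementation honouring the two variants could look
roughly like the sketch below. It assumes a filesystem-backed store;
resolveSegmentPath is a hypothetical helper, and endPosition is treated as
inclusive, per the discussion elsewhere in this thread:

// Uses java.io.InputStream, java.io.IOException and java.nio.file.Files.
public InputStream fetchLogSegmentData(RemoteLogSegmentMetadata metadata,
                                       int startPosition) throws RemoteStorageException {
    // Read from startPosition to the end of the segment.
    return fetchLogSegmentData(metadata, startPosition, Integer.MAX_VALUE);
}

public InputStream fetchLogSegmentData(RemoteLogSegmentMetadata metadata,
                                       int startPosition, int endPosition)
        throws RemoteStorageException {
    try {
        InputStream in = Files.newInputStream(resolveSegmentPath(metadata)); // hypothetical helper
        long skipped = in.skip(startPosition);   // position the stream at startPosition
        if (skipped < startPosition)
            throw new RemoteStorageException("Segment shorter than startPosition " + startPosition);
        // A bounded wrapper would cap reads at (endPosition - startPosition + 1) bytes.
        return in;
    } catch (IOException e) {
        throw new RemoteStorageException(e);
    }
}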

6003: Could you also update the javadoc for the return value?

Updated.

6020: local.log.retention.bytes: Should it default to log.retention.bytes
to be consistent with local.log.retention.ms?

Yes, it can be defaulted to log.retention.bytes.
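That is, the effective value would be resolved roughly as below (a sketch of
the intended defaulting, mirroring local.log.retention.ms; the names are
illustrative):

static long effectiveLocalRetentionBytes(Long localLogRetentionBytes, long logRetentionBytes) {
    // If local.log.retention.bytes is not set, fall back to log.retention.bytes.
    return localLogRetentionBytes != null ? localLogRetentionBytes : logRetentionBytes;
}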

6021: Could you define TopicIdPartition?

Added TopicIdPartition in the KIP.
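For readers following the thread, the new type is essentially a topic id paired
with the existing TopicPartition, along these lines (illustrative shape only;
the KIP definition is authoritative):

public final class TopicIdPartition {
    private final UUID topicId;                    // topic id from KIP-516
    private final TopicPartition topicPartition;

    public TopicIdPartition(UUID topicId, TopicPartition topicPartition) {
        this.topicId = Objects.requireNonNull(topicId);
        this.topicPartition = Objects.requireNonNull(topicPartition);
    }

    public UUID topicId() { return topicId; }
    public TopicPartition topicPartition() { return topicPartition; }
}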

6022: For all public facing classes, could you specify the package name?

Updated.


Thanks,
Satish.

On Tue, Dec 8, 2020 at 12:59 AM Jun Rao  wrote:
>
> Hi, Satish,
>
> Thanks for the reply. A few more comments below.
>
> 5102.2: It seems that both positions can just be int. Another option is to
> have two methods. Would it be clearer?
>
> InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
> remoteLogSegmentMetadata,
> int startPosition) throws
> RemoteStorageException;
>
> InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
> remoteLogSegmentMetadata,
> int startPosition, int endPosition)
> throws RemoteStorageException;
>
> 6003: Could you also update the javadoc for the return value?
>
> 6010: What kind of tiering throughput have you seen with 5 threads?
>
> 6020: local.log.retention.bytes: Should it default to log.retention.bytes
> to be consistent with local.log.retention.ms?
>
> 6021: Could you define TopicIdPartition?
>
> 6022: For all public facing classes, could you specify the package name?
>
> It seems that you already added the topicId support. Two other remaining
> items are (a) the format of local tier metadata storage and (b) upgrade.
>
> Jun
>
> On Mon, Dec 7, 2020 at 8:56 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> > Thanks for your comments. Please find the inline replies below.
> >
> > >605.2 It's rare for the follower to need the remote data. So, the current
> > approach is fine too. Could you document the process of rebuilding the
> > producer state since we can't simply trim the producerState to an offset in
> > the middle of a segment.
> >
> > Will clarify in the KIP.
> >
> > >5102.2 Would it be clearer to make startPosition long and endPosition of
> > Optional?
> >
> > We will have argument checks with the respective validation. It is not a
> > good practice to have Optional arguments, as mentioned here:
> > https://rules.sonarsource.com/java/RSPEC-3553
> >
> >
> > >5102.5 LogSegmentData still has leaderEpochIndex as File instead of
> > ByteBuffer.
> >
> > Updated.
> >
> > >5102.7 Could you define all public methods for LogSegmentData?
> >
> > Updated.
> >
> > >5103.5 Could you change the reference to rlm_process_interval_ms and
> > rlm_retry_interval_ms to the new config names? Also, the retry interval
> > config seems still missing. It would be useful to support exponential
> > backoff with the retry interval config.
> >
> > Good point. We wanted the retry with truncated exponential backoff,
> > updated the KIP.
> >
> > >5111. "RLM follower fetches the earliest offset for the earliest leader
> > epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, int
> > leaderEpoch) and updates that as the log start offset." This text is still
> > there. Also, could we remove earliestLogOffset() from RLMM?
> >
> > Updated.
> >
> > >5115. There are still references to "remote log cleaners".
> >
> > Updated.
> >
> > >6000. Since we are returning new error codes, we need to bump up the
> > protocol version for Fetch request. Also, it will be useful to document all
> > new error codes and whether they are retriable or not.
> >
> > Sure, we will add that in the KIP.
> >
> > >6001. public Map segmentLeaderEpochs(): Currently, leaderEpoch
> > is int32 instead of long.
> >
> > Updated.
> >
> > >6002. Is RemoteLogSegmentMetadata.markedForDeletion() needed given
> > RemoteLogSegmentMetadata.state()?
> >
> > No, it is fixed.
> >
> > >6003. RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
> > topicPartition, long offset, int epochForOffset): Should this return
> > Optional?
> >
> > That makes sense, updated.
> >
> > >6005. RemoteLogState: It seems it's better to split it between
> > DeletePartitionUpdate and RemoteLogSegmentMetadataUpdate since the states
> > are never shared between the two use cases.
> >
> > Agree with that, updated.
> >
> > >6006. RLMM.onPartitionLeadershipChanges(): This may be ok. However, is it
> > true that other than the metadata topic, RLMM just needs to know whether
> > there is a replica assigned to this broker and doesn't need to 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-07 Thread Jun Rao
Hi, Satish,

Thanks for the reply. A few more comments below.

5102.2: It seems that both positions can just be int. Another option is to
have two methods. Would it be clearer?

InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata,
int startPosition) throws
RemoteStorageException;

InputStream fetchLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata,
int startPosition, int endPosition)
throws RemoteStorageException;

6003: Could you also update the javadoc for the return value?

6010: What kind of tiering throughput have you seen with 5 threads?

6020: local.log.retention.bytes: Should it default to log.retention.bytes
to be consistent with local.log.retention.ms?

6021: Could you define TopicIdPartition?

6022: For all public facing classes, could you specify the package name?

It seems that you already added the topicId support. Two other remaining
items are (a) the format of local tier metadata storage and (b) upgrade.

Jun

On Mon, Dec 7, 2020 at 8:56 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for your comments. Please find the inline replies below.
>
> >605.2 It's rare for the follower to need the remote data. So, the current
> approach is fine too. Could you document the process of rebuilding the
> producer state since we can't simply trim the producerState to an offset in
> the middle of a segment.
>
> Will clarify in the KIP.
>
> >5102.2 Would it be clearer to make startPosition long and endPosition of
> Optional?
>
> We will have argument checks with the respective validation. It is not a
> good practice to have Optional arguments, as mentioned here:
> https://rules.sonarsource.com/java/RSPEC-3553
>
>
> >5102.5 LogSegmentData still has leaderEpochIndex as File instead of
> ByteBuffer.
>
> Updated.
>
> >5102.7 Could you define all public methods for LogSegmentData?
>
> Updated.
>
> >5103.5 Could you change the reference to rlm_process_interval_ms and
> rlm_retry_interval_ms to the new config names? Also, the retry interval
> config seems still missing. It would be useful to support exponential
> backoff with the retry interval config.
>
> Good point. We wanted the retry with truncated exponential backoff,
> updated the KIP.
>
> >5111. "RLM follower fetches the earliest offset for the earliest leader
> epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, int
> leaderEpoch) and updates that as the log start offset." This text is still
> there. Also, could we remove earliestLogOffset() from RLMM?
>
> Updated.
>
> >5115. There are still references to "remote log cleaners".
>
> Updated.
>
> >6000. Since we are returning new error codes, we need to bump up the
> protocol version for Fetch request. Also, it will be useful to document all
> new error codes and whether they are retriable or not.
>
> Sure, we will add that in the KIP.
>
> >6001. public Map segmentLeaderEpochs(): Currently, leaderEpoch
> is int32 instead of long.
>
> Updated.
>
> >6002. Is RemoteLogSegmentMetadata.markedForDeletion() needed given
> RemoteLogSegmentMetadata.state()?
>
> No, it is fixed.
>
> >6003. RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
> topicPartition, long offset, int epochForOffset): Should this return
> Optional?
>
> That makes sense, updated.
>
> >6005. RemoteLogState: It seems it's better to split it between
> DeletePartitionUpdate and RemoteLogSegmentMetadataUpdate since the states
> are never shared between the two use cases.
>
> Agree with that, updated.
>
> >6006. RLMM.onPartitionLeadershipChanges(): This may be ok. However, is it
> true that other than the metadata topic, RLMM just needs to know whether
> there is a replica assigned to this broker and doesn't need to know whether
> the replica is the leader or the follower?
>
> That may be true. If the implementation does not need that, it can
> ignore the information in the callback.
>
> >6007: "Handle expired remote segments (leader and follower)": Why is this
> needed in both the leader and the follower?
>
> Updated.
>
> >6008.   "name": "SegmentSizeInBytes",
> "type": "int64",
> The segment size can just be int32.
>
> Updated.
>
> >6009. For the record format in the log, it seems that we need to add
> record
> type and record version before the serialized bytes. We can follow the
> convention used in
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-631%3A+The+Quorum-based+Kafka+Controller#KIP631:TheQuorumbasedKafkaController-RecordFormats
>
> Yes, KIP already mentions that these are serialized before the payload
> as below. We will mention explicitly that these two are written before
> the data is written.
>
> RLMM instance on broker publishes the message to the topic with key as
> null and value with the below format.
>
> type  : unsigned var int, represents the value type. This value is
> 'apikey' as mentioned in the schema.
> version : unsigned var int, the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-07 Thread Satish Duggana
Hi Jun,
Thanks for your comments. Please find the inline replies below.

>605.2 It's rare for the follower to need the remote data. So, the current
approach is fine too. Could you document the process of rebuilding the
producer state since we can't simply trim the producerState to an offset in
the middle of a segment.

Will clarify in the KIP.

>5102.2 Would it be clearer to make startPosition long and endPosition of
Optional?

We will have argument checks with the respective validation. It is not a good
practice to have Optional arguments, as mentioned here:
https://rules.sonarsource.com/java/RSPEC-3553
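Concretely, the checks would be along these lines (an illustrative sketch; the
exact messages and method placement are not part of the KIP):

static void validatePositions(int startPosition, int endPosition) {
    if (startPosition < 0)
        throw new IllegalArgumentException("startPosition must be non-negative: " + startPosition);
    if (endPosition < startPosition)
        throw new IllegalArgumentException(
                "endPosition must be greater than or equal to startPosition: " + endPosition);
}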


>5102.5 LogSegmentData still has leaderEpochIndex as File instead of
ByteBuffer.

Updated.

>5102.7 Could you define all public methods for LogSegmentData?

Updated.

>5103.5 Could you change the reference to rlm_process_interval_ms and
rlm_retry_interval_ms to the new config names? Also, the retry interval
config seems still missing. It would be useful to support exponential
backoff with the retry interval config.

Good point. We wanted the retry with truncated exponential backoff,
updated the KIP.

>5111. "RLM follower fetches the earliest offset for the earliest leader
epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, int
leaderEpoch) and updates that as the log start offset." This text is still
there. Also, could we remove earliestLogOffset() from RLMM?

Updated.

>5115. There are still references to "remote log cleaners".

Updated.

>6000. Since we are returning new error codes, we need to bump up the
protocol version for Fetch request. Also, it will be useful to document all
new error codes and whether they are retriable or not.

Sure, we will add that in the KIP.

>6001. public Map segmentLeaderEpochs(): Currently, leaderEpoch
is int32 instead of long.

Updated.

>6002. Is RemoteLogSegmentMetadata.markedForDeletion() needed given
RemoteLogSegmentMetadata.state()?

No, it is fixed.

>6003. RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
topicPartition, long offset, int epochForOffset): Should this return
Optional?

That makes sense, updated.

>6005. RemoteLogState: It seems it's better to split it between
DeletePartitionUpdate and RemoteLogSegmentMetadataUpdate since the states
are never shared between the two use cases.

Agree with that, updated.

>6006. RLMM.onPartitionLeadershipChanges(): This may be ok. However, is it
true that other than the metadata topic, RLMM just needs to know whether
there is a replica assigned to this broker and doesn't need to know whether
the replica is the leader or the follower?

That may be true. If the implementation does not need that, it can
ignore the information in the callback.

>6007: "Handle expired remote segments (leader and follower)": Why is this
needed in both the leader and the follower?

Updated.

>6008.   "name": "SegmentSizeInBytes",
"type": "int64",
The segment size can just be int32.

Updated.

>6009. For the record format in the log, it seems that we need to add record
type and record version before the serialized bytes. We can follow the
convention used in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-631%3A+The+Quorum-based+Kafka+Controller#KIP631:TheQuorumbasedKafkaController-RecordFormats

Yes, the KIP already mentions that these are serialized before the payload,
as shown below. We will mention explicitly that these two fields are written
before the data.

The RLMM instance on the broker publishes the message to the topic with a
null key and a value in the below format.

type  : unsigned var int, represents the value type. This value is
'apikey' as mentioned in the schema.
version : unsigned var int, the 'version' number of the type as
mentioned in the schema.
data  : record payload in kafka protocol message format.
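As an illustration, writing that envelope with Kafka's var-int helpers would
look roughly like the sketch below (the concrete apikey/version values come
from the schemas in the KIP; this is not the exact serializer code):

// Uses java.nio.ByteBuffer and org.apache.kafka.common.utils.ByteUtils.
// Envelope: unsigned var-int type ('apikey'), unsigned var-int version, then the payload.
static ByteBuffer serializeEnvelope(int apiKey, int version, byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(ByteUtils.sizeOfUnsignedVarint(apiKey)
            + ByteUtils.sizeOfUnsignedVarint(version) + payload.length);
    ByteUtils.writeUnsignedVarint(apiKey, buf);   // type: which schema the record uses
    ByteUtils.writeUnsignedVarint(version, buf);  // version of that schema
    buf.put(payload);                             // payload in Kafka protocol message format
    buf.flip();
    return buf;
}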


>6010. remote.log.manager.thread.pool.size: The default value is 10. This
might be too high when enabling the tiered feature for the first time.
Since there are lots of segments that need to be tiered initially, a large
number of threads could overwhelm the broker.

Is a default value of 5 reasonable?

6011. "The number of milli seconds to keep the local log segment before it
gets deleted. If not set, the value in `log.retention.minutes` is used. If
set to -1, no time limit is applied." We should use log.retention.ms
instead of log.retention.minutes.
Nice typo catch. Updated the KIP.

Thanks,
Satish.

On Thu, Dec 3, 2020 at 8:03 AM Jun Rao  wrote:
>
> Hi, Satish,
>
> Thanks for the updated KIP. A few more comments below.
>
> 605.2 It's rare for the follower to need the remote data. So, the current
> approach is fine too. Could you document the process of rebuilding the
> producer state since we can't simply trim the producerState to an offset in
> the middle of a segment.
>
> 5102.2 Would it be clearer to make startPosition long and endPosition of
> Optional?
>
> 5102.5 LogSegmentData still has leaderEpochIndex as File instead of
> ByteBuffer.
>
> 5102.7 Could you define all 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-02 Thread Jun Rao
Hi, Satish,

Thanks for the updated KIP. A few more comments below.

605.2 It's rare for the follower to need the remote data. So, the current
approach is fine too. Could you document the process of rebuilding the
producer state since we can't simply trim the producerState to an offset in
the middle of a segment.

5102.2 Would it be clearer to make startPosition long and endPosition of
Optional?

5102.5 LogSegmentData still has leaderEpochIndex as File instead of
ByteBuffer.

5102.7 Could you define all public methods for LogSegmentData?

5103.5 Could you change the reference to rlm_process_interval_ms and
rlm_retry_interval_ms to the new config names? Also, the retry interval
config seems still missing. It would be useful to support exponential
backoff with the retry interval config.

5111. "RLM follower fetches the earliest offset for the earliest leader
epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, int
leaderEpoch) and updates that as the log start offset." This text is still
there. Also, could we remove earliestLogOffset() from RLMM?

5115. There are still references to "remote log cleaners".

6000. Since we are returning new error codes, we need to bump up the
protocol version for Fetch request. Also, it will be useful to document all
new error codes and whether they are retriable or not.

6001. public Map segmentLeaderEpochs(): Currently, leaderEpoch
is int32 instead of long.

6002. Is RemoteLogSegmentMetadata.markedForDeletion() needed given
RemoteLogSegmentMetadata.state()?

6003. RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
topicPartition, long offset, int epochForOffset): Should this return
Optional?

6004. DeletePartitionUpdate.epoch(): It would be useful to pick a more
indicative name so that people understand what epoch this is.

6005. RemoteLogState: It seems it's better to split it between
DeletePartitionUpdate and RemoteLogSegmentMetadataUpdate since the states
are never shared between the two use cases.

6006. RLMM.onPartitionLeadershipChanges(): This may be ok. However, is it
true that other than the metadata topic, RLMM just needs to know whether
there is a replica assigned to this broker and doesn't need to know whether
the replica is the leader or the follower?

6007: "Handle expired remote segments (leader and follower)": Why is this
needed in both the leader and the follower?

6008.   "name": "SegmentSizeInBytes",
"type": "int64",
The segment size can just be int32.

6009. For the record format in the log, it seems that we need to add record
type and record version before the serialized bytes. We can follow the
convention used in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-631%3A+The+Quorum-based+Kafka+Controller#KIP631:TheQuorumbasedKafkaController-RecordFormats
.

6010. remote.log.manager.thread.pool.size: The default value is 10. This
might be too high when enabling the tiered feature for the first time.
Since there are lots of segments that need to be tiered initially, a large
number of threads could overwhelm the broker.

6011. "The number of milli seconds to keep the local log segment before it
gets deleted. If not set, the value in `log.retention.minutes` is used. If
set to -1, no time limit is applied." We should use log.retention.ms
instead of log.retention.minutes.

Jun

On Tue, Dec 1, 2020 at 2:42 AM Satish Duggana 
wrote:

> Hi,
> We updated the KIP with the points mentioned in the earlier mail
> except for KIP-516 related changes. You can go through them and let us
> know if you have any comments. We will update the KIP with the
> remaining todo items and KIP-516 related changes by end of this
> week(5th Dec).
>
> Thanks,
> Satish.
>
> On Tue, Nov 10, 2020 at 8:26 PM Satish Duggana 
> wrote:
> >
> > Hi Jun,
> > Thanks for your comments. Please find the inline replies below.
> >
> > 605.2 "Build the local leader epoch cache by cutting the leader epoch
> > sequence received from remote storage to [LSO, ELO]." I mentioned an
> issue
> > earlier. Suppose the leader's local start offset is 100. The follower
> finds
> > a remote segment covering offset range [80, 120). The producerState with
> > this remote segment is up to offset 120. To trim the producerState to
> > offset 100 requires more work since one needs to download the previous
> > producerState up to offset 80 and then replay the messages from 80 to
> 100.
> > It seems that it's simpler in this case for the follower just to take the
> > remote segment as it is and start fetching from offset 120.
> >
> > We chose that approach to avoid any edge cases here. It may be
> > possible that the remote log segment that is received may not have the
> > same leader epoch sequence from 100-120 as it contains on the
> > leader (this can happen due to unclean leader). It is safe to start
> > from what the leader returns here. Another way is to find the remote
> > log segment
> >
> > 5016. Just to echo what Kowshik was saying. It seems that
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-12-01 Thread Satish Duggana
Hi,
We updated the KIP with the points mentioned in the earlier mail
except for KIP-516 related changes. You can go through them and let us
know if you have any comments. We will update the KIP with the
remaining todo items and KIP-516 related changes by end of this
week (5th Dec).

Thanks,
Satish.

On Tue, Nov 10, 2020 at 8:26 PM Satish Duggana  wrote:
>
> Hi Jun,
> Thanks for your comments. Please find the inline replies below.
>
> 605.2 "Build the local leader epoch cache by cutting the leader epoch
> sequence received from remote storage to [LSO, ELO]." I mentioned an issue
> earlier. Suppose the leader's local start offset is 100. The follower finds
> a remote segment covering offset range [80, 120). The producerState with
> this remote segment is up to offset 120. To trim the producerState to
> offset 100 requires more work since one needs to download the previous
> producerState up to offset 80 and then replay the messages from 80 to 100.
> It seems that it's simpler in this case for the follower just to take the
> remote segment as it is and start fetching from offset 120.
>
> We chose that approach to avoid any edge cases here. It may be
> possible that the remote log segment that is received may not have the
> same leader epoch sequence from 100-120 as it contains on the
> leader (this can happen due to unclean leader). It is safe to start
> from what the leader returns here. Another way is to find the remote
> log segment
>
> 5016. Just to echo what Kowshik was saying. It seems that
> RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
> partition, not on the replicas for the __remote_log_segment_metadata
> partition. It's not clear how the leader of __remote_log_segment_metadata
> obtains the metadata for remote segments for deletion.
>
> RLMM will always receive the callback for the remote log metadata
> topic partitions hosted on the local broker and these will be
> subscribed. I will make this clear in the KIP.
>
> 5100. KIP-516 has been accepted and is being implemented now. Could you
> update the KIP based on topicID?
>
> We mentioned KIP-516 and how it helps. We will update this KIP with
> all the changes it brings with KIP-516.
>
> 5101. RLMM: It would be useful to clarify how the following two APIs are
> used. According to the wiki, the former is used for topic deletion and the
> latter is used for retention. It seems that retention should use the former
> since remote segments without a matching epoch in the leader (potentially
> due to unclean leader election) also need to be garbage collected. The
> latter seems to be used for the new leader to determine the last tiered
> segment.
> default Iterator
> listRemoteLogSegments(TopicPartition topicPartition)
> Iterator listRemoteLogSegments(TopicPartition
> topicPartition, long leaderEpoch);
>
> Right, that is what we are currently doing. We will update the
> javadocs and wiki with that. Earlier, we did not want to remove the
> segments which are not matched with leader epochs from the leader
> partition, as they may be used later by a replica which can become a
> leader (unclean leader election) and refer to those segments. But that
> may leak these segments in remote storage for the topic's lifetime. We
> decided to clean up the segments starting with the oldest in the case of
> size-based retention also.
>
> 5102. RSM:
> 5102.1 For methods like fetchLogSegmentData(), it seems that they can
> use RemoteLogSegmentId instead of RemoteLogSegmentMetadata.
>
> It will be useful to have metadata for RSM to fetch log segment. It
> may create location/path using id with other metadata too.
>
> 5102.2 In fetchLogSegmentData(), should we use long instead of Long?
>
> Wanted to keep endPosition as optional to read till the end of the
> segment and avoid sentinels.
>
> 5102.3 Why only some of the methods have default implementation and others
> Don't?
>
> Actually,  RSM will not have any default implementations. Those 3
> methods were made default earlier for tests etc. Updated the wiki.
>
> 5102.4. Could we define RemoteLogSegmentMetadataUpdate
> and DeletePartitionUpdate?
>
> Sure, they will be added.
>
>
> 5102.5 LogSegmentData: It seems that it's easier to pass
> in leaderEpochIndex as a ByteBuffer or byte array than a file since it will
> be generated in memory.
>
> Right, this is in plan.
>
> 5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset and
> startOffset. For example, deleteRecords() could move the startOffset to the
> middle of a segment. If we copy the full segment to remote storage, the
> baseOffset and the startOffset will be different.
>
> Good point. startOffset is baseOffset by default, if not set explicitly.
>
> 5102.7 Could we define all the public methods for RemoteLogSegmentMetadata
> and LogSegmentData?
>
> Sure, updated the wiki.
>
> 5102.8 Could we document whether endOffset in RemoteLogSegmentMetadata is
> inclusive/exclusive?
>
> It is inclusive, will update.
>
> 5103. configs:
> 5103.1 Could 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-20 Thread Kowshik Prakasam
Hi Harsha/Satish,

Hope you are doing well. Would you be able to please update the meeting
notes section for the most recent 2 meetings (from 10/13 and 11/10)? It
will be useful to share the context with the community.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes


Cheers,
Kowshik


On Tue, Nov 10, 2020 at 11:39 PM Kowshik Prakasam 
wrote:

> Hi Harsha,
>
> The goal we discussed is to aim for preview in AK 3.0. In order to get us
> there, it will be useful to think about the order in which the code changes
> will be implemented, reviewed and merged. Since you are driving the
> development, do you want to layout the order of things? For example, do you
> eventually want to break up the PR into multiple smaller ones? If so, you
> could list the milestones there. Another perspective is that this can be
> helpful to budget time suitably and to understand the progress.
> Let us know how we can help.
>
>
> Cheers,
> Kowshik
>
> On Tue, Nov 10, 2020 at 3:26 PM Harsha Chintalapani 
> wrote:
>
>> Thanks Kowshik for the link. Seems reasonable; as we discussed on the
>> call, the code and completion of this KIP will be taken up by us.
>> Regarding Milestone 2, what do you think needs to be clarified there?
>> I believe what we are promising in the KIP, along with unit tests and system
>> tests, will be delivered, and we can call that a preview. We will be
>> running this in our production and continue to provide the data and
>> metrics to push this feature to GA.
>>
>>
>>
>> On Tue, Nov 10, 2020 at 10:07 AM, Kowshik Prakasam <
>> kpraka...@confluent.io>
>> wrote:
>>
>> > Hi Harsha/Satish,
>> >
>> > Thanks for the discussion today. Here is a link to the KIP-405
>>  development
>> > milestones google doc we discussed in the meeting today:
>> > https://docs.google.com/document/d/1B5_jaZvWWb2DUpgbgImq0k_IPZ4DWrR8Ru7YpuJrXdc/edit
>> > . I have shared it with you. Please have a look and share your
>> > feedback/improvements. As we discussed, things are clear until
>> milestone 1.
>> > Beyond that, we can discuss it again (perhaps in next sync or later),
>> once
>> > you have thought through the implementation plan/milestones and release
>> > into preview in 3.0.
>> >
>> > Cheers,
>> > Kowshik
>> >
>> > On Tue, Nov 10, 2020 at 6:56 AM Satish Duggana <
>> satish.dugg...@gmail.com>
>> > wrote:
>> >
>> > Hi Jun,
>> > Thanks for your comments. Please find the inline replies below.
>> >
>> > 605.2 "Build the local leader epoch cache by cutting the leader epoch
>> > sequence received from remote storage to [LSO, ELO]." I mentioned an
>> issue
>> > earlier. Suppose the leader's local start offset is 100. The follower
>> finds
>> > a remote segment covering offset range [80, 120). The producerState with
>> > this remote segment is up to offset 120. To trim the producerState to
>> > offset 100 requires more work since one needs to download the previous
>> > producerState up to offset 80 and then replay the messages from 80 to
>> 100.
>> > It seems that it's simpler in this case for the follower just to take
>> the
>> > remote segment as it is and start fetching from offset 120.
>> >
>> > We chose that approach to avoid any edge cases here. It may be possible
>> > that the remote log segment that is received may not have the same leader
>> > epoch sequence from 100-120 as it contains on the leader (this can happen
>> > due to unclean leader). It is safe to start from what the leader returns
>> > here. Another way is to find the remote log segment
>> >
>> > 5016. Just to echo what Kowshik was saying. It seems that
>> > RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
>> > partition, not on the replicas for the __remote_log_segment_metadata
>> > partition. It's not clear how the leader of
>> __remote_log_segment_metadata
>> > obtains the metadata for remote segments for deletion.
>> >
>> > RLMM will always receive the callback for the remote log metadata topic
>> > partitions hosted on the local broker and these will be subscribed. I
>> will
>> > make this clear in the KIP.
>> >
>> > 5100. KIP-516  has been
>> accepted and is being implemented now. Could you
>> > update the KIP based on topicID?
>> >
>> > We mentioned KIP-516 
>> and how it helps. We will update this KIP with all
>> > the changes it brings with KIP-516
>> .
>> >
>> > 5101. RLMM: It would be useful to clarify how the following two APIs are
>> > used. According to the wiki, the former is used for topic deletion and
>> the
>> > latter is used for retention. It seems that retention should use the
>> former
>> > since remote segments without a matching epoch in the leader
>> (potentially
>> > due to unclean leader election) also need to be garbage collected. The
>> > latter seems to be used 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-10 Thread Kowshik Prakasam
Hi Harsha,

The goal we discussed is to aim for preview in AK 3.0. In order to get us
there, it will be useful to think about the order in which the code changes
will be implemented, reviewed and merged. Since you are driving the
development, do you want to layout the order of things? For example, do you
eventually want to break up the PR into multiple smaller ones? If so, you
could list the milestones there. Another perspective is that this can be
helpful to budget time suitably and to understand the progress.
Let us know how we can help.


Cheers,
Kowshik

On Tue, Nov 10, 2020 at 3:26 PM Harsha Chintalapani  wrote:

> Thanks Kowshik for the link. Seems reasonable; as we discussed on the
> call, the code and completion of this KIP will be taken up by us.
> Regarding Milestone 2, what do you think needs to be clarified there?
> I believe what we are promising in the KIP, along with unit tests and system
> tests, will be delivered, and we can call that a preview. We will be
> running this in our production and continue to provide the data and metrics
> to push this feature to GA.
>
>
>
> On Tue, Nov 10, 2020 at 10:07 AM, Kowshik Prakasam  >
> wrote:
>
> > Hi Harsha/Satish,
> >
> > Thanks for the discussion today. Here is a link to the KIP-405
>  development
> > milestones google doc we discussed in the meeting today:
> > https://docs.google.com/document/d/1B5_jaZvWWb2DUpgbgImq0k_IPZ4DWrR8Ru7YpuJrXdc/edit
> > . I have shared it with you. Please have a look and share your
> > feedback/improvements. As we discussed, things are clear until milestone
> 1.
> > Beyond that, we can discuss it again (perhaps in next sync or later),
> once
> > you have thought through the implementation plan/milestones and release
> > into preview in 3.0.
> >
> > Cheers,
> > Kowshik
> >
> > On Tue, Nov 10, 2020 at 6:56 AM Satish Duggana  >
> > wrote:
> >
> > Hi Jun,
> > Thanks for your comments. Please find the inline replies below.
> >
> > 605.2 "Build the local leader epoch cache by cutting the leader epoch
> > sequence received from remote storage to [LSO, ELO]." I mentioned an
> issue
> > earlier. Suppose the leader's local start offset is 100. The follower
> finds
> > a remote segment covering offset range [80, 120). The producerState with
> > this remote segment is up to offset 120. To trim the producerState to
> > offset 100 requires more work since one needs to download the previous
> > producerState up to offset 80 and then replay the messages from 80 to
> 100.
> > It seems that it's simpler in this case for the follower just to take the
> > remote segment as it is and start fetching from offset 120.
> >
> > We chose that approach to avoid any edge cases here. It may be possible
> > that the remote log segment that is received may not have the same leader
> > epoch sequence from 100-120 as it contains on the leader (this can happen
> > due to unclean leader). It is safe to start from what the leader returns
> > here. Another way is to find the remote log segment
> >
> > 5016. Just to echo what Kowshik was saying. It seems that
> > RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
> > partition, not on the replicas for the __remote_log_segment_metadata
> > partition. It's not clear how the leader of __remote_log_segment_metadata
> > obtains the metadata for remote segments for deletion.
> >
> > RLMM will always receive the callback for the remote log metadata topic
> > partitions hosted on the local broker and these will be subscribed. I
> will
> > make this clear in the KIP.
> >
> > 5100. KIP-516  has been
> accepted and is being implemented now. Could you
> > update the KIP based on topicID?
> >
> > We mentioned KIP-516 
> and how it helps. We will update this KIP with all
> > the changes it brings with KIP-516
> .
> >
> > 5101. RLMM: It would be useful to clarify how the following two APIs are
> > used. According to the wiki, the former is used for topic deletion and
> the
> > latter is used for retention. It seems that retention should use the
> former
> > since remote segments without a matching epoch in the leader (potentially
> > due to unclean leader election) also need to be garbage collected. The
> > latter seems to be used for the new leader to determine the last tiered
> > segment.
> > default Iterator
> > listRemoteLogSegments(TopicPartition topicPartition)
> > Iterator listRemoteLogSegments(TopicPartition
> > topicPartition, long leaderEpoch);
> >
> > Right, that is what we are currently doing. We will update the javadocs
> > and wiki with that. Earlier, we did not want to remove the segments which
> > are not matched with leader epochs from the leader partition as they may be
> > used later by a replica which can become a leader (unclean leader election)
> > and refer to those segments. But that may leak 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-10 Thread Harsha Chintalapani
Thanks Kowshik for the link. Seems reasonable; as we discussed on the
call, the code and completion of this KIP will be taken up by us.
Regarding Milestone 2, what do you think needs to be clarified there?
I believe what we are promising in the KIP, along with unit tests and system
tests, will be delivered, and we can call that a preview. We will be
running this in our production and continue to provide the data and metrics
to push this feature to GA.



On Tue, Nov 10, 2020 at 10:07 AM, Kowshik Prakasam 
wrote:

> Hi Harsha/Satish,
>
> Thanks for the discussion today. Here is a link to the KIP-405 development
> milestones google doc we discussed in the meeting today:
> https://docs.google.com/document/d/1B5_jaZvWWb2DUpgbgImq0k_IPZ4DWrR8Ru7YpuJrXdc/edit
> . I have shared it with you. Please have a look and share your
> feedback/improvements. As we discussed, things are clear until milestone 1.
> Beyond that, we can discuss it again (perhaps in next sync or later), once
> you have thought through the implementation plan/milestones and release
> into preview in 3.0.
>
> Cheers,
> Kowshik
>
> On Tue, Nov 10, 2020 at 6:56 AM Satish Duggana 
> wrote:
>
> Hi Jun,
> Thanks for your comments. Please find the inline replies below.
>
> 605.2 "Build the local leader epoch cache by cutting the leader epoch
> sequence received from remote storage to [LSO, ELO]." I mentioned an issue
> earlier. Suppose the leader's local start offset is 100. The follower finds
> a remote segment covering offset range [80, 120). The producerState with
> this remote segment is up to offset 120. To trim the producerState to
> offset 100 requires more work since one needs to download the previous
> producerState up to offset 80 and then replay the messages from 80 to 100.
> It seems that it's simpler in this case for the follower just to take the
> remote segment as it is and start fetching from offset 120.
>
> We chose that approach to avoid any edge cases here. It may be possible
> that the remote log segment that is received may not have the same leader
> epoch sequence from 100-120 as it contains on the leader (this can happen
> due to an unclean leader election). It is safe to start from what the leader
> returns here. Another way is to find the remote log segment
>
> 5016. Just to echo what Kowshik was saying. It seems that
> RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
> partition, not on the replicas for the __remote_log_segment_metadata
> partition. It's not clear how the leader of __remote_log_segment_metadata
> obtains the metadata for remote segments for deletion.
>
> RLMM will always receive the callback for the remote log metadata topic
> partitions hosted on the local broker and these will be subscribed. I will
> make this clear in the KIP.
>
> 5100. KIP-516 has been accepted and is being implemented now. Could you
> update the KIP based on topicID?
>
> We mentioned KIP-516 and how it helps. We will update this KIP with all
> the changes it brings with KIP-516.
>
> 5101. RLMM: It would be useful to clarify how the following two APIs are
> used. According to the wiki, the former is used for topic deletion and the
> latter is used for retention. It seems that retention should use the former
> since remote segments without a matching epoch in the leader (potentially
> due to unclean leader election) also need to be garbage collected. The
> latter seems to be used for the new leader to determine the last tiered
> segment.
> default Iterator<RemoteLogSegmentMetadata>
> listRemoteLogSegments(TopicPartition topicPartition)
> Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition
> topicPartition, long leaderEpoch);
>
> Right, that is what we are currently doing. We will update the javadocs
> and wiki with that. Earlier, we did not want to remove the segments which
> are not matched with leader epochs from the leader partition, as they may be
> used later by a replica which can become a leader (unclean leader election)
> and refer to those segments. But that may leak these segments in remote
> storage for the lifetime of the topic. We decided to clean up the oldest
> segments in the case of size-based retention as well.
>
> 5102. RSM:
> 5102.1 For methods like fetchLogSegmentData(), it seems that they can use
> RemoteLogSegmentId instead of RemoteLogSegmentMetadata.
>
> It will be useful for RSM to have the full metadata when fetching a log
> segment. It may build the location/path using the id along with other metadata.
>
> 5102.2 In fetchLogSegmentData(), should we use long instead of Long?
>
> We wanted to keep endPosition optional, so a caller can read till the end of
> the segment without resorting to sentinel values.
>
> 5102.3 Why do only some of the methods have default implementations and
> others don't?
>
> Actually, RSM will not have any default implementations. Those 3 methods
> were made default earlier for tests etc. Updated the wiki.
>
> 5102.4. Could we define RemoteLogSegmentMetadataUpdate and
> DeletePartitionUpdate?
>
> Sure, they will be added.
>
> 5102.5 LogSegmentData: It seems that it's easier to pass 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-10 Thread Kowshik Prakasam
Hi Harsha/Satish,

Thanks for the discussion today. Here is a link to the KIP-405 development
milestones google doc we discussed in the meeting today:
https://docs.google.com/document/d/1B5_jaZvWWb2DUpgbgImq0k_IPZ4DWrR8Ru7YpuJrXdc/edit
. I have shared it with you. Please have a look and share your
feedback/improvements. As we discussed, things are clear until milestone 1.
Beyond that, we can discuss it again (perhaps in next sync or later), once
you have thought through the implementation plan/milestones and release
into preview in 3.0.


Cheers,
Kowshik


On Tue, Nov 10, 2020 at 6:56 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for your comments. Please find the inline replies below.
>
> 605.2 "Build the local leader epoch cache by cutting the leader epoch
> sequence received from remote storage to [LSO, ELO]." I mentioned an issue
> earlier. Suppose the leader's local start offset is 100. The follower finds
> a remote segment covering offset range [80, 120). The producerState with
> this remote segment is up to offset 120. To trim the producerState to
> offset 100 requires more work since one needs to download the previous
> producerState up to offset 80 and then replay the messages from 80 to 100.
> It seems that it's simpler in this case for the follower just to take the
> remote segment as it is and start fetching from offset 120.
>
> We chose that approach to avoid any edge cases here. It may be
> possible that the remote log segment that is received may not have the
> same leader epoch sequence from 100-120 as it contains on the
> leader (this can happen due to an unclean leader election). It is safe to start
> from what the leader returns here. Another way is to find the remote
> log segment
>
> 5016. Just to echo what Kowshik was saying. It seems that
> RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
> partition, not on the replicas for the __remote_log_segment_metadata
> partition. It's not clear how the leader of __remote_log_segment_metadata
> obtains the metadata for remote segments for deletion.
>
> RLMM will always receive the callback for the remote log metadata
> topic partitions hosted on the local broker and these will be
> subscribed. I will make this clear in the KIP.
>
> 5100. KIP-516 has been accepted and is being implemented now. Could you
> update the KIP based on topicID?
>
> We mentioned KIP-516 and how it helps. We will update this KIP with
> all the changes it brings with KIP-516.
>
> 5101. RLMM: It would be useful to clarify how the following two APIs are
> used. According to the wiki, the former is used for topic deletion and the
> latter is used for retention. It seems that retention should use the former
> since remote segments without a matching epoch in the leader (potentially
> due to unclean leader election) also need to be garbage collected. The
> latter seems to be used for the new leader to determine the last tiered
> segment.
> default Iterator<RemoteLogSegmentMetadata>
> listRemoteLogSegments(TopicPartition topicPartition)
> Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition
> topicPartition, long leaderEpoch);
>
> Right, that is what we are currently doing. We will update the
> javadocs and wiki with that. Earlier, we did not want to remove the
> segments which are not matched with leader epochs from the leader
> partition, as they may be used later by a replica which can become a
> leader (unclean leader election) and refer to those segments. But that
> may leak these segments in remote storage for the lifetime of the topic. We
> decided to clean up the oldest segments in the case of size-based
> retention as well.
>
> 5102. RSM:
> 5102.1 For methods like fetchLogSegmentData(), it seems that they can
> use RemoteLogSegmentId instead of RemoteLogSegmentMetadata.
>
> It will be useful for RSM to have the full metadata when fetching a log
> segment. It may build the location/path using the id along with other metadata.
>
> 5102.2 In fetchLogSegmentData(), should we use long instead of Long?
>
> We wanted to keep endPosition optional, so a caller can read till the end of
> the segment without resorting to sentinel values.
>
> 5102.3 Why do only some of the methods have default implementations and
> others don't?
>
> Actually, RSM will not have any default implementations. Those 3
> methods were made default earlier for tests etc. Updated the wiki.
>
> 5102.4. Could we define RemoteLogSegmentMetadataUpdate
> and DeletePartitionUpdate?
>
> Sure, they will be added.
>
>
> 5102.5 LogSegmentData: It seems that it's easier to pass
> in leaderEpochIndex as a ByteBuffer or byte array than a file since it will
> be generated in memory.
>
> Right, this is in plan.
>
> 5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset and
> startOffset. For example, deleteRecords() could move the startOffset to the
> middle of a segment. If we copy the full segment to remote storage, the
> baseOffset and the startOffset will be different.
>
> Good point. startOffset is baseOffset by default, if not set explicitly.
>
> 5102.7 Could we define all the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-10 Thread Satish Duggana
Hi Jun,
Thanks for your comments. Please find the inline replies below.

605.2 "Build the local leader epoch cache by cutting the leader epoch
sequence received from remote storage to [LSO, ELO]." I mentioned an issue
earlier. Suppose the leader's local start offset is 100. The follower finds
a remote segment covering offset range [80, 120). The producerState with
this remote segment is up to offset 120. To trim the producerState to
offset 100 requires more work since one needs to download the previous
producerState up to offset 80 and then replay the messages from 80 to 100.
It seems that it's simpler in this case for the follower just to take the
remote segment as it is and start fetching from offset 120.

We chose that approach to avoid any edge cases here. It may be
possible that the remote log segment that is received may not have the
same leader epoch sequence from 100-120 as it contains on the
leader (this can happen due to an unclean leader election). It is safe to start
from what the leader returns here. Another way is to find the remote
log segment

5016. Just to echo what Kowshik was saying. It seems that
RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
partition, not on the replicas for the __remote_log_segment_metadata
partition. It's not clear how the leader of __remote_log_segment_metadata
obtains the metadata for remote segments for deletion.

RLMM will always receive the callback for the remote log metadata
topic partitions hosted on the local broker and these will be
subscribed. I will make this clear in the KIP.

5100. KIP-516 has been accepted and is being implemented now. Could you
update the KIP based on topicID?

We mentioned KIP-516 and how it helps. We will update this KIP with
all the changes it brings with KIP-516.

5101. RLMM: It would be useful to clarify how the following two APIs are
used. According to the wiki, the former is used for topic deletion and the
latter is used for retention. It seems that retention should use the former
since remote segments without a matching epoch in the leader (potentially
due to unclean leader election) also need to be garbage collected. The
latter seems to be used for the new leader to determine the last tiered
segment.
default Iterator<RemoteLogSegmentMetadata>
listRemoteLogSegments(TopicPartition topicPartition)
Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition
topicPartition, long leaderEpoch);

Right, that is what we are currently doing. We will update the
javadocs and wiki with that. Earlier, we did not want to remove the
segments which are not matched with leader epochs from the leader
partition, as they may be used later by a replica which can become a
leader (unclean leader election) and refer to those segments. But that
may leak these segments in remote storage for the lifetime of the topic. We
decided to clean up the oldest segments in the case of size-based
retention as well.
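
To make the intended usage concrete, here is a rough sketch of how the two
listings could be documented, based only on this thread; the exact signatures,
package, and javadoc wording in the KIP may differ.

import java.util.Iterator;
import org.apache.kafka.common.TopicPartition;

public interface RemoteLogMetadataManager {

    /**
     * Lists metadata for all remote segments of the partition, across all leader
     * epochs. Used for topic deletion and for retention, so that segments whose
     * epochs no longer appear in the current leader's epoch cache (e.g. after an
     * unclean leader election) are still garbage collected.
     */
    Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition topicPartition);

    /**
     * Lists metadata only for remote segments that contain the given leader epoch.
     * Used by a new leader to determine the last tiered segment for that epoch.
     */
    Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition topicPartition,
                                                             long leaderEpoch);
}

(Other RLMM methods are omitted; RemoteLogSegmentMetadata is the metadata class
discussed under 5102.6 below.)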

5102. RSM:
5102.1 For methods like fetchLogSegmentData(), it seems that they can
use RemoteLogSegmentId instead of RemoteLogSegmentMetadata.

It will be useful for RSM to have the full metadata when fetching a log
segment. It may build the location/path using the id along with other metadata.

5102.2 In fetchLogSegmentData(), should we use long instead of Long?

We wanted to keep endPosition optional, so a caller can read till the end of
the segment without resorting to sentinel values.
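
For illustration, a sketch of the fetch method shape implied by 5102.1 and
5102.2; the parameter names, return type and exception type here are
assumptions made for this sketch, not the KIP's final interface.

import java.io.InputStream;

public interface RemoteStorageManager {

    /**
     * Fetches segment data described by the given metadata. Passing the full
     * RemoteLogSegmentMetadata (rather than only the RemoteLogSegmentId) lets an
     * implementation build the remote location/path from the id plus the other
     * metadata it carries.
     *
     * endPosition is a Long rather than a long so that a null value can mean
     * "read until the end of the segment" without a sentinel value.
     */
    InputStream fetchLogSegmentData(RemoteLogSegmentMetadata segmentMetadata,
                                    Long startPosition,
                                    Long endPosition) throws RemoteStorageException;
}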

5102.3 Why do only some of the methods have default implementations and
others don't?

Actually, RSM will not have any default implementations. Those 3
methods were made default earlier for tests etc. Updated the wiki.

5102.4. Could we define RemoteLogSegmentMetadataUpdate
and DeletePartitionUpdate?

Sure, they will be added.


5102.5 LogSegmentData: It seems that it's easier to pass
in leaderEpochIndex as a ByteBuffer or byte array than a file since it will
be generated in memory.

Right, this is in plan.

5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset and
startOffset. For example, deleteRecords() could move the startOffset to the
middle of a segment. If we copy the full segment to remote storage, the
baseOffset and the startOffset will be different.

Good point. startOffset is baseOffset by default, if not set explicitly.
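
A rough sketch of what 5102.5 and 5102.6 (plus the optional transaction index
discussed later in this thread) could translate to; all field names here are
illustrative assumptions, not the final public classes.

import java.io.File;
import java.nio.ByteBuffer;

// Placeholder for the id type defined in the KIP.
class RemoteLogSegmentId { }

class LogSegmentData {
    File logSegment;
    File offsetIndex;
    File timeIndex;
    File txnIndex;                 // optional; copied only if it exists
    File producerSnapshot;
    ByteBuffer leaderEpochIndex;   // built in memory, so passed as a buffer rather than a file
    // constructors/getters omitted
}

class RemoteLogSegmentMetadata {
    RemoteLogSegmentId remoteLogSegmentId;
    long baseOffset;               // first offset physically present in the copied segment
    long startOffset;              // logical start; can move past baseOffset after deleteRecords()
    long endOffset;                // inclusive (see 5102.8)
    // leader epochs, timestamps, state, etc. omitted
}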

5102.7 Could we define all the public methods for RemoteLogSegmentMetadata
and LogSegmentData?

Sure, updated the wiki.

5102.8 Could we document whether endOffset in RemoteLogSegmentMetadata is
inclusive/exclusive?

It is inclusive, will update.

5103. configs:
5103.1 Could we define the default value of non-required configs (e.g the
size of new thread pools)?

Sure, that makes sense.

5103.2 It seems that local.log.retention.ms should default to retention.ms,
instead of remote.log.retention.minutes. Similarly, it seems
that local.log.retention.bytes should default to segment.bytes.

Right, we do not have remote.log.retention as we discussed earlier.
Thanks for catching the typo.
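
As a purely illustrative example of how these settings relate (names as used
in this thread; the values and the final defaults are not decided by this
exchange):

# Example topic/broker configuration (illustrative values only).
remote.log.storage.enable=true
# Total retention for the topic, across the local and remote tiers.
retention.ms=604800000
# How long/how much data is kept on local disk once segments have been
# copied to remote storage.
local.log.retention.ms=86400000
local.log.retention.bytes=1073741824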

5103.3 remote.log.manager.thread.pool.size: The description says "used in
scheduling tasks to copy segments, fetch remote log 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-06 Thread Kowshik Prakasam
Hi Satish,

Thanks for your response.

5015. That makes sense, good point.

5019 and 5020. My 2 cents is that while you are implementing it, it will be
useful to update the KIP with details about the RocksDB-based design that
you envision. This will facilitate the discussions.


Cheers,
Kowshik



On Fri, Nov 6, 2020 at 5:45 AM Satish Duggana 
wrote:

> Hi Kowshik,
> Thanks for your comments.
>
> 5012. In the RemoteStorageManager interface, there is an API defined for
> each file type. For example, fetchOffsetIndex, fetchTimestampIndex etc. To
> avoid the duplication, I'd suggest we can instead have a FileType enum and
> a common get API based on the FileType.
>
> That is a good point. We can make the suggested changes.
>
>
> 5014. There are some TODO sections in the KIP. Would these be filled up in
> future iterations?
>
> Right.
>
> 5015. Under "Topic deletion lifecycle", I'm trying to understand why we
> need delete_partition_marked as well as the delete_partition_started
> messages. I couldn't spot a drawback if supposing we simplified the design
> such that the controller would only write the delete_partition_started message,
> and the RemoteLogCleaner (RLC) instance picks it up for processing. What am I
> missing?
>
> Having the delete_partition_marked event does not add any complexity, but
> it gives an audit trail of the source of the respective action. IMHO, removing
> it does not make the design simpler.
>
> 5016. Under "Topic deletion lifecycle", step (4) is mentioned as "RLC gets
> all the remote log segments for the partition and each of these remote log
> segments is deleted with the next steps.". Since the RLC instance runs on
> each tier topic partition leader, how does the RLC then get the list of
> remote log segments to be deleted? It will be useful to add that detail to
> the KIP.
>
> Sure, we will address that in the KIP.
>
> 5017. Under "Public Interfaces -> Configs", there is a line mentioning "We
> will support flipping remote.log.storage.enable in next versions." It will
> be useful to mention this in the "Future Work" section of the KIP too.
>
> That makes sense. Will add that in future work items.
>
> 5018. The KIP introduces a number of configuration parameters. It will be
> useful to mention in the KIP if the user should assume these as static
> configuration in the server.properties file, or dynamic configuration which
> can be modified without restarting the broker.
>
> As discussed earlier, we will update with the config types.
>
> 5019.  Maybe this is planned as a future update to the KIP, but I thought
> I'd mention it here. Could you please add details to the KIP on why RocksDB
> was chosen as the default cache implementation of RLMM, and how it is going
> to be used? Were alternatives compared/considered? For example, it would be
> useful to explain/evaluate the following: 1) debuggability of the RocksDB
> JNI interface, 2) performance, 3) portability across platforms and 4)
> interface parity of RocksDB’s JNI api with its underlying C/C++ api.
>
> RocksDB is widely used in Kafka Streams. We were evaluating RocksDB
> and a custom file store. A custom file store adds a lot of complexity in
> maintaining and compacting the files, whereas RocksDB already
> provides the required features and is simple to use. We are
> working on the RocksDB implementation with a couple of approaches and we
> will update the results once we are done.
>
> 5020. Following up on (5019), for the RocksDB cache, it will be useful to
> explain the relationship/mapping between the following in the KIP: 1) # of
> tiered partitions, 2) # of partitions of metadata topic
> __remote_log_metadata and 3) # of RocksDB instances. i.e. is the plan to
> have a RocksDB instance per tiered partition, or per metadata topic
> partition, or just 1 for per broker?
>
> We are exploring having no more than 2 instances per broker.
>
> 5021. I was looking at the implementation prototype (PR link:
> https://github.com/apache/kafka/pull/7561). It seems that a boolean
> attribute is being introduced into the Log layer to check if remote log
> capability is enabled. While the boolean footprint is small at the moment,
> this can easily grow in the future and become harder to
> test/maintain, considering that the Log layer is already pretty complex. We
> should start thinking about how to manage such changes to the Log layer
> (for the purpose of improved testability, better separation of concerns and
> readability). One proposal I have is to take a step back and define a
> higher level Log interface. Then, the Broker code can be changed to use
> this interface. It can be changed such that only a handle to the interface
> is exposed to other components (such as LogCleaner, ReplicaManager etc.)
> and not the underlying Log object. This approach keeps the user of the Log
> layer agnostic of the whereabouts of the data. Underneath the interface,
> the implementing classes can completely separate local log capabilities
> from the remote log. For example, 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-06 Thread Satish Duggana
Hi Kowshik,
Thanks for your comments.

5012. In the RemoteStorageManager interface, there is an API defined for
each file type. For example, fetchOffsetIndex, fetchTimestampIndex etc. To
avoid the duplication, I'd suggest we can instead have a FileType enum and
a common get API based on the FileType.

That is a good point. We can make the suggested changes.


5014. There are some TODO sections in the KIP. Would these be filled up in
future iterations?

Right.

5015. Under "Topic deletion lifecycle", I'm trying to understand why we
need delete_partition_marked as well as the delete_partition_started
messages. I couldn't spot a drawback if supposing we simplified the design
such that the controller would only write the delete_partition_started message,
and the RemoteLogCleaner (RLC) instance picks it up for processing. What am I
missing?

Having the delete_partition_marked event does not add any complexity, but
it gives an audit trail of the source of the respective action. IMHO, removing
it does not make the design simpler.
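
For readers following the lifecycle, a small sketch of the states implied by
these events; the exact enum name and the FINISHED state are assumptions that
mirror the marked/started events above.

public enum RemotePartitionDeleteState {
    DELETE_PARTITION_MARKED,    // the controller marks the partition for deletion (audit of the source)
    DELETE_PARTITION_STARTED,   // the RemoteLogCleaner picks it up and starts deleting remote segments
    DELETE_PARTITION_FINISHED   // all remote segments of the partition have been removed
}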

5016. Under "Topic deletion lifecycle", step (4) is mentioned as "RLC gets
all the remote log segments for the partition and each of these remote log
segments is deleted with the next steps.". Since the RLC instance runs on
each tier topic partition leader, how does the RLC then get the list of
remote log segments to be deleted? It will be useful to add that detail to
the KIP.

Sure, we will address that in the KIP.

5017. Under "Public Interfaces -> Configs", there is a line mentioning "We
will support flipping remote.log.storage.enable in next versions." It will
be useful to mention this in the "Future Work" section of the KIP too.

That makes sense. Will add that in future work items.

5018. The KIP introduces a number of configuration parameters. It will be
useful to mention in the KIP if the user should assume these as static
configuration in the server.properties file, or dynamic configuration which
can be modified without restarting the broker.

As discussed earlier, we will update with the config types.

5019.  Maybe this is planned as a future update to the KIP, but I thought
I'd mention it here. Could you please add details to the KIP on why RocksDB
was chosen as the default cache implementation of RLMM, and how it is going
to be used? Were alternatives compared/considered? For example, it would be
useful to explain/evaluate the following: 1) debuggability of the RocksDB
JNI interface, 2) performance, 3) portability across platforms and 4)
interface parity of RocksDB’s JNI api with its underlying C/C++ api.

RocksDB is widely used in Kafka Streams. We were evaluating RocksDB
and a custom file store. A custom file store adds a lot of complexity in
maintaining and compacting the files, whereas RocksDB already
provides the required features and is simple to use. We are
working on the RocksDB implementation with a couple of approaches and we
will update the results once we are done.

5020. Following up on (5019), for the RocksDB cache, it will be useful to
explain the relationship/mapping between the following in the KIP: 1) # of
tiered partitions, 2) # of partitions of metadata topic
__remote_log_metadata and 3) # of RocksDB instances. i.e. is the plan to
have a RocksDB instance per tiered partition, or per metadata topic
partition, or just 1 for per broker?

We are exploring having no more than 2 instances per broker.

5021. I was looking at the implementation prototype (PR link:
https://github.com/apache/kafka/pull/7561). It seems that a boolean
attribute is being introduced into the Log layer to check if remote log
capability is enabled. While the boolean footprint is small at the moment,
this can easily grow in the future and become harder to
test/maintain, considering that the Log layer is already pretty complex. We
should start thinking about how to manage such changes to the Log layer
(for the purpose of improved testability, better separation of concerns and
readability). One proposal I have is to take a step back and define a
higher level Log interface. Then, the Broker code can be changed to use
this interface. It can be changed such that only a handle to the interface
is exposed to other components (such as LogCleaner, ReplicaManager etc.)
and not the underlying Log object. This approach keeps the user of the Log
layer agnostic of the whereabouts of the data. Underneath the interface,
the implementing classes can completely separate local log capabilities
from the remote log. For example, the Log class can be simplified to only
manage logic surrounding local log segments and metadata. Additionally, a
wrapper class can be provided (implementing the higher level Log interface)
which will contain any/all logic surrounding tiered data. The wrapper
class will wrap around an instance of the Log class delegating the local
log logic to it. Finally, a handle to the wrapper class can be exposed to
the other components wherever they need a handle to the higher level Log
interface.

It is still a draft 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-11-04 Thread Jun Rao
Hi, Satish,

Thanks for the updated KIP. A few more comments below.

605.2 "Build the local leader epoch cache by cutting the leader epoch
sequence received from remote storage to [LSO, ELO]." I mentioned an issue
earlier. Suppose the leader's local start offset is 100. The follower finds
a remote segment covering offset range [80, 120). The producerState with
this remote segment is up to offset 120. To trim the producerState to
offset 100 requires more work since one needs to download the previous
producerState up to offset 80 and then replay the messages from 80 to 100.
It seems that it's simpler in this case for the follower just to take the
remote segment as it is and start fetching from offset 120.

5016. Just to echo what Kowshik was saying. It seems that
RLMM.onPartitionLeadershipChanges() is only called on the replicas for a
partition, not on the replicas for the __remote_log_segment_metadata
partition. It's not clear how the leader of __remote_log_segment_metadata
obtains the metadata for remote segments for deletion.

5100. KIP-516 has been accepted and is being implemented now. Could you
update the KIP based on topicID?

5101. RLMM: It would be useful to clarify how the following two APIs are
used. According to the wiki, the former is used for topic deletion and the
latter is used for retention. It seems that retention should use the former
since remote segments without a matching epoch in the leader (potentially
due to unclean leader election) also need to be garbage collected. The
latter seems to be used for the new leader to determine the last tiered
segment.
default Iterator<RemoteLogSegmentMetadata>
listRemoteLogSegments(TopicPartition topicPartition)
Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition
topicPartition, long leaderEpoch);

5102. RSM:
5102.1 For methods like fetchLogSegmentData(), it seems that they can
use RemoteLogSegmentId instead of RemoteLogSegmentMetadata.
5102.2 In fetchLogSegmentData(), should we use long instead of Long?
5102.3 Why do only some of the methods have default implementations and others
don't?
5102.4. Could we define RemoteLogSegmentMetadataUpdate
and DeletePartitionUpdate?
5102.5 LogSegmentData: It seems that it's easier to pass
in leaderEpochIndex as a ByteBuffer or byte array than a file since it will
be generated in memory.
5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset and
startOffset. For example, deleteRecords() could move the startOffset to the
middle of a segment. If we copy the full segment to remote storage, the
baseOffset and the startOffset will be different.
5102.7 Could we define all the public methods for RemoteLogSegmentMetadata
and LogSegmentData?
5102.8 Could we document whether endOffset in RemoteLogSegmentMetadata is
inclusive/exclusive?

5103. configs:
5103.1 Could we define the default value of non-required configs (e.g the
size of new thread pools)?
5103.2 It seems that local.log.retention.ms should default to retention.ms,
instead of remote.log.retention.minutes. Similarly, it seems
that local.log.retention.bytes should default to segment.bytes.
5103.3 remote.log.manager.thread.pool.size: The description says "used in
scheduling tasks to copy segments, fetch remote log indexes and clean up
remote log segments". However, there is a separate
config remote.log.reader.threads for fetching remote data. It's weird to
fetch remote index and log in different thread pools since both are used
for serving fetch requests.
5103.4 remote.log.manager.task.interval.ms: Is that the amount of time to
back off when there is no work to do? If so, perhaps it can be renamed as
backoff.ms.
5103.5 Are rlm_process_interval_ms and rlm_retry_interval_ms configs? If
so, they need to be listed in this section.

5104. "RLM maintains a bounded cache(possibly LRU) of the index files of
remote log segments to avoid multiple index fetches from the remote
storage." Is the RLM in memory or on disk? If on disk, where is it stored?
Do we need a configuration to bound the size?

5105. The KIP uses local-log-start-offset and Earliest Local Offset in
different places. It would be useful to standardize the terminology.

5106. The section on "In BuildingRemoteLogAux state". It listed two options
without saying which option is chosen.

5107. Follower to leader transition: It has step 2, but not step 1.

5108. If a consumer fetches from the remote data and the remote storage is
not available, what error code is used in the fetch response?

5109. "ListOffsets: For timestamps >= 0, it returns the first message
offset whose timestamp is >= to the given timestamp in the request. That
means it checks in remote log time indexes first, after which local log
time indexes are checked." Could you document which method in RLMM is used
for this?

5110. Stopreplica: "it sets all the remote log segment metadata of that
partition with a delete marker and publishes them to RLMM." This seems
outdated given the new topic deletion logic.

5111. "RLM follower fetches the earliest offset for the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-10-27 Thread Kowshik Prakasam
Hi Satish,

Thanks for the updates to the KIP. Here are my first batch of
comments/suggestions on the latest version of the KIP.

5012. In the RemoteStorageManager interface, there is an API defined for
each file type. For example, fetchOffsetIndex, fetchTimestampIndex etc. To
avoid the duplication, I'd suggest we can instead have a FileType enum and
a common get API based on the FileType.

5013. There are some references to the Google doc in the KIP. I wasn't sure
if the Google doc is expected to be in sync with the contents of the wiki.
Going forward, it seems easier if just the KIP is maintained as the source
of truth. In this regard, could you please move all the references to the
Google doc, maybe to a separate References section at the bottom of the KIP?

5014. There are some TODO sections in the KIP. Would these be filled up in
future iterations?

5015. Under "Topic deletion lifecycle", I'm trying to understand why we
need delete_partition_marked as well as the delete_partition_started
messages. I couldn't spot a drawback if supposing we simplified the design
such that the controller would only write the delete_partition_started message,
and the RemoteLogCleaner (RLC) instance picks it up for processing. What am I
missing?

5016. Under "Topic deletion lifecycle", step (4) is mentioned as "RLC gets
all the remote log segments for the partition and each of these remote log
segments is deleted with the next steps.". Since the RLC instance runs on
each tier topic partition leader, how does the RLC then get the list of
remote log segments to be deleted? It will be useful to add that detail to
the KIP.

5017. Under "Public Interfaces -> Configs", there is a line mentioning "We
will support flipping remote.log.storage.enable in next versions." It will
be useful to mention this in the "Future Work" section of the KIP too.

5018. The KIP introduces a number of configuration parameters. It will be
useful to mention in the KIP if the user should assume these as static
configuration in the server.properties file, or dynamic configuration which
can be modified without restarting the broker.

5019.  Maybe this is planned as a future update to the KIP, but I thought
I'd mention it here. Could you please add details to the KIP on why RocksDB
was chosen as the default cache implementation of RLMM, and how it is going
to be used? Were alternatives compared/considered? For example, it would be
useful to explain/evaluate the following: 1) debuggability of the RocksDB
JNI interface, 2) performance, 3) portability across platforms and 4)
interface parity of RocksDB’s JNI api with its underlying C/C++ api.

5020. Following up on (5019), for the RocksDB cache, it will be useful to
explain the relationship/mapping between the following in the KIP: 1) # of
tiered partitions, 2) # of partitions of metadata topic
__remote_log_metadata and 3) # of RocksDB instances. i.e. is the plan to
have a RocksDB instance per tiered partition, or per metadata topic
partition, or just 1 for per broker?

5021. I was looking at the implementation prototype (PR link:
https://github.com/apache/kafka/pull/7561). It seems that a boolean
attribute is being introduced into the Log layer to check if remote log
capability is enabled. While the boolean footprint is small at the moment,
this can easily grow in the future and become harder to
test/maintain, considering that the Log layer is already pretty complex. We
should start thinking about how to manage such changes to the Log layer
(for the purpose of improved testability, better separation of concerns and
readability). One proposal I have is to take a step back and define a
higher level Log interface. Then, the Broker code can be changed to use
this interface. It can be changed such that only a handle to the interface
is exposed to other components (such as LogCleaner, ReplicaManager etc.)
and not the underlying Log object. This approach keeps the user of the Log
layer agnostic of the whereabouts of the data. Underneath the interface,
the implementing classes can completely separate local log capabilities
from the remote log. For example, the Log class can be simplified to only
manage logic surrounding local log segments and metadata. Additionally, a
wrapper class can be provided (implementing the higher level Log interface)
which will contain any/all logic surrounding tiered data. The wrapper
class will wrap around an instance of the Log class delegating the local
log logic to it. Finally, a handle to the wrapper class can be exposed to
the other components wherever they need a handle to the higher level Log
interface.
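
To make the proposal a bit more concrete, here is a minimal sketch of the
higher-level interface and the wrapper idea; the interface name, method set
and types below are invented purely for illustration and are not part of the
KIP or of the existing Log class.

// Placeholder for whatever a read returns; illustrative only.
class FetchedRecords { }

// Hypothetical higher-level log abstraction exposed to ReplicaManager, LogCleaner, etc.
interface UnifiedLog {
    long logStartOffset();
    long logEndOffset();
    FetchedRecords read(long startOffset, int maxBytes);
}

// Roughly what today's Log class does: local segments only.
class LocalLog implements UnifiedLog {
    @Override public long logStartOffset() { return 0L; }   // placeholder
    @Override public long logEndOffset() { return 0L; }     // placeholder
    @Override public FetchedRecords read(long startOffset, int maxBytes) {
        return new FetchedRecords();                        // served from local segments
    }
}

// Wrapper holding all tiered-storage logic, delegating local work to LocalLog.
class TieredLog implements UnifiedLog {
    private final LocalLog local;
    TieredLog(LocalLog local) { this.local = local; }
    @Override public long logStartOffset() { return 0L; }   // earliest offset, possibly in the remote tier
    @Override public long logEndOffset() { return local.logEndOffset(); }
    @Override public FetchedRecords read(long startOffset, int maxBytes) {
        if (startOffset < local.logStartOffset())
            return new FetchedRecords();                    // would be served from remote storage
        return local.read(startOffset, maxBytes);
    }
}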


Cheers,
Kowshik

On Mon, Oct 26, 2020 at 9:52 PM Satish Duggana 
wrote:

> Hi,
> KIP is updated with 1) topic deletion lifecycle and its related items
> 2) Protocol changes(mainly related to ListOffsets) and other minor
> changes.
> Please go through them and let us know your comments.
>
> Thanks,
> Satish.
>
> On Mon, Sep 28, 2020 at 9:10 PM Satish Duggana 
> wrote:
> >
> > Hi 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-10-26 Thread Satish Duggana
Hi,
KIP is updated with 1) topic deletion lifecycle and its related items
2) Protocol changes(mainly related to ListOffsets) and other minor
changes.
Please go through them and let us know your comments.

Thanks,
Satish.

On Mon, Sep 28, 2020 at 9:10 PM Satish Duggana  wrote:
>
> Hi Dhruvil,
> Thanks for looking into the KIP and sending your comments. Sorry for
> the late reply, missed it in the mail thread.
>
> 1. Could you describe how retention would work with this KIP and which
> threads are responsible for driving this work? I believe there are 3 kinds
> of retention processes we are looking at:
>   (a) Regular retention for data in tiered storage as per configured `
> retention.ms` / `retention.bytes`.
>   (b) Local retention for data in local storage as per configured `
> local.log.retention.ms` / `local.log.retention.bytes`
>   (c) Possibly regular retention for data in local storage, if the tiering
> task is lagging or for data that is below the log start offset.
>
> Local log retention is done by the existing log cleanup tasks. These
> are not done for segments that are not yet copied to remote storage.
> Remote log cleanup is done by the leader partition’s RLMTask.
>
> 2. When does a segment become eligible to be tiered? Is it as soon as the
> segment is rolled and the end offset is less than the last stable offset as
> mentioned in the KIP? I wonder if we need to consider other parameters too,
> like the highwatermark so that we are guaranteed that what we are tiering
> has been committed to the log and accepted by the ISR.
>
> AFAIK, last stable offset is always <= highwatermark. This will make
> sure we are always tiering the message segments which have been
> accepted by ISR and transactionally completed.
>
>
> 3. The section on "Follower Fetch Scenarios" is useful but is a bit
> difficult to parse at the moment. It would be useful to summarize the
> changes we need in the ReplicaFetcher.
>
> It may become difficult for users to read/follow if we add code changes here.
>
> 4. Related to the above, it's a bit unclear how we are planning on
> restoring the producer state for a new replica. Could you expand on that?
>
> It is mentioned in the KIP that BuildingRemoteLogAuxState is introduced to
> build the state, like the leader epoch sequence and producer snapshots,
> before it starts fetching the data from the leader. We will make it
> clear in the KIP.
>
>
> 5. Similarly, it would be worth summarizing the behavior on unclean leader
> election. There are several scenarios to consider here: data loss from
> local log, data loss from remote log, data loss from metadata topic, etc.
> It's worth describing these in detail.
>
> We mentioned the cases about unclean leader election in the follower
> fetch scenarios.
> If there are errors while fetching data from remote store or metadata
> store, it will work the same way as it works with local log. It
> returns the error back to the caller. Please let us know if I am
> missing your point here.
>
>
> 7. For a READ_COMMITTED FetchRequest, how do we retrieve and return the
> aborted transaction metadata?
>
> When a remote log segment is accessed for a fetch, we will fetch the aborted
> transactions along with the segment if they are not found in the local
> index cache. This includes the case of the transaction index not existing
> in the remote log segment. That means the cache entry can be empty or
> can have a list of aborted transactions.
>
>
> 8. The `LogSegmentData` class assumes that we have a log segment, offset
> index, time index, transaction index, producer snapshot and leader epoch
> index. How do we deal with cases where we do not have one or more of these?
> For example, we may not have a transaction index or producer snapshot for a
> particular segment. The former is optional, and the latter is only kept for
> up to the 3 latest segments.
>
> This is a good point; we discussed this in the last meeting.
> The transaction index is optional and we will copy it only if it exists.
> We want to keep all the producer snapshots at each log segment roll,
> and they can be removed once copying the log is successful, as long as the
> latest 3 segments still retain their snapshots. We only delete the producer
> snapshots which have been copied to remote log segments on the leader.
> The follower will keep the log segments beyond the segments which have not
> been copied to remote storage. We will update the KIP with these
> details.
>
> Thanks,
> Satish.
>
> On Thu, Sep 17, 2020 at 1:47 AM Dhruvil Shah  wrote:
> >
> > Hi Satish, Harsha,
> >
> > Thanks for the KIP. Few questions below:
> >
> > 1. Could you describe how retention would work with this KIP and which
> > threads are responsible for driving this work? I believe there are 3 kinds
> > of retention processes we are looking at:
> >   (a) Regular retention for data in tiered storage as per configured `
> > retention.ms` / `retention.bytes`.
> >   (b) Local retention for data in local storage as per configured `
> > local.log.retention.ms` / 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-09-28 Thread Satish Duggana
Hi Dhruvil,
Thanks for looking into the KIP and sending your comments. Sorry for
the late reply, missed it in the mail thread.

1. Could you describe how retention would work with this KIP and which
threads are responsible for driving this work? I believe there are 3 kinds
of retention processes we are looking at:
  (a) Regular retention for data in tiered storage as per configured `
retention.ms` / `retention.bytes`.
  (b) Local retention for data in local storage as per configured `
local.log.retention.ms` / `local.log.retention.bytes`
  (c) Possibly regular retention for data in local storage, if the tiering
task is lagging or for data that is below the log start offset.

Local log retention is done by the existing log cleanup tasks. These
are not applied to segments that have not yet been copied to remote storage.
Remote log cleanup is done by the leader partition’s RLMTask.
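
As a small illustration of that split in responsibilities (and of the
log.retention behaviour clarified under 603.2 later in this thread), the local
cleanup decision could be summarized roughly as follows; this is a sketch, not
the actual implementation:

final class LocalRetentionCheck {
    // A local segment is deletable once it has been copied to remote storage and
    // breaches the local retention limits (local.log.retention.ms/bytes), or, per
    // the 603.2 clarification, once the overall log.retention is breached even if
    // it has not been copied yet.
    static boolean canDeleteLocally(boolean copiedToRemote,
                                    boolean breachesLocalRetention,
                                    boolean breachesTotalRetention) {
        return breachesTotalRetention || (copiedToRemote && breachesLocalRetention);
    }
}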

2. When does a segment become eligible to be tiered? Is it as soon as the
segment is rolled and the end offset is less than the last stable offset as
mentioned in the KIP? I wonder if we need to consider other parameters too,
like the highwatermark so that we are guaranteed that what we are tiering
has been committed to the log and accepted by the ISR.

AFAIK, the last stable offset is always <= the high watermark. This will make
sure we are always tiering message segments which have been
accepted by the ISR and are transactionally complete.
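
Expressed as a condition (illustrative only), a rolled segment becomes a
candidate for tiering when its last offset is below the last stable offset,
which by the above is itself at or below the high watermark:

final class TieringEligibility {
    static boolean eligibleForTiering(long segmentLastOffset, long lastStableOffset) {
        // LSO <= high watermark, so anything below the LSO is replicated to the ISR
        // and transactionally complete.
        return segmentLastOffset < lastStableOffset;
    }
}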


3. The section on "Follower Fetch Scenarios" is useful but is a bit
difficult to parse at the moment. It would be useful to summarize the
changes we need in the ReplicaFetcher.

It may become difficult for users to read/follow if we add code changes here.

4. Related to the above, it's a bit unclear how we are planning on
restoring the producer state for a new replica. Could you expand on that?

It is mentioned in the KIP that BuildingRemoteLogAuxState is introduced to
build the state, like the leader epoch sequence and producer snapshots,
before it starts fetching the data from the leader. We will make it
clear in the KIP.


5. Similarly, it would be worth summarizing the behavior on unclean leader
election. There are several scenarios to consider here: data loss from
local log, data loss from remote log, data loss from metadata topic, etc.
It's worth describing these in detail.

We mentioned the cases about unclean leader election in the follower
fetch scenarios.
If there are errors while fetching data from remote store or metadata
store, it will work the same way as it works with local log. It
returns the error back to the caller. Please let us know if I am
missing your point here.


7. For a READ_COMMITTED FetchRequest, how do we retrieve and return the
aborted transaction metadata?

When a remote log segment is accessed for a fetch, we will fetch the aborted
transactions along with the segment if they are not found in the local
index cache. This includes the case of the transaction index not existing
in the remote log segment. That means the cache entry can be empty or
can have a list of aborted transactions.


8. The `LogSegmentData` class assumes that we have a log segment, offset
index, time index, transaction index, producer snapshot and leader epoch
index. How do we deal with cases where we do not have one or more of these?
For example, we may not have a transaction index or producer snapshot for a
particular segment. The former is optional, and the latter is only kept for
up to the 3 latest segments.

This is a good point; we discussed this in the last meeting.
The transaction index is optional and we will copy it only if it exists.
We want to keep all the producer snapshots at each log segment roll,
and they can be removed once copying the log is successful, as long as the
latest 3 segments still retain their snapshots. We only delete the producer
snapshots which have been copied to remote log segments on the leader.
The follower will keep the log segments beyond the segments which have not
been copied to remote storage. We will update the KIP with these
details.

Thanks,
Satish.

On Thu, Sep 17, 2020 at 1:47 AM Dhruvil Shah  wrote:
>
> Hi Satish, Harsha,
>
> Thanks for the KIP. Few questions below:
>
> 1. Could you describe how retention would work with this KIP and which
> threads are responsible for driving this work? I believe there are 3 kinds
> of retention processes we are looking at:
>   (a) Regular retention for data in tiered storage as per configured `
> retention.ms` / `retention.bytes`.
>   (b) Local retention for data in local storage as per configured `
> local.log.retention.ms` / `local.log.retention.bytes`
>   (c) Possibly regular retention for data in local storage, if the tiering
> task is lagging or for data that is below the log start offset.
>
> 2. When does a segment become eligible to be tiered? Is it as soon as the
> segment is rolled and the end offset is less than the last stable offset as
> mentioned in the KIP? I wonder if we need to consider other parameters too,
> like the highwatermark so that we are guaranteed that 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-09-16 Thread Dhruvil Shah
Hi Satish, Harsha,

Thanks for the KIP. Few questions below:

1. Could you describe how retention would work with this KIP and which
threads are responsible for driving this work? I believe there are 3 kinds
of retention processes we are looking at:
  (a) Regular retention for data in tiered storage as per configured `
retention.ms` / `retention.bytes`.
  (b) Local retention for data in local storage as per configured `
local.log.retention.ms` / `local.log.retention.bytes`
  (c) Possibly regular retention for data in local storage, if the tiering
task is lagging or for data that is below the log start offset.

2. When does a segment become eligible to be tiered? Is it as soon as the
segment is rolled and the end offset is less than the last stable offset as
mentioned in the KIP? I wonder if we need to consider other parameters too,
like the highwatermark so that we are guaranteed that what we are tiering
has been committed to the log and accepted by the ISR.

3. The section on "Follower Fetch Scenarios" is useful but is a bit
difficult to parse at the moment. It would be useful to summarize the
changes we need in the ReplicaFetcher.

4. Related to the above, it's a bit unclear how we are planning on
restoring the producer state for a new replica. Could you expand on that?

5. Similarly, it would be worth summarizing the behavior on unclean leader
election. There are several scenarios to consider here: data loss from
local log, data loss from remote log, data loss from metadata topic, etc.
It's worth describing these in detail.

6. It would be useful to add details about how we plan on using RocksDB in
the default implementation of `RemoteLogMetadataManager`.

7. For a READ_COMMITTED FetchRequest, how do we retrieve and return the
aborted transaction metadata?

8. The `LogSegmentData` class assumes that we have a log segment, offset
index, time index, transaction index, producer snapshot and leader epoch
index. How do we deal with cases where we do not have one or more of these?
For example, we may not have a transaction index or producer snapshot for a
particular segment. The former is optional, and the latter is only kept for
up to the 3 latest segments.

Thanks,
Dhruvil

On Mon, Sep 7, 2020 at 6:54 PM Harsha Ch  wrote:

> Hi All,
>
> We are all working through the last meeting feedback. I'll cancel
> tomorrow's meeting and we can meanwhile continue our discussion on the mailing
> list. We can start the regular meetings from next week onwards.
>
> Thanks,
>
> Harsha
>
> On Fri, Sep 04, 2020 at 8:41 AM, Satish Duggana < satish.dugg...@gmail.com
> > wrote:
>
> >
> >
> >
> > Hi Jun,
> > Thanks for your thorough review and comments. Please find the inline
> > replies below.
> >
> >
> >
> > 600. The topic deletion logic needs more details.
> > 600.1 The KIP mentions "The controller considers the topic partition is
> > deleted only when it determines that there are no log segments for that
> > topic partition by using RLMM". How is this done?
> >
> >
> >
> > It uses RLMM#listSegments(), which returns all the segments for the given topic
> > partition.
> >
> >
> >
> > 600.2 "If the delete option is enabled then the leader will stop RLM task
> > and stop processing and it sets all the remote log segment metadata of
> > that partition with a delete marker and publishes them to RLMM." We
> > discussed this earlier. When a topic is being deleted, there may not be a
> > leader for the deleted partition.
> >
> >
> >
> > This is a good point. As suggested in the meeting, we will add a separate
> > section for topic/partition deletion lifecycle and this scenario will be
> > addressed.
> >
> >
> >
> > 601. Unclean leader election
> > 601.1 Scenario 1: new empty follower
> > After step 1, the follower restores up to offset 3. So why does it have
> > LE-2 at offset 5?
> >
> >
> >
> > Nice catch. It was showing the leader epoch fetched from the remote
> > storage. It should be shown truncated till offset 3. Updated the
> > KIP.
> >
> >
> >
> > 601.2 scenario 5: After Step 3, leader A has inconsistent data between its
> > local and the tiered data. For example. offset 3 has msg 3 LE-0 locally,
> > but msg 5 LE-1 in the remote store. While it's ok for the unclean leader
> > to lose data, it should still return consistent data, whether it's from
> > the local or the remote store.
> >
> >
> >
> > There is no inconsistency here as LE-0 offsets are [0, 4] and LE-2:
> > [5, ]. It will always get the right records for the given offset and
> > leader epoch. In case of remote, RSM is invoked to get the remote log
> > segment that contains the given offset with the leader epoch.
> >
> >
> >
> > 601.4 It seems that retention is based on
> > listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
> > When there is an unclean leader election, it's possible for the new leader
> > not to include certain epochs in its epoch cache. How are remote
> > segments associated with those epochs being cleaned?
> >
> >
> >
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-09-07 Thread Harsha Ch
Hi All,

We are all working through the last meeting feedback. I'll cancel tomorrow's
meeting and we can meanwhile continue our discussion on the mailing list. We can
start the regular meetings from next week onwards.

Thanks,

Harsha

On Fri, Sep 04, 2020 at 8:41 AM, Satish Duggana < satish.dugg...@gmail.com > 
wrote:

> 
> 
> 
> Hi Jun,
> Thanks for your thorough review and comments. Please find the inline
> replies below.
> 
> 
> 
> 600. The topic deletion logic needs more details.
> 600.1 The KIP mentions "The controller considers the topic partition is
> deleted only when it determines that there are no log segments for that
> topic partition by using RLMM". How is this done?
> 
> 
> 
> It uses RLMM#listSegments(), which returns all the segments for the given topic
> partition.
> 
> 
> 
> 600.2 "If the delete option is enabled then the leader will stop RLM task
> and stop processing and it sets all the remote log segment metadata of
> that partition with a delete marker and publishes them to RLMM." We
> discussed this earlier. When a topic is being deleted, there may not be a
> leader for the deleted partition.
> 
> 
> 
> This is a good point. As suggested in the meeting, we will add a separate
> section for topic/partition deletion lifecycle and this scenario will be
> addressed.
> 
> 
> 
> 601. Unclean leader election
> 601.1 Scenario 1: new empty follower
> After step 1, the follower restores up to offset 3. So why does it have
> LE-2 at offset 5?
> 
> 
> 
> Nice catch. It was showing the leader epoch fetched from the remote
> storage. It should be shown truncated till offset 3. Updated the
> KIP.
> 
> 
> 
> 601.2 scenario 5: After Step 3, leader A has inconsistent data between its
> local and the tiered data. For example. offset 3 has msg 3 LE-0 locally,
> but msg 5 LE-1 in the remote store. While it's ok for the unclean leader
> to lose data, it should still return consistent data, whether it's from
> the local or the remote store.
> 
> 
> 
> There is no inconsistency here as LE-0 offsets are [0, 4] and LE-2:
> [5, ]. It will always get the right records for the given offset and
> leader epoch. In case of remote, RSM is invoked to get the remote log
> segment that contains the given offset with the leader epoch.
> 
> 
> 
> 601.4 It seems that retention is based on
> listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
> When there is an unclean leader election, it's possible for the new leader
> not to include certain epochs in its epoch cache. How are remote
> segments associated with those epochs being cleaned?
> 
> 
> 
> That is a good point. This leader will also clean up the epochs earlier than
> its start leader epoch and delete those segments. It gets the earliest
> epoch for a partition and starts deleting segments from that leader epoch.
> We need one more API in RLMM to get the earliest leader epoch.
> 
> 
> 
> 601.5 The KIP discusses the handling of unclean leader elections for user
> topics. What about unclean leader elections on
> __remote_log_segment_metadata?
> This is the same as other system topics like __consumer_offsets and
> __transaction_state. As discussed in the meeting, we will add the
> behavior of __remote_log_segment_metadata topic’s unclean leader
> truncation.
> 
> 
> 
> 602. It would be useful to clarify the limitations in the initial release.
> The KIP mentions not supporting compacted topics. What about JBOD and
> changing the configuration of a topic from delete to compact after
> remote.log.storage.enable is enabled?
> 
> 
> 
> This was updated in the KIP earlier.
> 
> 
> 
> 603. RLM leader tasks:
> 603.1"It checks for rolled over LogSegments (which have the last message
> offset less than last stable offset of that topic partition) and copies
> them along with their offset/time/transaction indexes and leader epoch
> cache to the remote tier." It needs to copy the producer snapshot too.
> 
> 
> 
> Right. It copies producer snapshots too as mentioned in LogSegmentData.
> 
> 
> 
> 603.2 "Local logs are not cleaned up till those segments are copied
> successfully to remote even though their retention time/size is reached"
> This seems weird. If the tiering stops because the remote store is not
> available, we don't want the local data to grow forever.
> 
> 
> 
> It was clarified in the discussion that the comment was more about the
> local storage growing beyond log.retention. The above statement is about
> local.log.retention, not the complete log.retention. When it
> reaches log.retention, it will delete the local logs even though
> those are not copied to remote storage.
> 
> 
> 
> 604. "RLM maintains a bounded cache(possibly LRU) of the index files of
> remote log segments to avoid multiple index fetches from the remote
> storage. These indexes can be used in the same way as local segment
> indexes are used." Could you provide more details on this? Are the indexes
> cached in 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-09-04 Thread Satish Duggana
Hi Jun,
Thanks for your thorough review and comments. Please find the inline
replies below.

600. The topic deletion logic needs more details.
600.1 The KIP mentions "The controller considers the topic partition is
deleted only when it determines that there are no log segments for that
topic partition by using RLMM". How is this done?

It uses RLMM#listSegments(), which returns all the segments for the given
topic partition.
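
Roughly, the controller-side check could look like the sketch below, using the
listRemoteLogSegments naming from elsewhere in this thread; this is illustrative
only, not the actual controller code.

import java.util.Iterator;
import org.apache.kafka.common.TopicPartition;

final class TopicDeletionCheck {
    // The partition is considered fully deleted only when RLMM reports no
    // remaining remote log segments for it.
    static boolean allRemoteSegmentsDeleted(RemoteLogMetadataManager rlmm,
                                            TopicPartition topicPartition) {
        Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(topicPartition);
        return !segments.hasNext();
    }
}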

600.2 "If the delete option is enabled then the leader will stop RLM task
and stop processing and it sets all the remote log segment metadata of that
partition with a delete marker and publishes them to RLMM." We discussed
this earlier. When a topic is being deleted, there may not be a leader for
the deleted partition.

This is a good point. As suggested in the meeting, we will add a
separate section for topic/partition deletion lifecycle and this
scenario will be addressed.

601. Unclean leader election
601.1 Scenario 1: new empty follower
After step 1, the follower restores up to offset 3. So why does it have
LE-2 at offset 5?

Nice catch. It was showing the leader epoch fetched from the remote
storage. It should be shown truncated till offset 3. Updated
the KIP.

601.2 scenario 5: After Step 3, leader A has inconsistent data between its
local and the tiered data. For example. offset 3 has msg 3 LE-0 locally,
but msg 5 LE-1 in the remote store. While it's ok for the unclean leader to
lose data, it should still return consistent data, whether it's from the
local or the remote store.

There is no inconsistency here as LE-0 offsets are [0, 4] and LE-2:
[5, ]. It will always get the right records for the given offset and
leader epoch. In case of remote, RSM is invoked to get the remote log
segment that contains the given offset with the leader epoch.

601.4 It seems that retention is based on
listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
When there is an unclean leader election, it's possible for the new leader
not to include certain epochs in its epoch cache. How are remote
segments associated with those epochs being cleaned?

That is a good point. This leader will also clean up the epochs earlier
than its start leader epoch and delete those segments. It gets the
earliest epoch for a partition and starts deleting segments from that
leader epoch. We need one more API in RLMM to get the earliest leader
epoch.

601.5 The KIP discusses the handling of unclean leader elections for user
topics. What about unclean leader elections on
__remote_log_segment_metadata?
This is the same as other system topics like __consumer_offsets and
__transaction_state. As discussed in the meeting, we will add
the behavior of __remote_log_segment_metadata topic’s unclean leader
truncation.

602. It would be useful to clarify the limitations in the initial release.
The KIP mentions not supporting compacted topics. What about JBOD and
changing the configuration of a topic from delete to compact
after remote.log.storage.enable is enabled?

This was updated in the KIP earlier.

603. RLM leader tasks:
603.1"It checks for rolled over LogSegments (which have the last message
offset less than last stable offset of that topic partition) and copies
them along with their offset/time/transaction indexes and leader epoch
cache to the remote tier." It needs to copy the producer snapshot too.

Right. It copies producer snapshots too as mentioned in LogSegmentData.
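
For readers skimming the thread, the set of files that travels with each
copied segment (as carried by LogSegmentData in the KIP) can be pictured
roughly as below; the field names are illustrative rather than the exact
class definition:

    import java.nio.file.Path;

    // Illustrative sketch of what is shipped to remote storage per segment.
    class CopiedSegmentFilesSketch {
        Path logSegment;         // the .log file itself
        Path offsetIndex;        // .index
        Path timeIndex;          // .timeindex
        Path txnIndex;           // .txnindex (aborted transaction index)
        Path producerSnapshot;   // .snapshot, needed to rebuild producer state
        byte[] leaderEpochCache; // leader epoch checkpoint contents
    }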

603.2 "Local logs are not cleaned up till those segments are copied
successfully to remote even though their retention time/size is reached"
This seems weird. If the tiering stops because the remote store is not
available, we don't want the local data to grow forever.

It was clarified in the discussion that the comment was about the
local storage growing beyond log.retention. The above statement is
about local.log.retention, not the complete log.retention. Once
log.retention is reached, the local logs are deleted even if they have
not been copied to remote storage.
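
Restated as a small sketch (parameter names are illustrative, and only
time-based retention is shown for brevity): a local segment can go away
once it is copied to remote and past local.log.retention, or once it is
past the overall log.retention even if it was never copied.

    // Illustrative sketch of the local deletion rule described above.
    class LocalDeletionRuleSketch {
        static boolean eligibleForLocalDeletion(long segmentLastOffset,
                                                long segmentMaxTimestampMs,
                                                long lastCopiedOffset,
                                                long nowMs,
                                                long logRetentionMs,
                                                long localLogRetentionMs) {
            boolean copiedToRemote = segmentLastOffset <= lastCopiedOffset;
            boolean pastLocalRetention = nowMs - segmentMaxTimestampMs > localLogRetentionMs;
            boolean pastTotalRetention = nowMs - segmentMaxTimestampMs > logRetentionMs;
            return (copiedToRemote && pastLocalRetention) || pastTotalRetention;
        }
    }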


604. "RLM maintains a bounded cache(possibly LRU) of the index files of
remote log segments to avoid multiple index fetches from the remote
storage. These indexes can be used in the same way as local segment indexes
are used." Could you provide more details on this? Are the indexes cached
in memory or on disk? If on disk, where are they stored? Are the cached
indexes bound by a certain size?

These are cached on disk and stored in log.dir under the name
“__remote_log_index_cache”. They are bounded by the total size, which
will be exposed as a user configuration.
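
A minimal sketch of a size-bounded, LRU-style index cache along those
lines (the eviction mechanics below are illustrative, not the actual
implementation; only the “__remote_log_index_cache” name comes from the
reply above):

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch: index files are keyed by remote segment id, kept in
    // LRU order, and evicted once the cached bytes exceed the configured bound
    // (the bound being the user configuration mentioned above).
    class RemoteIndexCacheSketch {
        private final long maxTotalBytes;
        private long totalBytes = 0;
        private final LinkedHashMap<String, Long> indexSizesBySegmentId =
                new LinkedHashMap<>(16, 0.75f, true /* access order => LRU */);

        RemoteIndexCacheSketch(long maxTotalBytes) {
            this.maxTotalBytes = maxTotalBytes;
        }

        synchronized void put(String segmentId, long indexFileBytes) {
            Long previous = indexSizesBySegmentId.put(segmentId, indexFileBytes);
            totalBytes += indexFileBytes - (previous == null ? 0 : previous);
            evictLeastRecentlyUsed();
        }

        private void evictLeastRecentlyUsed() {
            Iterator<Map.Entry<String, Long>> it = indexSizesBySegmentId.entrySet().iterator();
            while (totalBytes > maxTotalBytes && it.hasNext()) {
                Map.Entry<String, Long> eldest = it.next();
                totalBytes -= eldest.getValue();
                it.remove(); // the corresponding index files would be deleted from disk here
            }
        }
    }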

605. BuildingRemoteLogAux
605.1 In this section, two options are listed. Which one is chosen?
Option-2, updated the KIP.

605.2 In option 2, it says  "Build the local leader epoch cache by cutting
the leader epoch sequence received from remote storage to [LSO, ELO]. (LSO
= log start offset)." We need to do the same thing for the producer
snapshot. However, it's hard to cut the producer snapshot 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-31 Thread Jun Rao
Hi, Satish,

Thanks for the updated KIP. Made another pass. A few more comments below.

600. The topic deletion logic needs more details.
600.1 The KIP mentions "The controller considers the topic partition is
deleted only when it determines that there are no log segments for that
topic partition by using RLMM". How is this done?
600.2 "If the delete option is enabled then the leader will stop RLM task
and stop processing and it sets all the remote log segment metadata of that
partition with a delete marker and publishes them to RLMM." We discussed
this earlier. When a topic is being deleted, there may not be a leader for
the deleted partition.

601. Unclean leader election
601.1 Scenario 1: new empty follower
After step 1, the follower restores up to offset 3. So why does it have
LE-2 at offset 5?
601.2 Scenario 5: After Step 3, leader A has inconsistent data between its
local and the tiered data. For example, offset 3 has msg 3 LE-0 locally,
but msg 5 LE-1 in the remote store. While it's ok for the unclean leader to
lose data, it should still return consistent data, whether it's from the
local or the remote store.
601.3 The follower picks up log start offset using the following api.
Suppose that we have 3 remote segments (LE, SegmentStartOffset) as (2, 10),
(3, 20) and (7, 15) due to an unclean leader election. Using the following
api will cause logStartOffset to go backward from 20 to 15. How do we
prevent that?
earliestLogOffset(TopicPartition topicPartition, int leaderEpoch)
601.4 It seems that retention is based on
listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
When there is an unclean leader election, it's possible for the new leader
not to include certain epochs in its epoch cache. How are remote
segments associated with those epochs being cleaned?
601.5 The KIP discusses the handling of unclean leader elections for user
topics. What about unclean leader elections on
__remote_log_segment_metadata?

602. It would be useful to clarify the limitations in the initial release.
The KIP mentions not supporting compacted topics. What about JBOD and
changing the configuration of a topic from delete to compact
after remote.log.storage.enable is enabled?

603. RLM leader tasks:
603.1"It checks for rolled over LogSegments (which have the last message
offset less than last stable offset of that topic partition) and copies
them along with their offset/time/transaction indexes and leader epoch
cache to the remote tier." It needs to copy the producer snapshot too.
603.2 "Local logs are not cleaned up till those segments are copied
successfully to remote even though their retention time/size is reached"
This seems weird. If the tiering stops because the remote store is not
available, we don't want the local data to grow forever.

604. "RLM maintains a bounded cache(possibly LRU) of the index files of
remote log segments to avoid multiple index fetches from the remote
storage. These indexes can be used in the same way as local segment indexes
are used." Could you provide more details on this? Are the indexes cached
in memory or on disk? If on disk, where are they stored? Are the cached
indexes bound by a certain size?

605. BuildingRemoteLogAux
605.1 In this section, two options are listed. Which one is chosen?
605.2 In option 2, it says  "Build the local leader epoch cache by cutting
the leader epoch sequence received from remote storage to [LSO, ELO]. (LSO
= log start offset)." We need to do the same thing for the producer
snapshot. However, it's hard to cut the producer snapshot to an earlier
offset. Another option is to simply take the lastOffset from the remote
segment and use that as the starting fetch offset in the follower. This
avoids the need for cutting.

606. ListOffsets: Since we need a version bump, could you document it under
a protocol change section?

607. "LogStartOffset of a topic can point to either of local segment or
remote segment but it is initialised and maintained in the Log class like
now. This is already maintained in `Log` class while loading the logs and
it can also be fetched from RemoteLogMetadataManager." What will happen to
the existing logic (e.g. log recovery) that currently depends on
logStartOffset but assumes it's local?

608. Handle expired remote segment: How does it pick up new logStartOffset
from deleteRecords?

609. RLMM message format:
609.1 It includes both MaxTimestamp and EventTimestamp. Where does it get
both since the message in the log only contains one timestamp?
609.2 If we change just the state (e.g. to DELETE_STARTED), it seems it's
wasteful to have to include all other fields not changed.
609.3 Could you document which process makes the following transitions
DELETE_MARKED, DELETE_STARTED, DELETE_FINISHED?

610. remote.log.reader.max.pending.tasks: "Maximum remote log reader thread
pool task queue size. If the task queue is full, broker will stop reading
remote log segments."  What does the broker do if the queue is full?

611. 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-31 Thread Satish Duggana
KIP is updated with
- Remote log segment metadata topic message format/schema.
- Added remote log segment metadata state transitions and explained
how the deletion of segments is handled, including the case of
partition deletions.
- Added a few more limitations in the "Non goals" section.

Thanks,
Satish.

On Thu, Aug 27, 2020 at 12:42 AM Harsha Ch  wrote:
>
> Updated the KIP with Meeting Notes section
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes
>
> On Tue, Aug 25, 2020 at 1:03 PM Jun Rao  wrote:
>
> > Hi, Harsha,
> >
> > Thanks for the summary. Could you add the summary and the recording link to
> > the last section of
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > ?
> >
> > Jun
> >
> > On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani 
> > wrote:
> >
> > > Thanks everyone for attending the meeting today.
> > > Here is the recording
> > >
> > >
> > https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing
> > >
> > > Notes:
> > >
> > >1. KIP is updated with follower fetch protocol and ready to be reviewed
> > >2. Satish to capture schema of internal metadata topic in the KIP
> > >3. We will update the KIP with details of different cases
> > >4. Test plan will be captured in a doc and will add to the KIP
> > >5. Add a section "Limitations" to capture the capabilities that will
> > be
> > >introduced with this KIP and what will not be covered in this KIP.
> > >
> > > Please add to it if I missed anything. Will produce formal meeting notes
> > > from next meeting onwards.
> > >
> > > Thanks,
> > > Harsha
> > >
> > >
> > >
> > > On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng 
> > > wrote:
> > >
> > > > We did some basic feature tests at Uber. The test cases and results are
> > > > shared in this google doc:
> > > > https://docs.google.com/spreadsheets/d/
> > > > 1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
> > > >
> > > > The performance test results were already shared in the KIP last month.
> > > >
> > > > On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch 
> > wrote:
> > > >
> > > > "Understand commitments towards driving design & implementation of the
> > > KIP
> > > > further and how it aligns with participant interests in contributing to
> > > the
> > > > efforts (ex: in the context of Uber’s Q3/Q4 roadmap)." What is that
> > > about?
> > > >
> > > > On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <
> > > kpraka...@confluent.io>
> > > > wrote:
> > > >
> > > > Hi Harsha,
> > > >
> > > > The following google doc contains a proposal for temporary agenda for
> > the
> > > > KIP-405  sync meeting
> > > > tomorrow:
> > > >
> > > > https://docs.google.com/document/d/
> > > > 1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
> > > >
> > > > .
> > > > Please could you add it to the Google calendar invite?
> > > >
> > > > Thank you.
> > > >
> > > > Cheers,
> > > > Kowshik
> > > >
> > > > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch 
> > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> > > > community to be able to follow the discussion.
> > > >
> > > > Jun, please add the required folks on confluent side.
> > > >
> > > > Thanks,
> > > >
> > > > Harsha
> > > >
> > > > On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
> > alexandre.dupriez@
> > > > gmail.com > wrote:
> > > >
> > > > Hi Jun,
> > > >
> > > > Many thanks for your initiative.
> > > >
> > > > If you like, I am happy to attend at the time you suggested.
> > > >
> > > > Many thanks,
> > > > Alexandre
> > > >
> > > > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
> > > harsha.
> > > > c...@gmail.com ) > a écrit :
> > > >
> > > > Hi Jun,
> > > > Thanks. This will help a lot. Tuesday will work for us.
> > > > -Harsha
> > > >
> > > > On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io ( jun@
> > > > confluent.io ) > wrote:
> > > >
> > > > Hi, Satish, Ying, Harsha,
> > > >
> > > > Do you think it would be useful to have a regular virtual meeting to
> > > > discuss this KIP? The goal of the meeting will be sharing
> > > > design/development progress and discussing any open issues to
> > > >
> > > > accelerate
> > > >
> > > > this KIP. If so, will every Tuesday (from next week) 9am-10am
> > > >
> > > > PT
> > > >
> > > > work for you? I can help set up a Zoom meeting, invite everyone who
> > > >
> > > > might
> > > >
> > > > be interested, have it recorded and shared, etc.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> > > >
> > > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> > > >
> > > > wrote:
> > > >
> > > > Hi Kowshik,
> > > >
> > > > Thanks for looking into the KIP and sending your comments.
> > > >
> > > > 5001. Under the section "Follower fetch 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-26 Thread Harsha Ch
Updated the KIP with Meeting Notes section
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes

On Tue, Aug 25, 2020 at 1:03 PM Jun Rao  wrote:

> Hi, Harsha,
>
> Thanks for the summary. Could you add the summary and the recording link to
> the last section of
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> ?
>
> Jun
>
> On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani 
> wrote:
>
> > Thanks everyone for attending the meeting today.
> > Here is the recording
> >
> >
> https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing
> >
> > Notes:
> >
> >1. KIP is updated with follower fetch protocol and ready to be reviewed
> >2. Satish to capture schema of internal metadata topic in the KIP
> >3. We will update the KIP with details of different cases
> >4. Test plan will be captured in a doc and will add to the KIP
> >5. Add a section "Limitations" to capture the capabilities that will
> be
> >introduced with this KIP and what will not be covered in this KIP.
> >
> > Please add to it if I missed anything. Will produce formal meeting notes
> > from next meeting onwards.
> >
> > Thanks,
> > Harsha
> >
> >
> >
> > On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng 
> > wrote:
> >
> > > We did some basic feature tests at Uber. The test cases and results are
> > > shared in this google doc:
> > > https://docs.google.com/spreadsheets/d/
> > > 1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
> > >
> > > The performance test results were already shared in the KIP last month.
> > >
> > > On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch 
> wrote:
> > >
> > > "Understand commitments towards driving design & implementation of the
> > KIP
> > > further and how it aligns with participant interests in contributing to
> > the
> > > efforts (ex: in the context of Uber’s Q3/Q4 roadmap)." What is that
> > about?
> > >
> > > On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <
> > kpraka...@confluent.io>
> > > wrote:
> > >
> > > Hi Harsha,
> > >
> > > The following google doc contains a proposal for temporary agenda for
> the
> > > KIP-405  sync meeting
> > > tomorrow:
> > >
> > > https://docs.google.com/document/d/
> > > 1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
> > >
> > > .
> > > Please could you add it to the Google calendar invite?
> > >
> > > Thank you.
> > >
> > > Cheers,
> > > Kowshik
> > >
> > > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch 
> wrote:
> > >
> > > Hi All,
> > >
> > > Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> > > community to be able to follow the discussion.
> > >
> > > Jun, please add the required folks on confluent side.
> > >
> > > Thanks,
> > >
> > > Harsha
> > >
> > > On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
> alexandre.dupriez@
> > > gmail.com > wrote:
> > >
> > > Hi Jun,
> > >
> > > Many thanks for your initiative.
> > >
> > > If you like, I am happy to attend at the time you suggested.
> > >
> > > Many thanks,
> > > Alexandre
> > >
> > > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
> > harsha.
> > > c...@gmail.com ) > a écrit :
> > >
> > > Hi Jun,
> > > Thanks. This will help a lot. Tuesday will work for us.
> > > -Harsha
> > >
> > > On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io ( jun@
> > > confluent.io ) > wrote:
> > >
> > > Hi, Satish, Ying, Harsha,
> > >
> > > Do you think it would be useful to have a regular virtual meeting to
> > > discuss this KIP? The goal of the meeting will be sharing
> > > design/development progress and discussing any open issues to
> > >
> > > accelerate
> > >
> > > this KIP. If so, will every Tuesday (from next week) 9am-10am
> > >
> > > PT
> > >
> > > work for you? I can help set up a Zoom meeting, invite everyone who
> > >
> > > might
> > >
> > > be interested, have it recorded and shared, etc.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> > >
> > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> > >
> > > wrote:
> > >
> > > Hi Kowshik,
> > >
> > > Thanks for looking into the KIP and sending your comments.
> > >
> > > 5001. Under the section "Follower fetch protocol in detail", the
> > > next-local-offset is the offset upto which the segments are copied
> > >
> > > to
> > >
> > > remote storage. Instead, would last-tiered-offset be a better name
> > >
> > > than
> > >
> > > next-local-offset? last-tiered-offset seems to naturally align well
> > >
> > > with
> > >
> > > the definition provided in the KIP.
> > >
> > > Both next-local-offset and local-log-start-offset were introduced
> > >
> > > to
> > >
> > > talk
> > >
> > > about offsets related to local log. We are fine with
> > >
> > > last-tiered-offset
> > >
> > > too as you suggested.
> > >
> > > 5002. After leadership is established for a partition, the leader
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-25 Thread Jun Rao
Hi, Harsha,

Thanks for the summary. Could you add the summary and the recording link to
the last section of
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
?

Jun

On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani 
wrote:

> Thanks everyone for attending the meeting today.
> Here is the recording
>
> https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing
>
> Notes:
>
>1. KIP is updated with follower fetch protocol and ready to be reviewed
>2. Satish to capture schema of internal metadata topic in the KIP
>3. We will update the KIP with details of different cases
>4. Test plan will be captured in a doc and will add to the KIP
>5. Add a section "Limitations" to capture the capabilities that will be
>introduced with this KIP and what will not be covered in this KIP.
>
> Please add to it if I missed anything. Will produce formal meeting notes
> from next meeting onwards.
>
> Thanks,
> Harsha
>
>
>
> On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng 
> wrote:
>
> > We did some basic feature tests at Uber. The test cases and results are
> > shared in this google doc:
> > https://docs.google.com/spreadsheets/d/
> > 1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
> >
> > The performance test results were already shared in the KIP last month.
> >
> > On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch  wrote:
> >
> > "Understand commitments towards driving design & implementation of the
> KIP
> > further and how it aligns with participant interests in contributing to
> the
> > efforts (ex: in the context of Uber’s Q3/Q4 roadmap)." What is that
> about?
> >
> > On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <
> kpraka...@confluent.io>
> > wrote:
> >
> > Hi Harsha,
> >
> > The following google doc contains a proposal for temporary agenda for the
> > KIP-405  sync meeting
> > tomorrow:
> >
> > https://docs.google.com/document/d/
> > 1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
> >
> > .
> > Please could you add it to the Google calendar invite?
> >
> > Thank you.
> >
> > Cheers,
> > Kowshik
> >
> > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch  wrote:
> >
> > Hi All,
> >
> > Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> > community to be able to follow the discussion.
> >
> > Jun, please add the required folks on confluent side.
> >
> > Thanks,
> >
> > Harsha
> >
> > On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez < alexandre.dupriez@
> > gmail.com > wrote:
> >
> > Hi Jun,
> >
> > Many thanks for your initiative.
> >
> > If you like, I am happy to attend at the time you suggested.
> >
> > Many thanks,
> > Alexandre
> >
> > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
> harsha.
> > c...@gmail.com ) > a écrit :
> >
> > Hi Jun,
> > Thanks. This will help a lot. Tuesday will work for us.
> > -Harsha
> >
> > On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io ( jun@
> > confluent.io ) > wrote:
> >
> > Hi, Satish, Ying, Harsha,
> >
> > Do you think it would be useful to have a regular virtual meeting to
> > discuss this KIP? The goal of the meeting will be sharing
> > design/development progress and discussing any open issues to
> >
> > accelerate
> >
> > this KIP. If so, will every Tuesday (from next week) 9am-10am
> >
> > PT
> >
> > work for you? I can help set up a Zoom meeting, invite everyone who
> >
> > might
> >
> > be interested, have it recorded and shared, etc.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> >
> > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> >
> > wrote:
> >
> > Hi Kowshik,
> >
> > Thanks for looking into the KIP and sending your comments.
> >
> > 5001. Under the section "Follower fetch protocol in detail", the
> > next-local-offset is the offset upto which the segments are copied
> >
> > to
> >
> > remote storage. Instead, would last-tiered-offset be a better name
> >
> > than
> >
> > next-local-offset? last-tiered-offset seems to naturally align well
> >
> > with
> >
> > the definition provided in the KIP.
> >
> > Both next-local-offset and local-log-start-offset were introduced
> >
> > to
> >
> > talk
> >
> > about offsets related to local log. We are fine with
> >
> > last-tiered-offset
> >
> > too as you suggested.
> >
> > 5002. After leadership is established for a partition, the leader
> >
> > would
> >
> > begin uploading a segment to remote storage. If successful, the
> >
> > leader
> >
> > would write the updated RemoteLogSegmentMetadata to the metadata
> >
> > topic
> >
> > (via
> >
> > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
> >
> > seems
> >
> > useful that before the first time the segment is uploaded by the
> >
> > leader
> >
> > for
> >
> > a partition, the leader should ensure to catch up to all the
> >
> > metadata
> >
> > events written so far in the metadata topic for that partition (ex:
> >
> > by
> >
> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-25 Thread Harsha Chintalapani
Thanks everyone for attending the meeting today.
Here is the recording
https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing

Notes:

   1. KIP is updated with follower fetch protocol and ready to be reviewed
   2. Satish to capture schema of internal metadata topic in the KIP
   3. We will update the KIP with details of different cases
   4. Test plan will be captured in a doc and will add to the KIP
   5. Add a section "Limitations" to capture the capabilities that will be
   introduced with this KIP and what will not be covered in this KIP.

Please add to it if I missed anything. Will produce formal meeting notes
from next meeting onwards.

Thanks,
Harsha



On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng  wrote:

> We did some basic feature tests at Uber. The test cases and results are
> shared in this google doc:
> https://docs.google.com/spreadsheets/d/
> 1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
>
> The performance test results were already shared in the KIP last month.
>
> On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch  wrote:
>
> "Understand commitments towards driving design & implementation of the KIP
> further and how it aligns with participant interests in contributing to the
> efforts (ex: in the context of Uber’s Q3/Q4 roadmap)." What is that about?
>
> On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam 
> wrote:
>
> Hi Harsha,
>
> The following google doc contains a proposal for temporary agenda for the
> KIP-405  sync meeting
> tomorrow:
>
> https://docs.google.com/document/d/
> 1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
>
> .
> Please could you add it to the Google calendar invite?
>
> Thank you.
>
> Cheers,
> Kowshik
>
> On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch  wrote:
>
> Hi All,
>
> Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> community to be able to follow the discussion.
>
> Jun, please add the required folks on confluent side.
>
> Thanks,
>
> Harsha
>
> On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez < alexandre.dupriez@
> gmail.com > wrote:
>
> Hi Jun,
>
> Many thanks for your initiative.
>
> If you like, I am happy to attend at the time you suggested.
>
> Many thanks,
> Alexandre
>
> Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com ( harsha.
> c...@gmail.com ) > a écrit :
>
> Hi Jun,
> Thanks. This will help a lot. Tuesday will work for us.
> -Harsha
>
> On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io ( jun@
> confluent.io ) > wrote:
>
> Hi, Satish, Ying, Harsha,
>
> Do you think it would be useful to have a regular virtual meeting to
> discuss this KIP? The goal of the meeting will be sharing
> design/development progress and discussing any open issues to
>
> accelerate
>
> this KIP. If so, will every Tuesday (from next week) 9am-10am
>
> PT
>
> work for you? I can help set up a Zoom meeting, invite everyone who
>
> might
>
> be interested, have it recorded and shared, etc.
>
> Thanks,
>
> Jun
>
> On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
>
> satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
>
> wrote:
>
> Hi Kowshik,
>
> Thanks for looking into the KIP and sending your comments.
>
> 5001. Under the section "Follower fetch protocol in detail", the
> next-local-offset is the offset upto which the segments are copied
>
> to
>
> remote storage. Instead, would last-tiered-offset be a better name
>
> than
>
> next-local-offset? last-tiered-offset seems to naturally align well
>
> with
>
> the definition provided in the KIP.
>
> Both next-local-offset and local-log-start-offset were introduced
>
> to
>
> talk
>
> about offsets related to local log. We are fine with
>
> last-tiered-offset
>
> too as you suggested.
>
> 5002. After leadership is established for a partition, the leader
>
> would
>
> begin uploading a segment to remote storage. If successful, the
>
> leader
>
> would write the updated RemoteLogSegmentMetadata to the metadata
>
> topic
>
> (via
>
> RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
>
> seems
>
> useful that before the first time the segment is uploaded by the
>
> leader
>
> for
>
> a partition, the leader should ensure to catch up to all the
>
> metadata
>
> events written so far in the metadata topic for that partition (ex:
>
> by
>
> previous leader). To achieve this, the leader could start a lease
>
> (using
>
> an
>
> establish_leader metadata event) before commencing tiering, and
>
> wait
>
> until
>
> the event is read back. For example, this seems useful to avoid
>
> cases
>
> where
>
> zombie leaders can be active for the same partition. This can also
>
> prove
>
> useful to help avoid making decisions on which segments to be
>
> uploaded
>
> for
>
> a partition, until the current leader has caught up to a complete
>
> view
>
> of
>
> all segments uploaded for the partition so far (otherwise this may
>
> cause
>
> same segment being uploaded twice -- once by the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-24 Thread Ying Zheng
We did some basic feature tests at Uber. The test cases and results are
shared in this google doc:
https://docs.google.com/spreadsheets/d/1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing

The performance test results were already shared in the KIP last month.

On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch  wrote:

> "Understand commitments towards driving design & implementation of the KIP
> further and how it aligns with participant interests in contributing to the
> efforts (ex: in the context of Uber’s Q3/Q4 roadmap)."
> What is that about?
>
> On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam 
> wrote:
>
> > Hi Harsha,
> >
> > The following google doc contains a proposal for temporary agenda for the
> > KIP-405  sync meeting
> > tomorrow:
> >
> https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
> >  .
> > Please could you add it to the Google calendar invite?
> >
> > Thank you.
> >
> >
> > Cheers,
> > Kowshik
> >
> > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch  wrote:
> >
> >> Hi All,
> >>
> >> Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> >> community to be able to follow the discussion.
> >>
> >> Jun, please add the required folks on confluent side.
> >>
> >> Thanks,
> >>
> >> Harsha
> >>
> >> On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
> >> alexandre.dupr...@gmail.com > wrote:
> >>
> >> >
> >> >
> >> >
> >> > Hi Jun,
> >> >
> >> >
> >> >
> >> > Many thanks for your initiative.
> >> >
> >> >
> >> >
> >> > If you like, I am happy to attend at the time you suggested.
> >> >
> >> >
> >> >
> >> > Many thanks,
> >> > Alexandre
> >> >
> >> >
> >> >
> >> > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
> >> > harsha...@gmail.com ) > a écrit :
> >> >
> >> >
> >> >>
> >> >>
> >> >> Hi Jun,
> >> >> Thanks. This will help a lot. Tuesday will work for us.
> >> >> -Harsha
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io (
> >> >> j...@confluent.io ) > wrote:
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>> Hi, Satish, Ying, Harsha,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Do you think it would be useful to have a regular virtual meeting to
> >> >>> discuss this KIP? The goal of the meeting will be sharing
> >> >>> design/development progress and discussing any open issues to
> >> accelerate
> >> >>> this KIP. If so, will every Tuesday (from next week) 9am-10am
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> PT
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>> work for you? I can help set up a Zoom meeting, invite everyone who
> >> might
> >> >>> be interested, have it recorded and shared, etc.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Jun
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>> wrote:
> >> >>>
> >> >>>
> >> 
> >> 
> >>  Hi Kowshik,
> >> 
> >> 
> >> 
> >>  Thanks for looking into the KIP and sending your comments.
> >> 
> >> 
> >> 
> >>  5001. Under the section "Follower fetch protocol in detail", the
> >>  next-local-offset is the offset upto which the segments are copied
> to
> >>  remote storage. Instead, would last-tiered-offset be a better name
> >> than
> >>  next-local-offset? last-tiered-offset seems to naturally align well
> >> 
> >> 
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> with
> >> >>
> >> >>
> >> >>>
> >> 
> >> 
> >>  the definition provided in the KIP.
> >> 
> >> 
> >> 
> >>  Both next-local-offset and local-log-start-offset were introduced
> to
> >> talk
> >>  about offsets related to local log. We are fine with
> >> last-tiered-offset
> >>  too as you suggested.
> >> 
> >> 
> >> 
> >>  5002. After leadership is established for a partition, the leader
> >> would
> >>  begin uploading a segment to remote storage. If successful, the
> >> leader
> >>  would write the updated RemoteLogSegmentMetadata to the metadata
> >> topic
> >> 
> >> 
> >> >>>
> >> >>>
> >> >>>
> >> >>> (via
> >> >>>
> >> >>>
> >> 
> >> 
> >>  RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
> >> seems
> >>  useful that before the first time the segment is uploaded by the
> >> leader
> >> 
> >> 
> >> >>>
> >> >>>
> >> >>>
> >> >>> for
> >> >>>
> >> >>>
> >> 
> >> 
> >>  a partition, the leader should ensure to catch up to all the
> metadata
> >>  events written so far in the metadata topic for that partition (ex:
> >> by
> >>  previous leader). To achieve this, the leader could start a lease
> >> 
> >> 
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> (using
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>> an
> >> >>>
> >> >>>

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-24 Thread Harsha Ch
"Understand commitments towards driving design & implementation of the KIP
further and how it aligns with participant interests in contributing to the
efforts (ex: in the context of Uber’s Q3/Q4 roadmap)."
What is that about?

On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam 
wrote:

> Hi Harsha,
>
> The following google doc contains a proposal for temporary agenda for the
> KIP-405  sync meeting
> tomorrow:
> https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
>  .
> Please could you add it to the Google calendar invite?
>
> Thank you.
>
>
> Cheers,
> Kowshik
>
> On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch  wrote:
>
>> Hi All,
>>
>> Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
>> community to be able to follow the discussion.
>>
>> Jun, please add the required folks on confluent side.
>>
>> Thanks,
>>
>> Harsha
>>
>> On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
>> alexandre.dupr...@gmail.com > wrote:
>>
>> >
>> >
>> >
>> > Hi Jun,
>> >
>> >
>> >
>> > Many thanks for your initiative.
>> >
>> >
>> >
>> > If you like, I am happy to attend at the time you suggested.
>> >
>> >
>> >
>> > Many thanks,
>> > Alexandre
>> >
>> >
>> >
>> > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
>> > harsha...@gmail.com ) > a écrit :
>> >
>> >
>> >>
>> >>
>> >> Hi Jun,
>> >> Thanks. This will help a lot. Tuesday will work for us.
>> >> -Harsha
>> >>
>> >>
>> >>
>> >> On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io (
>> >> j...@confluent.io ) > wrote:
>> >>
>> >>
>> >>>
>> >>>
>> >>> Hi, Satish, Ying, Harsha,
>> >>>
>> >>>
>> >>>
>> >>> Do you think it would be useful to have a regular virtual meeting to
>> >>> discuss this KIP? The goal of the meeting will be sharing
>> >>> design/development progress and discussing any open issues to
>> accelerate
>> >>> this KIP. If so, will every Tuesday (from next week) 9am-10am
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> PT
>> >>
>> >>
>> >>>
>> >>>
>> >>> work for you? I can help set up a Zoom meeting, invite everyone who
>> might
>> >>> be interested, have it recorded and shared, etc.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>>
>> >>>
>> >>> Jun
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
>> >>
>> >>
>> >>>
>> >>>
>> >>> wrote:
>> >>>
>> >>>
>> 
>> 
>>  Hi Kowshik,
>> 
>> 
>> 
>>  Thanks for looking into the KIP and sending your comments.
>> 
>> 
>> 
>>  5001. Under the section "Follower fetch protocol in detail", the
>>  next-local-offset is the offset upto which the segments are copied to
>>  remote storage. Instead, would last-tiered-offset be a better name
>> than
>>  next-local-offset? last-tiered-offset seems to naturally align well
>> 
>> 
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> with
>> >>
>> >>
>> >>>
>> 
>> 
>>  the definition provided in the KIP.
>> 
>> 
>> 
>>  Both next-local-offset and local-log-start-offset were introduced to
>> talk
>>  about offsets related to local log. We are fine with
>> last-tiered-offset
>>  too as you suggested.
>> 
>> 
>> 
>>  5002. After leadership is established for a partition, the leader
>> would
>>  begin uploading a segment to remote storage. If successful, the
>> leader
>>  would write the updated RemoteLogSegmentMetadata to the metadata
>> topic
>> 
>> 
>> >>>
>> >>>
>> >>>
>> >>> (via
>> >>>
>> >>>
>> 
>> 
>>  RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
>> seems
>>  useful that before the first time the segment is uploaded by the
>> leader
>> 
>> 
>> >>>
>> >>>
>> >>>
>> >>> for
>> >>>
>> >>>
>> 
>> 
>>  a partition, the leader should ensure to catch up to all the metadata
>>  events written so far in the metadata topic for that partition (ex:
>> by
>>  previous leader). To achieve this, the leader could start a lease
>> 
>> 
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> (using
>> >>
>> >>
>> >>>
>> >>>
>> >>> an
>> >>>
>> >>>
>> 
>> 
>>  establish_leader metadata event) before commencing tiering, and wait
>> 
>> 
>> >>>
>> >>>
>> >>>
>> >>> until
>> >>>
>> >>>
>> 
>> 
>>  the event is read back. For example, this seems useful to avoid cases
>> 
>> 
>> >>>
>> >>>
>> >>>
>> >>> where
>> >>>
>> >>>
>> 
>> 
>>  zombie leaders can be active for the same partition. This can also
>> 
>> 
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> prove
>> >>
>> >>
>> >>>
>> 
>> 
>>  useful to help avoid making decisions on which segments to be
>> uploaded
>> 
>> 
>> >>>
>> >>>
>> >>>
>> >>> for
>> >>>
>> >>>
>> 
>> 
>>  a partition, until the current leader has caught up to a complete
>> view
>> 
>> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-24 Thread Kowshik Prakasam
Hi Harsha,

The following google doc contains a proposal for temporary agenda for the
KIP-405  sync meeting
tomorrow:
https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
 .
Please could you add it to the Google calendar invite?

Thank you.


Cheers,
Kowshik

On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch  wrote:

> Hi All,
>
> Scheduled a meeting for Tuesday 9am - 10am. I can record and upload for
> community to be able to follow the discussion.
>
> Jun, please add the required folks on confluent side.
>
> Thanks,
>
> Harsha
>
> On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
> alexandre.dupr...@gmail.com > wrote:
>
> >
> >
> >
> > Hi Jun,
> >
> >
> >
> > Many thanks for your initiative.
> >
> >
> >
> > If you like, I am happy to attend at the time you suggested.
> >
> >
> >
> > Many thanks,
> > Alexandre
> >
> >
> >
> > Le mer. 19 août 2020 à 22:00, Harsha Ch < harsha. ch@ gmail. com (
> > harsha...@gmail.com ) > a écrit :
> >
> >
> >>
> >>
> >> Hi Jun,
> >> Thanks. This will help a lot. Tuesday will work for us.
> >> -Harsha
> >>
> >>
> >>
> >> On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io (
> >> j...@confluent.io ) > wrote:
> >>
> >>
> >>>
> >>>
> >>> Hi, Satish, Ying, Harsha,
> >>>
> >>>
> >>>
> >>> Do you think it would be useful to have a regular virtual meeting to
> >>> discuss this KIP? The goal of the meeting will be sharing
> >>> design/development progress and discussing any open issues to
> accelerate
> >>> this KIP. If so, will every Tuesday (from next week) 9am-10am
> >>>
> >>>
> >>
> >>
> >>
> >> PT
> >>
> >>
> >>>
> >>>
> >>> work for you? I can help set up a Zoom meeting, invite everyone who
> might
> >>> be interested, have it recorded and shared, etc.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>> Jun
> >>>
> >>>
> >>>
> >>> On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> >>>
> >>>
> >>
> >>
> >>
> >> satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> >>
> >>
> >>>
> >>>
> >>> wrote:
> >>>
> >>>
> 
> 
>  Hi Kowshik,
> 
> 
> 
>  Thanks for looking into the KIP and sending your comments.
> 
> 
> 
>  5001. Under the section "Follower fetch protocol in detail", the
>  next-local-offset is the offset upto which the segments are copied to
>  remote storage. Instead, would last-tiered-offset be a better name
> than
>  next-local-offset? last-tiered-offset seems to naturally align well
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> with
> >>
> >>
> >>>
> 
> 
>  the definition provided in the KIP.
> 
> 
> 
>  Both next-local-offset and local-log-start-offset were introduced to
> talk
>  about offsets related to local log. We are fine with
> last-tiered-offset
>  too as you suggested.
> 
> 
> 
>  5002. After leadership is established for a partition, the leader
> would
>  begin uploading a segment to remote storage. If successful, the leader
>  would write the updated RemoteLogSegmentMetadata to the metadata topic
> 
> 
> >>>
> >>>
> >>>
> >>> (via
> >>>
> >>>
> 
> 
>  RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
> seems
>  useful that before the first time the segment is uploaded by the
> leader
> 
> 
> >>>
> >>>
> >>>
> >>> for
> >>>
> >>>
> 
> 
>  a partition, the leader should ensure to catch up to all the metadata
>  events written so far in the metadata topic for that partition (ex: by
>  previous leader). To achieve this, the leader could start a lease
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> (using
> >>
> >>
> >>>
> >>>
> >>> an
> >>>
> >>>
> 
> 
>  establish_leader metadata event) before commencing tiering, and wait
> 
> 
> >>>
> >>>
> >>>
> >>> until
> >>>
> >>>
> 
> 
>  the event is read back. For example, this seems useful to avoid cases
> 
> 
> >>>
> >>>
> >>>
> >>> where
> >>>
> >>>
> 
> 
>  zombie leaders can be active for the same partition. This can also
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> prove
> >>
> >>
> >>>
> 
> 
>  useful to help avoid making decisions on which segments to be uploaded
> 
> 
> >>>
> >>>
> >>>
> >>> for
> >>>
> >>>
> 
> 
>  a partition, until the current leader has caught up to a complete view
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> of
> >>
> >>
> >>>
> 
> 
>  all segments uploaded for the partition so far (otherwise this may
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> cause
> >>
> >>
> >>>
> 
> 
>  same segment being uploaded twice -- once by the previous leader and
> 
> 
> >>>
> >>>
> >>
> >>
> >>
> >> then
> >>
> >>
> >>>
> 
> 
>  by the new leader).
> 
> 
> 
>  We allow copying segments to remote storage which may have common
> offsets.
>  Please go through the KIP to understand the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-20 Thread Alexandre Dupriez
Hi Jun,

Many thanks for your initiative.

If you like, I am happy to attend at the time you suggested.

Many thanks,
Alexandre

Le mer. 19 août 2020 à 22:00, Harsha Ch  a écrit :

> Hi Jun,
>  Thanks. This will help a lot. Tuesday will work for us.
> -Harsha
>
>
> On Wed, Aug 19, 2020 at 1:24 PM Jun Rao  wrote:
>
> > Hi, Satish, Ying, Harsha,
> >
> > Do you think it would be useful to have a regular virtual meeting to
> > discuss this KIP? The goal of the meeting will be sharing
> > design/development progress and discussing any open issues to
> > accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am
> PT
> > work for you? I can help set up a Zoom meeting, invite everyone who might
> > be interested, have it recorded and shared, etc.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> satish.dugg...@gmail.com>
> > wrote:
> >
> > > Hi  Kowshik,
> > >
> > > Thanks for looking into the  KIP and sending your comments.
> > >
> > > 5001. Under the section "Follower fetch protocol in detail", the
> > > next-local-offset is the offset upto which the segments are copied to
> > > remote storage. Instead, would last-tiered-offset be a better name than
> > > next-local-offset? last-tiered-offset seems to naturally align well
> with
> > > the definition provided in the KIP.
> > >
> > > Both next-local-offset and local-log-start-offset were introduced to
> > > talk about offsets related to local log. We are fine with
> > > last-tiered-offset too as you suggested.
> > >
> > > 5002. After leadership is established for a partition, the leader would
> > > begin uploading a segment to remote storage. If successful, the leader
> > > would write the updated RemoteLogSegmentMetadata to the metadata topic
> > (via
> > > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
> > > useful that before the first time the segment is uploaded by the leader
> > for
> > > a partition, the leader should ensure to catch up to all the metadata
> > > events written so far in the metadata topic for that partition (ex: by
> > > previous leader). To achieve this, the leader could start a lease
> (using
> > an
> > > establish_leader metadata event) before commencing tiering, and wait
> > until
> > > the event is read back. For example, this seems useful to avoid cases
> > where
> > > zombie leaders can be active for the same partition. This can also
> prove
> > > useful to help avoid making decisions on which segments to be uploaded
> > for
> > > a partition, until the current leader has caught up to a complete view
> of
> > > all segments uploaded for the partition so far (otherwise this may
> cause
> > > same segment being uploaded twice -- once by the previous leader and
> then
> > > by the new leader).
> > >
> > > We allow copying segments to remote storage which may have common
> > > offsets. Please go through the KIP to understand the follower fetch
> > > protocol(1) and follower to leader transition(2).
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > >
> > >
> > > 5003. There is a natural interleaving between uploading a segment to
> > remote
> > > store, and, writing a metadata event for the same (via
> > > RLMM.putRemoteLogSegmentData). There can be cases where a remote
> segment
> > is
> > > uploaded, then the leader fails and a corresponding metadata event
> never
> > > gets written. In such cases, the orphaned remote segment has to be
> > > eventually deleted (since there is no confirmation of the upload). To
> > > handle this, we could use 2 separate metadata events viz.
> copy_initiated
> > > and copy_completed, so that copy_initiated events that don't have a
> > > corresponding copy_completed event can be treated as garbage and
> deleted
> > > from the remote object store by the broker.
> > >
> > > We are already updating RMM with RemoteLogSegmentMetadata pre and post
> > > copying of log segments. We had a flag in RemoteLogSegmentMetadata
> > > whether it is copied or not. But we are making changes in
> > > RemoteLogSegmentMetadata to introduce a state field in
> > > RemoteLogSegmentMetadata which will have the respective started and
> > > finished states. This includes for other operations like delete too.
> > >
> > > 5004. In the default implementation of RLMM (using the internal topic
> > > __remote_log_metadata), a separate topic called
> > > __remote_segments_to_be_deleted is going to be used just to track
> > failures
> > > in removing remote log segments. A separate topic (effectively another
> > > metadata stream) introduces some maintenance overhead and design
> > > complexity. It seems to me that the same can be achieved just by using
> > just
> > > the __remote_log_metadata topic 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-19 Thread Harsha Ch
Hi Jun,
 Thanks. This will help a lot. Tuesday will work for us.
-Harsha


On Wed, Aug 19, 2020 at 1:24 PM Jun Rao  wrote:

> Hi, Satish, Ying, Harsha,
>
> Do you think it would be useful to have a regular virtual meeting to
> discuss this KIP? The goal of the meeting will be sharing
> design/development progress and discussing any open issues to
> accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am PT
> work for you? I can help set up a Zoom meeting, invite everyone who might
> be interested, have it recorded and shared, etc.
>
> Thanks,
>
> Jun
>
> On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana 
> wrote:
>
> > Hi  Kowshik,
> >
> > Thanks for looking into the  KIP and sending your comments.
> >
> > 5001. Under the section "Follower fetch protocol in detail", the
> > next-local-offset is the offset upto which the segments are copied to
> > remote storage. Instead, would last-tiered-offset be a better name than
> > next-local-offset? last-tiered-offset seems to naturally align well with
> > the definition provided in the KIP.
> >
> > Both next-local-offset and local-log-start-offset were introduced to
> > talk about offsets related to local log. We are fine with
> > last-tiered-offset too as you suggested.
> >
> > 5002. After leadership is established for a partition, the leader would
> > begin uploading a segment to remote storage. If successful, the leader
> > would write the updated RemoteLogSegmentMetadata to the metadata topic
> (via
> > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
> > useful that before the first time the segment is uploaded by the leader
> for
> > a partition, the leader should ensure to catch up to all the metadata
> > events written so far in the metadata topic for that partition (ex: by
> > previous leader). To achieve this, the leader could start a lease (using
> an
> > establish_leader metadata event) before commencing tiering, and wait
> until
> > the event is read back. For example, this seems useful to avoid cases
> where
> > zombie leaders can be active for the same partition. This can also prove
> > useful to help avoid making decisions on which segments to be uploaded
> for
> > a partition, until the current leader has caught up to a complete view of
> > all segments uploaded for the partition so far (otherwise this may cause
> > same segment being uploaded twice -- once by the previous leader and then
> > by the new leader).
> >
> > We allow copying segments to remote storage which may have common
> > offsets. Please go through the KIP to understand the follower fetch
> > protocol(1) and follower to leader transition(2).
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> >
> >
> > 5003. There is a natural interleaving between uploading a segment to
> remote
> > store, and, writing a metadata event for the same (via
> > RLMM.putRemoteLogSegmentData). There can be cases where a remote segment
> is
> > uploaded, then the leader fails and a corresponding metadata event never
> > gets written. In such cases, the orphaned remote segment has to be
> > eventually deleted (since there is no confirmation of the upload). To
> > handle this, we could use 2 separate metadata events viz. copy_initiated
> > and copy_completed, so that copy_initiated events that don't have a
> > corresponding copy_completed event can be treated as garbage and deleted
> > from the remote object store by the broker.
> >
> > We are already updating RMM with RemoteLogSegmentMetadata pre and post
> > copying of log segments. We had a flag in RemoteLogSegmentMetadata
> > whether it is copied or not. But we are making changes in
> > RemoteLogSegmentMetadata to introduce a state field in
> > RemoteLogSegmentMetadata which will have the respective started and
> > finished states. This includes for other operations like delete too.
> >
> > 5004. In the default implementation of RLMM (using the internal topic
> > __remote_log_metadata), a separate topic called
> > __remote_segments_to_be_deleted is going to be used just to track
> failures
> > in removing remote log segments. A separate topic (effectively another
> > metadata stream) introduces some maintenance overhead and design
> > complexity. It seems to me that the same can be achieved just by using
> just
> > the __remote_log_metadata topic with the following steps: 1) the leader
> > writes a delete_initiated metadata event, 2) the leader deletes the
> segment
> > and 3) the leader writes a delete_completed metadata event. Tiered
> segments
> > that have delete_initiated message and not delete_completed message, can
> be
> > considered to be a failure and retried.
> >
> > Jun suggested in earlier mail to keep this simple . We decided not to
> > have this topic as 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-19 Thread Jun Rao
Hi, Satish, Ying, Harsha,

Do you think it would be useful to have a regular virtual meeting to
discuss this KIP? The goal of the meeting will be sharing
design/development progress and discussing any open issues to
accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am PT
work for you? I can help set up a Zoom meeting, invite everyone who might
be interested, have it recorded and shared, etc.

Thanks,

Jun

On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana 
wrote:

> Hi  Kowshik,
>
> Thanks for looking into the  KIP and sending your comments.
>
> 5001. Under the section "Follower fetch protocol in detail", the
> next-local-offset is the offset upto which the segments are copied to
> remote storage. Instead, would last-tiered-offset be a better name than
> next-local-offset? last-tiered-offset seems to naturally align well with
> the definition provided in the KIP.
>
> Both next-local-offset and local-log-start-offset were introduced to
> talk about offsets related to local log. We are fine with
> last-tiered-offset too as you suggested.
>
> 5002. After leadership is established for a partition, the leader would
> begin uploading a segment to remote storage. If successful, the leader
> would write the updated RemoteLogSegmentMetadata to the metadata topic (via
> RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
> useful that before the first time the segment is uploaded by the leader for
> a partition, the leader should ensure to catch up to all the metadata
> events written so far in the metadata topic for that partition (ex: by
> previous leader). To achieve this, the leader could start a lease (using an
> establish_leader metadata event) before commencing tiering, and wait until
> the event is read back. For example, this seems useful to avoid cases where
> zombie leaders can be active for the same partition. This can also prove
> useful to help avoid making decisions on which segments to be uploaded for
> a partition, until the current leader has caught up to a complete view of
> all segments uploaded for the partition so far (otherwise this may cause
> same segment being uploaded twice -- once by the previous leader and then
> by the new leader).
>
> We allow copying segments to remote storage which may have common
> offsets. Please go through the KIP to understand the follower fetch
> protocol(1) and follower to leader transition(2).
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
>
>
> 5003. There is a natural interleaving between uploading a segment to remote
> store, and, writing a metadata event for the same (via
> RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is
> uploaded, then the leader fails and a corresponding metadata event never
> gets written. In such cases, the orphaned remote segment has to be
> eventually deleted (since there is no confirmation of the upload). To
> handle this, we could use 2 separate metadata events viz. copy_initiated
> and copy_completed, so that copy_initiated events that don't have a
> corresponding copy_completed event can be treated as garbage and deleted
> from the remote object store by the broker.
>
> We are already updating RMM with RemoteLogSegmentMetadata pre and post
> copying of log segments. We had a flag in RemoteLogSegmentMetadata
> whether it is copied or not. But we are making changes in
> RemoteLogSegmentMetadata to introduce a state field in
> RemoteLogSegmentMetadata which will have the respective started and
> finished states. This includes for other operations like delete too.
>
> 5004. In the default implementation of RLMM (using the internal topic
> __remote_log_metadata), a separate topic called
> __remote_segments_to_be_deleted is going to be used just to track failures
> in removing remote log segments. A separate topic (effectively another
> metadata stream) introduces some maintenance overhead and design
> complexity. It seems to me that the same can be achieved just by using just
> the __remote_log_metadata topic with the following steps: 1) the leader
> writes a delete_initiated metadata event, 2) the leader deletes the segment
> and 3) the leader writes a delete_completed metadata event. Tiered segments
> that have delete_initiated message and not delete_completed message, can be
> considered to be a failure and retried.
>
> Jun suggested in earlier mail to keep this simple . We decided not to
> have this topic as mentioned in our earlier replies, updated the KIP.
> As I mentioned in an earlier comment, we are  adding state entries for
> delete operations too.
>
> 5005. When a Kafka cluster is provisioned for the first time with KIP-405
> tiered storage enabled, could you explain in the KIP about how the
> bootstrap for __remote_log_metadata topic 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-18 Thread Satish Duggana
Hi Kowshik,

Thanks for looking into the KIP and sending your comments.

5001. Under the section "Follower fetch protocol in detail", the
next-local-offset is the offset up to which the segments are copied to
remote storage. Instead, would last-tiered-offset be a better name than
next-local-offset? last-tiered-offset seems to naturally align well with
the definition provided in the KIP.

Both next-local-offset and local-log-start-offset were introduced to
refer to offsets related to the local log. We are fine with
last-tiered-offset too, as you suggested.

5002. After leadership is established for a partition, the leader would
begin uploading a segment to remote storage. If successful, the leader
would write the updated RemoteLogSegmentMetadata to the metadata topic (via
RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
useful that before the first time the segment is uploaded by the leader for
a partition, the leader should ensure to catch up to all the metadata
events written so far in the metadata topic for that partition (ex: by
previous leader). To achieve this, the leader could start a lease (using an
establish_leader metadata event) before commencing tiering, and wait until
the event is read back. For example, this seems useful to avoid cases where
zombie leaders can be active for the same partition. This can also prove
useful to help avoid making decisions on which segments to be uploaded for
a partition, until the current leader has caught up to a complete view of
all segments uploaded for the partition so far (otherwise this may cause
same segment being uploaded twice -- once by the previous leader and then
by the new leader).

We allow copying segments to remote storage which may have common
offsets. Please go through the KIP to understand the follower fetch
protocol(1) and follower to leader transition(2).

https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition


5003. There is a natural interleaving between uploading a segment to remote
store, and, writing a metadata event for the same (via
RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is
uploaded, then the leader fails and a corresponding metadata event never
gets written. In such cases, the orphaned remote segment has to be
eventually deleted (since there is no confirmation of the upload). To
handle this, we could use 2 separate metadata events viz. copy_initiated
and copy_completed, so that copy_initiated events that don't have a
corresponding copy_completed event can be treated as garbage and deleted
from the remote object store by the broker.

We are already updating RLMM with RemoteLogSegmentMetadata before and after
copying log segments. We had a flag in RemoteLogSegmentMetadata indicating
whether it has been copied or not. But we are making changes in
RemoteLogSegmentMetadata to introduce a state field which will have the
respective started and finished states. This also covers other operations
like delete.
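
To make this concrete, here is a minimal sketch of what such a state field
could look like. The state names and fields below are illustrative
assumptions based on this discussion, not the final KIP interfaces.

public class RemoteLogSegmentMetadataSketch {

    // Hypothetical lifecycle states for a remote log segment; the names are
    // assumptions derived from the copy/delete "started"/"finished" events above.
    public enum State {
        COPY_SEGMENT_STARTED,
        COPY_SEGMENT_FINISHED,
        DELETE_SEGMENT_STARTED,
        DELETE_SEGMENT_FINISHED
    }

    private final String remoteLogSegmentId; // simplified; the KIP uses a RemoteLogSegmentId type
    private final long startOffset;
    private final long endOffset;            // inclusive, as discussed elsewhere in this thread
    private final State state;

    public RemoteLogSegmentMetadataSketch(String remoteLogSegmentId, long startOffset,
                                          long endOffset, State state) {
        this.remoteLogSegmentId = remoteLogSegmentId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.state = state;
    }

    public State state() {
        return state;
    }
}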

5004. In the default implementation of RLMM (using the internal topic
__remote_log_metadata), a separate topic called
__remote_segments_to_be_deleted is going to be used just to track failures
in removing remote log segments. A separate topic (effectively another
metadata stream) introduces some maintenance overhead and design
complexity. It seems to me that the same can be achieved just by using
the __remote_log_metadata topic with the following steps: 1) the leader
writes a delete_initiated metadata event, 2) the leader deletes the segment
and 3) the leader writes a delete_completed metadata event. Tiered segments
that have delete_initiated message and not delete_completed message, can be
considered to be a failure and retried.

Jun suggested in an earlier mail to keep this simple. As mentioned in our
earlier replies, we decided not to have this topic, and we updated the KIP.
As I mentioned in an earlier comment, we are adding state entries for
delete operations too.

5005. When a Kafka cluster is provisioned for the first time with KIP-405
tiered storage enabled, could you explain in the KIP about how the
bootstrap for __remote_log_metadata topic will be performed in the
default RLMM implementation?

The __remote_log_metadata topic is created by default with the
respective topic settings like partitions/replication-factor etc. Can you be
more specific on what you are looking for?

5008. The system-wide configuration 'remote.log.storage.enable' is used to
enable tiered storage. Can this be made a topic-level configuration, so
that the user can enable/disable tiered storage at a topic level rather
than a system-wide default for an entire Kafka cluster?

Yes, we mentioned in an earlier mail thread that it will be supported
at the topic level too; updated the KIP.
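
For illustration, enabling it per topic could then look roughly like the
AdminClient call below. This is only a sketch: the topic-level config key
name "remote.log.storage.enable" and the topic/bootstrap values are
assumptions taken from this discussion, not the final KIP.

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorageForTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder bootstrap server
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // The topic-level key name below is an assumption from this discussion thread.
            AlterConfigOp enableTiering = new AlterConfigOp(
                new ConfigEntry("remote.log.storage.enable", "true"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(topic, Collections.singletonList(enableTiering));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}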

5009. 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-13 Thread Kowshik Prakasam
Hi Harsha/Satish,

Thanks for the great KIP. Below are the first set of questions/suggestions
I had after making a pass on the KIP.

5001. Under the section "Follower fetch protocol in detail", the
next-local-offset is the offset up to which the segments are copied to
remote storage. Instead, would last-tiered-offset be a better name than
next-local-offset? last-tiered-offset seems to naturally align well with
the definition provided in the KIP.

5002. After leadership is established for a partition, the leader would
begin uploading a segment to remote storage. If successful, the leader
would write the updated RemoteLogSegmentMetadata to the metadata topic (via
RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems
useful that before the first time the segment is uploaded by the leader for
a partition, the leader should ensure to catch up to all the metadata
events written so far in the metadata topic for that partition (ex: by
previous leader). To achieve this, the leader could start a lease (using an
establish_leader metadata event) before commencing tiering, and wait until
the event is read back. For example, this seems useful to avoid cases where
zombie leaders can be active for the same partition. This can also prove
useful to help avoid making decisions on which segments to be uploaded for
a partition, until the current leader has caught up to a complete view of
all segments uploaded for the partition so far (otherwise this may cause
same segment being uploaded twice -- once by the previous leader and then
by the new leader).

5003. There is a natural interleaving between uploading a segment to remote
store, and, writing a metadata event for the same (via
RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is
uploaded, then the leader fails and a corresponding metadata event never
gets written. In such cases, the orphaned remote segment has to be
eventually deleted (since there is no confirmation of the upload). To
handle this, we could use 2 separate metadata events viz. copy_initiated
and copy_completed, so that copy_initiated events that don't have a
corresponding copy_completed event can be treated as garbage and deleted
from the remote object store by the broker.

5004. In the default implementation of RLMM (using the internal topic
__remote_log_metadata), a separate topic called
__remote_segments_to_be_deleted is going to be used just to track failures
in removing remote log segments. A separate topic (effectively another
metadata stream) introduces some maintenance overhead and design
complexity. It seems to me that the same can be achieved just by using
the __remote_log_metadata topic with the following steps: 1) the leader
writes a delete_initiated metadata event, 2) the leader deletes the segment
and 3) the leader writes a delete_completed metadata event. Tiered segments
that have delete_initiated message and not delete_completed message, can be
considered to be a failure and retried.

5005. When a Kafka cluster is provisioned for the first time with KIP-405
tiered storage enabled, could you explain in the KIP about how the
bootstrap for __remote_log_metadata topic will be performed in the
default RLMM implementation?

5006. I currently do not see details in the KIP on why RocksDB was chosen
as the default cache implementation, and how it is going to be used. Were
alternatives compared/considered? For example, it would be useful to
explain/evaluate the following: 1) debuggability of the RocksDB JNI
interface, 2) performance, 3) portability across platforms and 4) interface
parity of RocksDB's JNI API with its underlying C/C++ API.

5007. For the RocksDB cache (the default implementation of RLMM), what is
the relationship/mapping between the following: 1) # of tiered partitions,
2) # of partitions of metadata topic __remote_log_metadata and 3) # of
RocksDB instances? i.e. is the plan to have a RocksDB instance per tiered
partition, or per metadata topic partition, or just 1 per broker?

5008. The system-wide configuration 'remote.log.storage.enable' is used to
enable tiered storage. Can this be made a topic-level configuration, so
that the user can enable/disable tiered storage at a topic level rather
than a system-wide default for an entire Kafka cluster?

5009. Whenever a topic with tiered storage enabled is deleted, the
underlying actions require the topic data to be deleted in local store as
well as remote store, and eventually the topic metadata needs to be deleted
too. What is the role of the controller in deleting a topic and its
contents, while the topic has tiered storage enabled?

5010. RLMM APIs are currently synchronous, for example
RLMM.putRemoteLogSegmentData waits until the put operation is completed in
the remote metadata store. It may also block until the leader has caught up
to the metadata (not sure). Could we make these APIs asynchronous (ex:
based on java.util.concurrent.Future) to provide room for tapping
performance 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-08-06 Thread Satish Duggana
Hi Jun,
Thanks for your comments.

> At the high level, that approach sounds reasonable to
me. It would be useful to document how RLMM handles overlapping archived
offset ranges and how those overlapping segments are deleted through
retention.

Sure, we will document that in the KIP.

>How is the remaining part of the KIP coming along? To me, the two biggest
missing items are (1) more detailed documentation on how all the new APIs
are being used and (2) metadata format and usage in the internal
topic __remote_log_metadata.

We are working on updating the APIs based on the recent discussions and
getting the perf numbers by plugging in RocksDB as a cache store for RLMM.
We will update the KIP with the updated APIs and the above
requested details in a few days and let you know.

Thanks,
Satish.




On Wed, Aug 5, 2020 at 12:49 AM Jun Rao  wrote:
>
> Hi, Ying, Satish,
>
> Thanks for the reply. At the high level, that approach sounds reasonable to
> me. It would be useful to document how RLMM handles overlapping archived
> offset ranges and how those overlapping segments are deleted through
> retention.
>
> How is the remaining part of the KIP coming along? To me, the two biggest
> missing items are (1) more detailed documentation on how all the new APIs
> are being used and (2) metadata format and usage in the internal
> topic __remote_log_metadata.
>
> Thanks,
>
> Jun
>
> On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> > Thanks for your comment,
> >
> > 1001. Using the new leader as the source of truth may be fine too. What's
> > not clear to me is when a follower takes over as the new leader, from which
> > offset does it start archiving to the block storage. I assume that the new
> > leader starts from the latest archived offset by the previous leader, but
> > it seems that's not the case. It would be useful to document this in the
> > Wiki.
> >
> > When a follower becomes a leader, it needs to find out the offset from
> > which segments should be copied to remote storage. This is found by
> > traversing from the latest leader epoch in the leader epoch history
> > and finding the highest offset of a segment with that epoch copied into
> > remote storage, using the respective RLMM APIs. If it cannot find an
> > entry, it checks the previous leader epoch, and so on, until it finds an
> > entry. If there are no entries up to the earliest leader epoch in the
> > leader epoch cache, it starts copying the segments from the
> > earliest epoch entry's offset.
> > Added an example in the KIP here[1]. We will update RLMM APIs in the KIP.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> >
> > Satish.
> >
> >
> > On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana 
> > wrote:
> > >
> > > Hi Ying,
> > > Thanks for your comment.
> > >
> > > 1001. Using the new leader as the source of truth may be fine too. What's
> > > not clear to me is when a follower takes over as the new leader, from
> > which
> > > offset does it start archiving to the block storage. I assume that the
> > new
> > > leader starts from the latest archived offset by the previous leader,
> > but
> > > it seems that's not the case. It would be useful to document this in the
> > > Wiki.
> > >
> > > When a follower becomes a leader, it needs to find out the offset from
> > > which segments should be copied to remote storage. This is found by
> > > traversing from the latest leader epoch in the leader epoch history
> > > and finding the highest offset of a segment with that epoch copied into
> > > remote storage, using the respective RLMM APIs. If it cannot find an
> > > entry, it checks the previous leader epoch, and so on, until it finds an
> > > entry. If there are no entries up to the earliest leader epoch in the
> > > leader epoch cache, it starts copying the segments from the
> > > earliest epoch entry's offset.
> > > Added an example in the KIP here[1]. We will update RLMM APIs in the KIP.
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > >
> > >
> > > Satish.
> > >
> > >
> > > On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng 
> > wrote:
> > > >
> > > > Hi Jun,
> > > >
> > > > Thank you for the comment! The current KIP is not very clear about this
> > > > part.
> > > >
> > > > 1001. The new leader will start archiving from the earliest local
> > segment
> > > > that is not fully
> > > > covered by the "valid" remote data. "valid" means the (offset, leader
> > > > epoch) pair is valid
> > > > based on the leader-epoch history.
> > > >
> > > > There are some edge cases where the same offset range (with the same
> > leader
> > > > epoch) can
> > > > be copied to the remote storage more than once. But this kind of
> > > > duplication shouldn't be a
> > > > problem.
> > > >
> > > > Satish is going to explain the details in the KIP with examples.
> > > >
> > > >

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-31 Thread Jun Rao
Hi, Ying,

Thanks for the reply.

1001. Using the new leader as the source of truth may be fine too. What's
not clear to me is when a follower takes over as the new leader, from which
offset does it start archiving to the block storage. I assume that the new
leader starts from the latest archived offset by the previous leader, but
it seems that's not the case. It would be useful to document this in the
wiki.

Jun

On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng  wrote:

> 1001.
>
> We did consider this approach. The concerns are
> 1)  This makes unclean-leader-election rely on remote storage. In case the
> remote storage
>  is unavailable, Kafka will not be able to finish the
> unclean-leader-election.
> 2) Since the user set local retention time (or local retention bytes), I
> think we are expected to
> keep that much local data when possible (avoid truncating all the local
> data). But, as you said,
> unclean leader elections are very rare, this may not be a big problem.
>
> The current design uses the leader broker as source-of-truth. This is
> consistent with the
> existing Kafka behavior.
>
> By using remote storage as the source-of-truth, the follower logic can be a
> little simpler,
> but the leader logic is going to be more complex. Overall, I don't see
> many benefits of using remote storage as the source-of-truth.
>
>
>
> On Tue, Jul 28, 2020 at 10:25 AM Jun Rao  wrote:
>
> > Hi, Satish,
> >
> > Thanks for the reply.
> >
> > 1001. In your example, I was thinking that you could just download the
> > latest leader epoch from the object store. After that you know the leader
> > should end with offset 1100. The leader will delete all its local data
> > before offset 1000 and start accepting new messages at offset 1100.
> > Consumer requests for messages before offset 1100 will be served from the
> > object store. The benefit with this approach is that it's simpler to
> reason
> > about who is the source of truth. The downside is slightly  increased
> > unavailability window during unclean leader election. Since unclean
> leader
> > elections are rare, I am not sure if this is a big concern.
> >
> > 1008. Yes, I think introducing sth like local.retention.ms seems more
> > consistent.
> >
> > Jun
> >
> > On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana  >
> > wrote:
> >
> > > HI Jun,
> > > Thanks for your comments. We put our inline replies below.
> > >
> > > 1001. I was thinking that you could just use the tiered metadata to do
> > the
> > > reconciliation. The tiered metadata contains offset ranges and epoch
> > > history. Those should be enough for reconciliation purposes.
> > >
> > > If we use remote storage as the source-of-truth during
> > > unclean-leader-election, it's possible that after reconciliation the
> > > remote storage will have more recent data than the new leader's local
> > > storage. For example, the new leader's latest message is offset 1000,
> > > while the remote storage has message 1100. In such a case, the new
> > > leader will have to download the messages from 1001 to 1100, before
> > > accepting new messages from producers. Otherwise, there would be a gap
> > > in the local data between 1000 and 1101.
> > >
> > > Moreover, with the current design, leader epoch history is stored in
> > > remote storage, rather than the metadata topic. We did consider saving
> > > epoch history in remote segment metadata. But the concern is that
> > > there is currently no limit for the epoch history size. Theoretically,
> > > if a user has a very long remote retention time and there are very
> > > frequent leadership changes, the leader epoch history can become too
> > > long to fit into a regular Kafka message.
> > >
> > >
> > > 1003.3 Having just a serverEndpoint string is probably not enough.
> > > Connecting to a Kafka cluster may need various security credentials. We
> > can
> > > make RLMM configurable and pass in the properties through the
> configure()
> > > method. Ditto for RSM.
> > >
> > > RLMM and  RSM are already configurable and they take properties which
> > > start with "remote.log.metadata." and "remote.log.storage."
> > > respectively and a few others. We have listener-name as the config for
> > > RLMM and other properties(like security) can be sent as you suggested.
> > > We will update the KIP with the details.
> > >
> > >
> > > 1008.1 We started with log.retention.hours and log.retention.minutes,
> and
> > > added log.retention.ms later. If we are adding a new configuration, ms
> > > level config alone is enough and is simpler. We can build tools to make
> > the
> > > configuration at different granularities easier. The definition of
> > > log.retention.ms is "The number of milliseconds to keep a log file
> > before
> > > deleting it". The deletion is independent of whether tiering is enabled
> > or
> > > not. If this changes to just the local portion of the data, we are
> > changing
> > > the meaning of an existing configuration.
> > >
> > > We are fine with either 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-28 Thread Ying Zheng
1001.

We did consider this approach. The concerns are
1)  This makes unclean-leader-election rely on remote storage. In case the
remote storage
 is unavailable, Kafka will not be able to finish the
unclean-leader-election.
2) Since the user set local retention time (or local retention bytes), I
think we are expected to
keep that much local data when possible (avoid truncating all the local
data). But, as you said,
unclean leader elections are very rare, this may not be a big problem.

The current design uses the leader broker as source-of-truth. This is
consistent with the
existing Kafka behavior.

By using remote storage as the source-of-truth, the follower logic can be a
little simpler,
but the leader logic is going to be more complex. Overall, I don't see
many benefits of using remote storage as the source-of-truth.



On Tue, Jul 28, 2020 at 10:25 AM Jun Rao  wrote:

> Hi, Satish,
>
> Thanks for the reply.
>
> 1001. In your example, I was thinking that you could just download the
> latest leader epoch from the object store. After that you know the leader
> should end with offset 1100. The leader will delete all its local data
> before offset 1000 and start accepting new messages at offset 1100.
> Consumer requests for messages before offset 1100 will be served from the
> object store. The benefit with this approach is that it's simpler to reason
> about who is the source of truth. The downside is slightly  increased
> unavailability window during unclean leader election. Since unclean leader
> elections are rare, I am not sure if this is a big concern.
>
> 1008. Yes, I think introducing sth like local.retention.ms seems more
> consistent.
>
> Jun
>
> On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana 
> wrote:
>
> > HI Jun,
> > Thanks for your comments. We put our inline replies below.
> >
> > 1001. I was thinking that you could just use the tiered metadata to do
> the
> > reconciliation. The tiered metadata contains offset ranges and epoch
> > history. Those should be enough for reconciliation purposes.
> >
> > If we use remote storage as the source-of-truth during
> > unclean-leader-election, it's possible that after reconciliation the
> > remote storage will have more recent data than the new leader's local
> > storage. For example, the new leader's latest message is offset 1000,
> > while the remote storage has message 1100. In such a case, the new
> > leader will have to download the messages from 1001 to 1100, before
> > accepting new messages from producers. Otherwise, there would be a gap
> > in the local data between 1000 and 1101.
> >
> > Moreover, with the current design, leader epoch history is stored in
> > remote storage, rather than the metadata topic. We did consider saving
> > epoch history in remote segment metadata. But the concern is that
> > there is currently no limit for the epoch history size. Theoretically,
> > if a user has a very long remote retention time and there are very
> > frequent leadership changes, the leader epoch history can become too
> > long to fit into a regular Kafka message.
> >
> >
> > 1003.3 Having just a serverEndpoint string is probably not enough.
> > Connecting to a Kafka cluster may need various security credentials. We
> can
> > make RLMM configurable and pass in the properties through the configure()
> > method. Ditto for RSM.
> >
> > RLMM and  RSM are already configurable and they take properties which
> > start with "remote.log.metadata." and "remote.log.storage."
> > respectively and a few others. We have listener-name as the config for
> > RLMM and other properties(like security) can be sent as you suggested.
> > We will update the KIP with the details.
> >
> >
> > 1008.1 We started with log.retention.hours and log.retention.minutes, and
> > added log.retention.ms later. If we are adding a new configuration, ms
> > level config alone is enough and is simpler. We can build tools to make
> the
> > configuration at different granularities easier. The definition of
> > log.retention.ms is "The number of milliseconds to keep a log file
> before
> > deleting it". The deletion is independent of whether tiering is enabled
> or
> > not. If this changes to just the local portion of the data, we are
> changing
> > the meaning of an existing configuration.
> >
> > We are fine with either way. We can keep log.retention.* as the
> > effective (total) log retention instead of the local log retention. With this
> > convention, we need to introduce local.log.retention.* instead of the
> > remote.log.retention.ms that we proposed. If log.retention.ms is set to -1,
> > then remote retention is also considered unlimited, but the user should
> > still be able to set local.log.retention.ms.
> > So, we need to introduce local.log.retention.ms and
> > local.log.retention.bytes, which should always be <=
> > log.retention.ms/bytes respectively.
> >
> >
> >
> > On Fri, Jul 24, 2020 at 3:37 AM Jun Rao  wrote:
> > >
> > > Hi, Satish,
> > >
> > > Thanks for the reply. A few quick comments below.

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-28 Thread Jun Rao
Hi, Satish,

Thanks for the reply.

1001. In your example, I was thinking that you could just download the
latest leader epoch from the object store. After that you know the leader
should end with offset 1100. The leader will delete all its local data
before offset 1000 and start accepting new messages at offset 1100.
Consumer requests for messages before offset 1100 will be served from the
object store. The benefit with this approach is that it's simpler to reason
about who is the source of truth. The downside is slightly  increased
unavailability window during unclean leader election. Since unclean leader
elections are rare, I am not sure if this is a big concern.

1008. Yes, I think introducing sth like local.retention.ms seems more
consistent.

Jun

On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana 
wrote:

> HI Jun,
> Thanks for your comments. We put our inline replies below.
>
> 1001. I was thinking that you could just use the tiered metadata to do the
> reconciliation. The tiered metadata contains offset ranges and epoch
> history. Those should be enough for reconciliation purposes.
>
> If we use remote storage as the source-of-truth during
> unclean-leader-election, it's possible that after reconciliation the
> remote storage will have more recent data than the new leader's local
> storage. For example, the new leader's latest message is offset 1000,
> while the remote storage has message 1100. In such a case, the new
> leader will have to download the messages from 1001 to 1100, before
> accepting new messages from producers. Otherwise, there would be a gap
> in the local data between 1000 and 1101.
>
> Moreover, with the current design, leader epoch history is stored in
> remote storage, rather than the metadata topic. We did consider saving
> epoch history in remote segment metadata. But the concern is that
> there is currently no limit for the epoch history size. Theoretically,
> if a user has a very long remote retention time and there are very
> frequent leadership changes, the leader epoch history can become too
> long to fit into a regular Kafka message.
>
>
> 1003.3 Having just a serverEndpoint string is probably not enough.
> Connecting to a Kafka cluster may need various security credentials. We can
> make RLMM configurable and pass in the properties through the configure()
> method. Ditto for RSM.
>
> RLMM and  RSM are already configurable and they take properties which
> start with "remote.log.metadata." and "remote.log.storage."
> respectively and a few others. We have listener-name as the config for
> RLMM and other properties(like security) can be sent as you suggested.
> We will update the KIP with the details.
>
>
> 1008.1 We started with log.retention.hours and log.retention.minutes, and
> added log.retention.ms later. If we are adding a new configuration, ms
> level config alone is enough and is simpler. We can build tools to make the
> configuration at different granularities easier. The definition of
> log.retention.ms is "The number of milliseconds to keep a log file before
> deleting it". The deletion is independent of whether tiering is enabled or
> not. If this changes to just the local portion of the data, we are changing
> the meaning of an existing configuration.
>
> We are fine with either way. We can keep log.retention.* as the
> effective (total) log retention instead of the local log retention. With this
> convention, we need to introduce local.log.retention.* instead of the
> remote.log.retention.ms that we proposed. If log.retention.ms is set to -1,
> then remote retention is also considered unlimited, but the user should
> still be able to set local.log.retention.ms.
> So, we need to introduce local.log.retention.ms and
> local.log.retention.bytes, which should always be <=
> log.retention.ms/bytes respectively.
>
>
>
> On Fri, Jul 24, 2020 at 3:37 AM Jun Rao  wrote:
> >
> > Hi, Satish,
> >
> > Thanks for the reply. A few quick comments below.
> >
> > 1001. I was thinking that you could just use the tiered metadata to do
> the
> > reconciliation. The tiered metadata contains offset ranges and epoch
> > history. Those should be enough for reconciliation purposes.
> >
> > 1003.3 Having just a serverEndpoint string is probably not enough.
> > Connecting to a Kafka cluster may need various security credentials. We
> can
> > make RLMM configurable and pass in the properties through the configure()
> > method. Ditto for RSM.
> >
> > 1008.1 We started with log.retention.hours and log.retention.minutes, and
> > added log.retention.ms later. If we are adding a new configuration, ms
> > level config alone is enough and is simpler. We can build tools to make
> the
> > configuration at different granularities easier. The definition of
> > log.retention.ms is "The number of milliseconds to keep a log file
> before
> > deleting it". The deletion is independent of whether tiering is enabled
> or
> > not. If this changes to just the local portion of the data, we are
> changing
> > the meaning of an 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-28 Thread Satish Duggana
HI Jun,
Thanks for your comments. We put our inline replies below.

1001. I was thinking that you could just use the tiered metadata to do the
reconciliation. The tiered metadata contains offset ranges and epoch
history. Those should be enough for reconciliation purposes.

If we use remote storage as the source-of-truth during
unclean-leader-election, it's possible that after reconciliation the
remote storage will have more recent data than the new leader's local
storage. For example, the new leader's latest message is offset 1000,
while the remote storage has message 1100. In such a case, the new
leader will have to download the messages from 1001 to 1100, before
accepting new messages from producers. Otherwise, there would be a gap
in the local data between 1000 and 1101.

Moreover, with the current design, leader epoch history is stored in
remote storage, rather than the metadata topic. We did consider saving
epoch history in remote segment metadata. But the concern is that
there is currently no limit for the epoch history size. Theoretically,
if a user has a very long remote retention time and there are very
frequent leadership changes, the leader epoch history can become too
long to fit into a regular Kafka message.


1003.3 Having just a serverEndpoint string is probably not enough.
Connecting to a Kafka cluster may need various security credentials. We can
make RLMM configurable and pass in the properties through the configure()
method. Ditto for RSM.

RLMM and RSM are already configurable and they take properties which
start with "remote.log.metadata." and "remote.log.storage."
respectively, and a few others. We have listener-name as the config for
RLMM, and other properties (like security) can be sent as you suggested.
We will update the KIP with the details.
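
As a rough illustration of that mechanism, a plugin could pick out its own
prefixed keys in configure(). The prefix-stripping behaviour shown below is
an assumption for illustration only, not the exact KIP code.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.Configurable;

public class PrefixedConfigRlmmSketch implements Configurable {
    // Only the keys starting with this prefix are intended for this plugin.
    private static final String PREFIX = "remote.log.metadata.";
    private final Map<String, Object> pluginConfigs = new HashMap<>();

    @Override
    public void configure(Map<String, ?> configs) {
        for (Map.Entry<String, ?> entry : configs.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                // Strip the prefix so the plugin sees plain property names.
                pluginConfigs.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
            }
        }
    }

    public Map<String, Object> pluginConfigs() {
        return pluginConfigs;
    }
}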


1008.1 We started with log.retention.hours and log.retention.minutes, and
added log.retention.ms later. If we are adding a new configuration, ms
level config alone is enough and is simpler. We can build tools to make the
configuration at different granularities easier. The definition of
log.retention.ms is "The number of milliseconds to keep a log file before
deleting it". The deletion is independent of whether tiering is enabled or
not. If this changes to just the local portion of the data, we are changing
the meaning of an existing configuration.

We are fine with either way. We can keep log.retention.* as the
effective (total) log retention instead of the local log retention. With this
convention, we need to introduce local.log.retention.* instead of the
remote.log.retention.ms that we proposed. If log.retention.ms is set to -1,
then remote retention is also considered unlimited, but the user should
still be able to set local.log.retention.ms.
So, we need to introduce local.log.retention.ms and
local.log.retention.bytes, which should always be <=
log.retention.ms/bytes respectively.
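
A small sketch of those proposed semantics follows (illustrative only; the
config names, the -1 convention, and the defaulting behaviour are taken from
this thread as assumptions, not from finalized KIP text).

public final class LocalRetentionSemanticsSketch {

    // Returns the effective local retention in ms, assuming local.log.retention.ms
    // falls back to log.retention.ms when unset and must not exceed it.
    // A value of -1 for log.retention.ms means unlimited overall retention.
    public static long effectiveLocalRetentionMs(long logRetentionMs, Long localLogRetentionMs) {
        if (localLogRetentionMs == null) {
            return logRetentionMs;
        }
        if (logRetentionMs != -1 && localLogRetentionMs > logRetentionMs) {
            throw new IllegalArgumentException(
                "local.log.retention.ms must be <= log.retention.ms");
        }
        return localLogRetentionMs;
    }

    public static void main(String[] args) {
        // Example: total retention 7 days, local retention 1 day.
        System.out.println(effectiveLocalRetentionMs(7 * 24 * 60 * 60 * 1000L,
                                                     24 * 60 * 60 * 1000L));
    }
}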



On Fri, Jul 24, 2020 at 3:37 AM Jun Rao  wrote:
>
> Hi, Satish,
>
> Thanks for the reply. A few quick comments below.
>
> 1001. I was thinking that you could just use the tiered metadata to do the
> reconciliation. The tiered metadata contains offset ranges and epoch
> history. Those should be enough for reconciliation purposes.
>
> 1003.3 Having just a serverEndpoint string is probably not enough.
> Connecting to a Kafka cluster may need various security credentials. We can
> make RLMM configurable and pass in the properties through the configure()
> method. Ditto for RSM.
>
> 1008.1 We started with log.retention.hours and log.retention.minutes, and
> added log.retention.ms later. If we are adding a new configuration, ms
> level config alone is enough and is simpler. We can build tools to make the
> configuration at different granularities easier. The definition of
> log.retention.ms is "The number of milliseconds to keep a log file before
> deleting it". The deletion is independent of whether tiering is enabled or
> not. If this changes to just the local portion of the data, we are changing
> the meaning of an existing configuration.
>
> Jun
>
>
> On Thu, Jul 23, 2020 at 11:04 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> >
> > Thank you for the comments! Ying, Harsha and I discussed and put our
> > comments below.
> >
> >
> > 1001. The KIP described a few scenarios of unclean leader elections. This
> > is very useful, but I am wondering if this is the best approach. My
> > understanding of the proposed approach is to allow the new (unclean) leader
> > to take new messages immediately. While this increases availability, it
> > creates the problem that there could be multiple conflicting segments in
> > the remote store for the same offset range. This seems to make it harder
> > for RLMM to determine which archived log segments contain the correct data.
> > For example, an archived log segment could at one time be the correct data,
> > but be changed to incorrect data after an unclean leader election. An
> > alternative approach is to let the unclean leader use the archived data as
> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-23 Thread Jun Rao
Hi, Satish,

Thanks for the reply. A few quick comments below.

1001. I was thinking that you could just use the tiered metadata to do the
reconciliation. The tiered metadata contains offset ranges and epoch
history. Those should be enough for reconciliation purposes.

1003.3 Having just a serverEndpoint string is probably not enough.
Connecting to a Kafka cluster may need various security credentials. We can
make RLMM configurable and pass in the properties through the configure()
method. Ditto for RSM.

1008.1 We started with log.retention.hours and log.retention.minutes, and
added log.retention.ms later. If we are adding a new configuration, ms
level config alone is enough and is simpler. We can build tools to make the
configuration at different granularities easier. The definition of
log.retention.ms is "The number of milliseconds to keep a log file before
deleting it". The deletion is independent of whether tiering is enabled or
not. If this changes to just the local portion of the data, we are changing
the meaning of an existing configuration.

Jun


On Thu, Jul 23, 2020 at 11:04 AM Satish Duggana 
wrote:

> Hi Jun,
>
> Thank you for the comments! Ying, Harsha and I discussed and put our
> comments below.
>
>
> 1001. The KIP described a few scenarios of unclean leader elections. This
> is very useful, but I am wondering if this is the best approach. My
> understanding of the proposed approach is to allow the new (unclean) leader
> to take new messages immediately. While this increases availability, it
> creates the problem that there could be multiple conflicting segments in
> the remote store for the same offset range. This seems to make it harder
> for RLMM to determine which archived log segments contain the correct data.
> For example, an archived log segment could at one time be the correct data,
> but be changed to incorrect data after an unclean leader election. An
> alternative approach is to let the unclean leader use the archived data as
> the source of truth. So, when the new (unclean) leader takes over, it first
> reconciles the local data based on the archived data before taking new
> messages. This makes the job of RLMM a bit easier since all archived data
> are considered correct. This increases availability a bit. However, since
> unclean leader elections are rare, this may be ok.
>
> Firstly, We don't want to assume the remote storage is more reliable than
> Kafka. Kafka unclean leader election usually happens when there is a large
> scale outage that impacts multiple racks (or even multiple availability
> zones). In such a case, the remote storage may be unavailable or unstable.
> Pulling a large amount of data from the remote storage to reconcile the
> local data may also exacerbate the outage. With the current design, the new
> leader can start working even when the remote storage is temporarily
> unavailable.
>
> Secondly, it is not easier to implement the reconciling logic at the leader
> side. It can take a long time for the new leader to download the remote
> data and rebuild local producer id / leader epoch information. During this
> period, the leader cannot accept any requests from the clients and
> followers. We have to introduce a new state for the leader, and a new error
> code to let the clients / followers know what is happening.
>
>
>
> 1002. RemoteStorageManager.
> 1002.1 There seem to be some inconsistencies in RemoteStorageManager. We
> pass in a RemoteLogSegmentId to copyLogSegment(). For all other methods, we pass
> in RemoteLogSegmentMetadata.
>
> Nice catch, we can have the RemoteLogSegmentMetadata for copyLogSegment
> too.
>
> 1002.2 Is endOffset in RemoteLogSegmentMetadata inclusive or exclusive?
>
> It is inclusive.
>
> 1002.3 It seems that we need an api to get the leaderEpoch history for a
> partition.
>
> Yes, updated the KIP with the new method.
>
>
> 1002.4 Could you define the type of RemoteLogSegmentContext?
>
> This is removed in the latest code and it is not needed.
>
>
> 1003 RemoteLogMetadataManager
>
> 1003.1 I am not sure why we need both of the following methods
> in RemoteLogMetadataManager. Could we combine them into one that takes in
> offset and returns RemoteLogSegmentMetadata?
> RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition topicPartition,
> long offset) throws IOException;
> RemoteLogSegmentMetadata getRemoteLogSegmentMetadata(RemoteLogSegmentId
> remoteLogSegmentId) throws IOException;
>
> Good point, these can be merged for now. I guess we needed them in earlier
> version of the implementation but it is not needed now.
>
> 1003.2 There seems to be some inconsistencies in the methods below. I am
> not sure why one takes RemoteLogSegmentMetadata and the other
> takes RemoteLogSegmentId.
> void putRemoteLogSegmentData(RemoteLogSegmentMetadata
> remoteLogSegmentMetadata) throws IOException;
> void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
> remoteLogSegmentId) throws IOException;
>
> RLMM stores 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-23 Thread Satish Duggana
Hi Jun,

Thank you for the comments! Ying, Harsha and I discussed and put our
comments below.


1001. The KIP described a few scenarios of unclean leader elections. This
is very useful, but I am wondering if this is the best approach. My
understanding of the proposed approach is to allow the new (unclean) leader
to take new messages immediately. While this increases availability, it
creates the problem that there could be multiple conflicting segments in
the remote store for the same offset range. This seems to make it harder
for RLMM to determine which archived log segments contain the correct data.
For example, an archived log segment could at one time be the correct data,
but be changed to incorrect data after an unclean leader election. An
alternative approach is to let the unclean leader use the archived data as
the source of truth. So, when the new (unclean) leader takes over, it first
reconciles the local data based on the archived data before taking new
messages. This makes the job of RLMM a bit easier since all archived data
are considered correct. This increases availability a bit. However, since
unclean leader elections are rare, this may be ok.

Firstly, We don't want to assume the remote storage is more reliable than
Kafka. Kafka unclean leader election usually happens when there is a large
scale outage that impacts multiple racks (or even multiple availability
zones). In such a case, the remote storage may be unavailable or unstable.
Pulling a large amount of data from the remote storage to reconcile the
local data may also exacerbate the outage. With the current design, the new
leader can start working even when the remote storage is temporarily
unavailable.

Secondly, it is not easier to implement the reconciling logic at the leader
side. It can take a long time for the new leader to download the remote
data and rebuild local producer id / leader epoch information. During this
period, the leader cannot accept any requests from the clients and
followers. We have to introduce a new state for the leader, and a new error
code to let the clients / followers know what is happening.



1002. RemoteStorageManager.
1002.1 There seem to be some inconsistencies in RemoteStorageManager. We
pass in a RemoteLogSegmentId to copyLogSegment(). For all other methods, we pass
in RemoteLogSegmentMetadata.

Nice catch, we can have the RemoteLogSegmentMetadata for copyLogSegment
too.

1002.2 Is endOffset in RemoteLogSegmentMetadata inclusive or exclusive?

It is inclusive.

1002.3 It seems that we need an api to get the leaderEpoch history for a
partition.

Yes, updated the KIP with the new method.


1002.4 Could you define the type of RemoteLogSegmentContext?

This is removed in the latest code and it is not needed.


1003 RemoteLogMetadataManager

1003.1 I am not sure why we need both of the following methods
in RemoteLogMetadataManager. Could we combine them into one that takes in
offset and returns RemoteLogSegmentMetadata?
RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition topicPartition,
long offset) throws IOException;
RemoteLogSegmentMetadata getRemoteLogSegmentMetadata(RemoteLogSegmentId
remoteLogSegmentId) throws IOException;

Good point, these can be merged for now. I guess we needed them in earlier
version of the implementation but it is not needed now.

1003.2 There seems to be some inconsistencies in the methods below. I am
not sure why one takes RemoteLogSegmentMetadata and the other
takes RemoteLogSegmentId.
void putRemoteLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata) throws IOException;
void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
remoteLogSegmentId) throws IOException;

RLMM stores RemoteLogSegmentMetadata, which is identified by a
RemoteLogSegmentId. So, when it is added, it takes
RemoteLogSegmentMetadata. The `delete` operation needs only a RemoteLogSegmentId,
as the RemoteLogSegmentMetadata can be identified with that RemoteLogSegmentId.
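
A tiny sketch of that asymmetry is below; the types are simplified stand-ins
for the KIP's classes and the usage is only illustrative.

import java.io.IOException;

final class SegmentId {
    // In the KIP this wraps the topic-partition plus a unique id; omitted here.
}

final class SegmentMetadata {
    private final SegmentId id;
    SegmentMetadata(SegmentId id) { this.id = id; }
    SegmentId id() { return id; }
}

interface RlmmSketch {
    void putRemoteLogSegmentData(SegmentMetadata metadata) throws IOException; // add needs full metadata
    void deleteRemoteLogSegmentMetadata(SegmentId id) throws IOException;      // delete needs only the id
}

class RlmmUsageSketch {
    static void copyThenDelete(RlmmSketch rlmm, SegmentMetadata metadata) throws IOException {
        rlmm.putRemoteLogSegmentData(metadata);
        rlmm.deleteRemoteLogSegmentMetadata(metadata.id()); // the id alone identifies the metadata
    }
}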

1003.3 In void onServerStarted(final String serverEndpoint), what
is serverEndpoint used for?

This can be used by the RLMM implementation to connect to the local Kafka
cluster. In the case of the default implementation, it is used to initialize
the Kafka clients connecting to the local cluster.
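
For example (a sketch only, not the KIP's actual default implementation),
onServerStarted() could bootstrap the clients that a topic-based RLMM uses
against the local cluster. The serializer choices and the group id below are
illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class RlmmClientBootstrapSketch {
    private KafkaProducer<byte[], byte[]> metadataProducer;
    private KafkaConsumer<byte[], byte[]> metadataConsumer;

    public void onServerStarted(final String serverEndpoint) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", serverEndpoint);
        producerProps.put("key.serializer", ByteArraySerializer.class.getName());
        producerProps.put("value.serializer", ByteArraySerializer.class.getName());
        metadataProducer = new KafkaProducer<>(producerProps);

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", serverEndpoint);
        consumerProps.put("group.id", "remote-log-metadata-reader"); // illustrative group id
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());
        metadataConsumer = new KafkaConsumer<>(consumerProps);
    }
}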

1004. It would be useful to document how all the new APIs are being used.
For example, when is RemoteLogSegmentMetadata.markedForDeletion being set
and used? How are
RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being used?

RLMM APIs are going through changes and they should be ready in a few
days. I will update the KIP and the mail thread once they are ready.

1005. Handling partition deletion: The KIP says "RLMM will eventually
delete these segments by using RemoteStorageManager." Which replica does
this logic?

This is a good point. When a topic is deleted, it will not have any
leader/followers to do the cleanup. We will have a cleaner agent on a
single broker in the cluster to do this cleanup; we plan to add that in

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-20 Thread Jun Rao
Hi, Satish, Ying, Harsha,

Thanks for the updated KIP. A few more comments below.

1000. Regarding Colin's question on querying the metadata directly in the
remote block store. One issue is that not all block stores offer the needed
api to query the metadata. For example, S3 only offers an api to list
objects under a prefix and this api has the eventual consistency semantic.

1001. The KIP described a few scenarios of unclean leader elections. This
is very useful, but I am wondering if this is the best approach. My
understanding of the proposed approach is to allow the new (unclean) leader
to take new messages immediately. While this increases availability, it
creates the problem that there could be multiple conflicting segments in
the remote store for the same offset range. This seems to make it harder
for RLMM to determine which archived log segments contain the correct data.
For example, an archived log segment could at one time be the correct data,
but be changed to incorrect data after an unclean leader election. An
alternative approach is to let the unclean leader use the archived data as
the source of truth. So, when the new (unclean) leader takes over, it first
reconciles the local data based on the archived data before taking new
messages. This makes the job of RLMM a bit easier since all archived data
are considered correct. This increases availability a bit. However, since
unclean leader elections are rare, this may be ok.

1002. RemoteStorageManager.
1002.1 There seem to be some inconsistencies in RemoteStorageManager. We
pass in a RemoteLogSegmentId to copyLogSegment(). For all other methods, we pass
in RemoteLogSegmentMetadata.
1002.2 Is endOffset in RemoteLogSegmentMetadata inclusive or exclusive?
1002.3 It seems that we need an api to get the leaderEpoch history for a
partition.
1002.4 Could you define the type of RemoteLogSegmentContext?

1003 RemoteLogMetadataManager
1003.1 I am not sure why we need both of the following methods
in RemoteLogMetadataManager. Could we combine them into one that takes in
offset and returns RemoteLogSegmentMetadata?
RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition topicPartition,
long offset) throws IOException;
RemoteLogSegmentMetadata getRemoteLogSegmentMetadata(RemoteLogSegmentId
remoteLogSegmentId) throws IOException;
1003.2 There seems to be some inconsistencies in the methods below. I am
not sure why one takes RemoteLogSegmentMetadata and the other
takes RemoteLogSegmentId.
void putRemoteLogSegmentData(RemoteLogSegmentMetadata
remoteLogSegmentMetadata) throws IOException;
void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId
remoteLogSegmentId) throws IOException;
1003.3 In void onServerStarted(final String serverEndpoint), what
is serverEndpoint used for?

1004. It would be useful to document how all the new APIs are being used.
For example, when is RemoteLogSegmentMetadata.markedForDeletion being set
and used? How are
RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being used?

1005. Handling partition deletion: The KIP says "RLMM will eventually
delete these segments by using RemoteStorageManager." Which replica does
this logic?

1006. "If there are any failures in removing remote log segments then those
are stored in a specific topic (default as __remote_segments_to_be_deleted)
and user can consume the events(which contain remote-log-segment-id) from
that topic and clean them up from remote storage.  " Not sure if it's worth
the complexity of adding another topic. Could we just retry?

1007. RemoteFetchPurgatory: Could we just reuse the existing fetchPurgatory?

1008. Configurations:
1008.1 remote.log.retention.ms, remote.log.retention.minutes,
remote.log.retention.hours: It seems that we just need the ms one. Also,
are we changing the meaning of existing config log.retention.ms to mean the
local retention? For backward compatibility, it's better to not change the
meaning of existing configurations.
1008.2 Should remote.log.storage.enable be at the topic level?

1009. It would be useful to list all limitations in a separate section:
compacted topic, JBOD, etc. Also, is changing a topic from delete to
compact and vice versa allowed when tiering is enabled?

1010. Thanks for performance numbers. Are those with RocksDB as the cache?

Thanks,

Jun

On Wed, Jul 15, 2020 at 6:12 PM Harsha Ch  wrote:

> Hi Colin,
>Thats not what we said in the previous email. RLMM is
> pluggable storage and by running numbers even 1PB data you do not need more
> than 10GB local storage.
> If in future this becomes a blocker for any users we can revisit but this
> does not warrant another implementation at this point to push the data to
> remote storage.
> We can ofcourse implement another RLMM that is optional for users to
> configure to push to remote. But that doesn't need to be addressed in this
> KIP.
>
> Thanks,
> Harsha
>
> On Wed, Jul 15, 2020 at 5:50 PM Colin McCabe  wrote:
>
> > Hi Ying,
> >
> > Thanks for the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-15 Thread Harsha Ch
Hi Colin,
   That's not what we said in the previous email. RLMM is
pluggable storage, and by running the numbers, even with 1PB of data you do not need more
than 10GB of local storage.
If in the future this becomes a blocker for any users we can revisit, but this
does not warrant another implementation at this point to push the data to
remote storage.
We can of course implement another RLMM that users can optionally
configure to push to remote. But that doesn't need to be addressed in this
KIP.

Thanks,
Harsha

On Wed, Jul 15, 2020 at 5:50 PM Colin McCabe  wrote:

> Hi Ying,
>
> Thanks for the response.
>
> It sounds like you agree that storing the metadata in the remote storage
> would be a better design overall.  Given that that's true, is there any
> reason to include the worse implementation based on RocksDB?
>
> Choosing a long-term metadata store is not something that we should do
> lightly.  It can take users years to migrate from one metadata store to the
> other.  I also don't think it's realistic or desirable for users to write
> their own metadata stores.  Even assuming that they could do a good job at
> this, it would create huge fragmentation in the Kafka ecosystem.
>
> best,
> Colin
>
>
> On Tue, Jul 14, 2020, at 09:39, Ying Zheng wrote:
> > Hi Jun,
> > Hi Colin,
> >
> > Satish and I are still discussing some details about how to handle
> > transactions / producer ids. Satish is going to make some minor changes
> to
> > RLMM API and other parts. Other than that, we have finished updating the
> KIP
> >
> > I agree with Colin that the current design of using rocksDB is not
> > optimal. But this design is simple and should work for almost all the
> > existing Kafka users. RLMM is a plugin. Users can replace rocksDB with
> > their own RLMM implementation, if needed. So, I think we can keep rocksDB
> > for now. What do you think?
> >
> >
> > Thanks,
> > Ying
> >
> >
> >
> > On Tue, Jul 7, 2020 at 10:35 AM Jun Rao  wrote:
> >
> > > Hi, Ying,
> > >
> > > Thanks for the update. It's good to see the progress on this. Please
> let us
> > > know when you are done updating the KIP wiki.
> > >
> > > Jun
> > >
> > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng 
> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Satish and I have added more design details in the KIP, including
> how to
> > > > keep consistency between replicas (especially when there is
> leadership
> > > > changes / log truncations) and new metrics. We also made some other
> minor
> > > > changes in the doc. We will finish the KIP changes in the next
> couple of
> > > > days. We will let you know when we are done. Most of the changes are
> > > > already updated to the wiki KIP. You can take a look. But it's not
> the
> > > > final version yet.
> > > >
> > > > As for the implementation, the code is mostly done and we already had
> > > some
> > > > feature tests / system tests. I have added the performance test
> results
> > > in
> > > > the KIP. However the recent design changes (e.g. leader epoch info
> > > > management / log truncation / some of the new metrics) have not been
> > > > implemented yet. It will take about 2 weeks for us to implement
> after you
> > > > review and agree with those design changes.
> > > >
> > > >
> > > >
> > > > On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
> > > >
> > > > > Hi, Satish, Harsha,
> > > > >
> > > > > Any new updates on the KIP? This feature is one of the most
> important
> > > and
> > > > > most requested features in Apache Kafka right now. It would be
> helpful
> > > if
> > > > > we can make sustained progress on this. Could you share how far
> along
> > > is
> > > > > the design/implementation right now? Is there anything that other
> > > people
> > > > > can help to get it across the line?
> > > > >
> > > > > As for "transactional support" and "follower
> requests/replication", no
> > > > > further comments from me as long as the producer state and leader
> epoch
> > > > can
> > > > > be restored properly from the object store when needed.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <
> > > satish.dugg...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > We did not want to add many implementation details in the KIP.
> But we
> > > > > > decided to add them in the KIP as appendix or
> sub-sections(including
> > > > > > follower fetch protocol) to describe the flow with the main
> cases.
> > > > > > That will answer most of the queries. I will update on this mail
> > > > > > thread when the respective sections are updated.
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Satish,
> > > > > > >
> > > > > > > A couple of questions specific to the section "Follower
> > > > > > > Requests/Replication", pages 16:17 in the design document [1].
> > > > > > >
> > > > > > > 900. It is mentioned that followers fetch 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-15 Thread Colin McCabe
Hi Ying,

Thanks for the response.

It sounds like you agree that storing the metadata in the remote storage would 
be a better design overall.  Given that that's true, is there any reason to 
include the worse implementation based on RocksDB?

Choosing a long-term metadata store is not something that we should do lightly. 
It can take users years to migrate from one metadata store to the other. I also
don't think it's realistic or desirable for users to write their own metadata 
stores.  Even assuming that they could do a good job at this, it would create 
huge fragmentation in the Kafka ecosystem.

best,
Colin


On Tue, Jul 14, 2020, at 09:39, Ying Zheng wrote:
> Hi Jun,
> Hi Colin,
> 
> Satish and I are still discussing some details about how to handle
> transactions / producer ids. Satish is going to make some minor changes to
> RLMM API and other parts. Other than that, we have finished updating the KIP
> 
> I agree with Colin that the current design of using rocksDB is not
> optimal. But this design is simple and should work for almost all the
> existing Kafka users. RLMM is a plugin. Users can replace rocksDB with
> their own RLMM implementation, if needed. So, I think we can keep rocksDB
> for now. What do you think?
> 
> 
> Thanks,
> Ying
> 
> 
> 
> On Tue, Jul 7, 2020 at 10:35 AM Jun Rao  wrote:
> 
> > Hi, Ying,
> >
> > Thanks for the update. It's good to see the progress on this. Please let us
> > know when you are done updating the KIP wiki.
> >
> > Jun
> >
> > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:
> >
> > > Hi Jun,
> > >
> > > Satish and I have added more design details in the KIP, including how to
> > > keep consistency between replicas (especially when there is leadership
> > > changes / log truncations) and new metrics. We also made some other minor
> > > changes in the doc. We will finish the KIP changes in the next couple of
> > > days. We will let you know when we are done. Most of the changes are
> > > already updated to the wiki KIP. You can take a look. But it's not the
> > > final version yet.
> > >
> > > As for the implementation, the code is mostly done and we already had
> > some
> > > feature tests / system tests. I have added the performance test results
> > in
> > > the KIP. However the recent design changes (e.g. leader epoch info
> > > management / log truncation / some of the new metrics) have not been
> > > implemented yet. It will take about 2 weeks for us to implement after you
> > > review and agree with those design changes.
> > >
> > >
> > >
> > > On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
> > >
> > > > Hi, Satish, Harsha,
> > > >
> > > > Any new updates on the KIP? This feature is one of the most important
> > and
> > > > most requested features in Apache Kafka right now. It would be helpful
> > if
> > > > we can make sustained progress on this. Could you share how far along
> > is
> > > > the design/implementation right now? Is there anything that other
> > people
> > > > can help to get it across the line?
> > > >
> > > > As for "transactional support" and "follower requests/replication", no
> > > > further comments from me as long as the producer state and leader epoch
> > > can
> > > > be restored properly from the object store when needed.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <
> > satish.dugg...@gmail.com>
> > > > wrote:
> > > >
> > > > > We did not want to add many implementation details in the KIP. But we
> > > > > decided to add them in the KIP as appendix or sub-sections(including
> > > > > follower fetch protocol) to describe the flow with the main cases.
> > > > > That will answer most of the queries. I will update on this mail
> > > > > thread when the respective sections are updated.
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> > > > >  wrote:
> > > > > >
> > > > > > Hi Satish,
> > > > > >
> > > > > > A couple of questions specific to the section "Follower
> > > > > > Requests/Replication", pages 16:17 in the design document [1].
> > > > > >
> > > > > > 900. It is mentioned that followers fetch auxiliary states from the
> > > > > > remote storage.
> > > > > >
> > > > > > 900.a Does the consistency model of the external storage impacts
> > > reads
> > > > > > of leader epochs and other auxiliary data?
> > > > > >
> > > > > > 900.b What are the benefits of using a mechanism to store and
> > access
> > > > > > the leader epochs which is different from other metadata associated
> > > to
> > > > > > tiered segments? What are the benefits of retrieving this
> > information
> > > > > > on-demand from the follower rather than relying on propagation via
> > > the
> > > > > > topic __remote_log_metadata? What are the advantages over using a
> > > > > > dedicated control structure (e.g. a new record type) propagated via
> > > > > > this topic? Since in the document, different control paths are
> > > > > > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-14 Thread Ying Zheng
Hi Jun,
Hi Colin,

Satish and I are still discussing some details about how to handle
transactions / producer ids. Satish is going to make some minor changes to
the RLMM API and other parts. Other than that, we have finished updating the KIP.

I agree with Colin that the current design of using RocksDB is not
optimal. But this design is simple and should work for almost all
existing Kafka users. RLMM is a plugin. Users can replace RocksDB with
their own RLMM implementation if needed. So, I think we can keep RocksDB
for now. What do you think?


Thanks,
Ying



On Tue, Jul 7, 2020 at 10:35 AM Jun Rao  wrote:

> Hi, Ying,
>
> Thanks for the update. It's good to see the progress on this. Please let us
> know when you are done updating the KIP wiki.
>
> Jun
>
> On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:
>
> > Hi Jun,
> >
> > Satish and I have added more design details in the KIP, including how to
> > keep consistency between replicas (especially when there is leadership
> > changes / log truncations) and new metrics. We also made some other minor
> > changes in the doc. We will finish the KIP changes in the next couple of
> > days. We will let you know when we are done. Most of the changes are
> > already updated to the wiki KIP. You can take a look. But it's not the
> > final version yet.
> >
> > As for the implementation, the code is mostly done and we already had
> some
> > feature tests / system tests. I have added the performance test results
> in
> > the KIP. However the recent design changes (e.g. leader epoch info
> > management / log truncation / some of the new metrics) have not been
> > implemented yet. It will take about 2 weeks for us to implement after you
> > review and agree with those design changes.
> >
> >
> >
> > On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
> >
> > > Hi, Satish, Harsha,
> > >
> > > Any new updates on the KIP? This feature is one of the most important
> and
> > > most requested features in Apache Kafka right now. It would be helpful
> if
> > > we can make sustained progress on this. Could you share how far along
> is
> > > the design/implementation right now? Is there anything that other
> people
> > > can help to get it across the line?
> > >
> > > As for "transactional support" and "follower requests/replication", no
> > > further comments from me as long as the producer state and leader epoch
> > can
> > > be restored properly from the object store when needed.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <
> satish.dugg...@gmail.com>
> > > wrote:
> > >
> > > > We did not want to add many implementation details in the KIP. But we
> > > > decided to add them in the KIP as appendix or sub-sections(including
> > > > follower fetch protocol) to describe the flow with the main cases.
> > > > That will answer most of the queries. I will update on this mail
> > > > thread when the respective sections are updated.
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> > > >  wrote:
> > > > >
> > > > > Hi Satish,
> > > > >
> > > > > A couple of questions specific to the section "Follower
> > > > > Requests/Replication", pages 16:17 in the design document [1].
> > > > >
> > > > > 900. It is mentioned that followers fetch auxiliary states from the
> > > > > remote storage.
> > > > >
> > > > > 900.a Does the consistency model of the external storage impacts
> > reads
> > > > > of leader epochs and other auxiliary data?
> > > > >
> > > > > 900.b What are the benefits of using a mechanism to store and
> access
> > > > > the leader epochs which is different from other metadata associated
> > to
> > > > > tiered segments? What are the benefits of retrieving this
> information
> > > > > on-demand from the follower rather than relying on propagation via
> > the
> > > > > topic __remote_log_metadata? What are the advantages over using a
> > > > > dedicated control structure (e.g. a new record type) propagated via
> > > > > this topic? Since in the document, different control paths are
> > > > > operating in the system, how are the metadata stored in
> > > > > __remote_log_metadata [which also include the epoch of the leader
> > > > > which offloaded a segment] and the remote auxiliary states, kept in
> > > > > sync?
> > > > >
> > > > > 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE.
> Is
> > > > > this in response to a Fetch or OffsetForLeaderEpoch request?
> > > > >
> > > > > 900.d What happens if, after a follower encountered an
> > > > > OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
> > > > > leader epochs fail (for instance, because the remote storage is
> > > > > temporarily unavailable)? Does the follower fallbacks to a mode
> where
> > > > > it ignores tiered segments, and applies truncation using only
> locally
> > > > > available information? What happens when access to the remote
> storage
> > > > > is restored? How is the replica 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-12 Thread Satish Duggana
We already mentioned in the KIP that RemoteLogMetadataManager is
pluggable. Users have the option to plug in their own implementation of
RLMM instead of using the default implementation (which is based on topic
storage), and that implementation can be tailored to their remote storage
environments like AWS, GCP, Azure, etc.
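
For illustration only, a minimal sketch of what selecting a custom RLMM could look like in broker configuration. The property names below follow the KIP's naming but may not match the final public configuration exactly, and the plugin class is hypothetical.

import java.util.Properties;

// Illustrative sketch only. Property names follow the KIP's naming convention and may
// differ from the final public configuration; the plugin class below is hypothetical.
public class RemoteStorageConfigSketch {
    public static Properties brokerOverrides() {
        Properties props = new Properties();
        props.setProperty("remote.log.storage.system.enable", "true");
        // When unset, the default topic-based RLMM implementation is used. A user can
        // point this at their own implementation, e.g. one backed by a cloud metadata store.
        props.setProperty("remote.log.metadata.manager.class.name",
                "com.example.tiered.CloudBackedRemoteLogMetadataManager");
        return props;
    }
}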

On Sun, Jul 12, 2020 at 6:36 AM Adam Bellemare  wrote:
>
> My 2 cents -
>
> I agree with Colin. I think that it's important that the metadata not grow
> unbounded without being delegated to external storage. Indefinite long-term
> storage of entity data in Kafka can result in extremely large datasets
> where the vast majority of data is stored in the external tier. I would be
> very disappointed to have the metadata storage be a limiting factor to
> exactly how much data I can store in Kafka. Additionally, and for example,
> I think it's very reasonable that an AWS metadata store could be
> implemented with DynamoDB (key-value store) paired with S3 - faster
> random-access metadata lookup than plain S3, but without needing to rebuild
> rocksDB state locally.
>
>
>
> On Fri, Jul 10, 2020 at 3:57 PM Colin McCabe  wrote:
>
> > Hi all,
> >
> > Thanks for the KIP.
> >
> > I took a look and one thing that stood out to me is that the more metadata
> > we have, the more storage we will need on local disk for the rocksDB
> > database.  This seems like it contradicts some of the goals of the
> > project.  Ideally the space we need on local disk should be related only to
> > the size of the hot set, not the size of the cold set.  It also seems like
> > it could lead to extremely long rocksdb rebuild times if we somehow lose a
> > broker's local storage and have to rebuild it.
> >
> > Instead, I think it would be more reasonable to store cold metadata in the
> > "remote" storage (HDFS, s3, etc.).  Not only does this free up space on the
> > local and avoid long rebuild times, but it also gives us more control over
> > the management of our cache.  With rocksDB we are delegating cache
> > management to an external library that doesn't really understand our
> > use-case.
> >
> > To give a concrete example of how this is bad, imagine that we have 10
> > worker threads and we get  10 requests for something that requires us to
> > fetch cold tiered storage metadata.  Now every worker thread is blocked
> > inside rocksDB and the broker can do nothing until it finishes fetching
> > from disk.  When accessing a remote service like HDFS or S3, in contrast,
> > we would be able to check if the data was in our local cache first.  If it
> > wasn't, we could put the request in a purgatory and activate a background
> > thread to fetch the needed data, and then release the worker thread to be
> > used by some other request.  Having control of our own caching strategy
> > increases observability, maintainability, and performance.
> >
> > I can anticipate a possible counter-argument here: the size of the
> > metadata should be small and usually fully resident in memory anyway.
> > While this is true today, I don't think it will always be true.  The
> > current low limit of a few thousand partitions is not competitive in the
> > long term and needs to be lifted.  We'd like to get to at least a million
> > partitions with KIP-500, and much more later.  Also, when you give people
> > the ability to have unlimited retention, they will want to make use of it.
> > That means lots of historical log segments to track.  This scenario is by
> > no means hypothetical.  Even with the current software, it's easy to think
> > of cases where someone misconfigured the log segment roll settings and
> > overwhelmed the system with segments.  So overall, I like to understand why
> > we want to store metadata on local disk rather than remote, and what the
> > options are for the future.
> >
> > best,
> > Colin
> >
> >
> > On Thu, Jul 9, 2020, at 09:55, Harsha Chintalapani wrote:
> > > Hi Jun,
> > >   Thanks for the replies and feedback on design and giving input.
> > > We are coming close to finish the implementation.
> > > We also did several perf tests as well at our peak production loads and
> > > with tiered storage we didn't see any degradation on write throughputs
> > and
> > > latencies.
> > > Ying already added some of the perf tests results in the KIP itself.
> > >  It will be great if we can get design and code reviews from you
> > > and others in the community as we make progress.
> > > Thanks,
> > > Harsha
> > >
> > > On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:
> > >
> > > > Hi, Ying,
> > > >
> > > > Thanks for the update. It's good to see the progress on this. Please
> > let
> > > > us know when you are done updating the KIP wiki.
> > > >
> > > > Jun
> > > >
> > > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng 
> > wrote:
> > > >
> > > >> Hi Jun,
> > > >>
> > > >> Satish and I have added more design details in the KIP, including how
> > to
> > > >> keep consistency between replicas (especially when there is leadership
> > > >> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-12 Thread Satish Duggana
Hi Colin,
Thanks for looking into the KIP.

I guess you are talking about the option mentioned for a cache used in
the default implementation of RemoteLogMetadataManager.

RocksDB is offered as a cache option for the initial version of the
default implementation because it is simple and works well with the
number of partitions generally deployed on brokers. Each remote log
segment's metadata should not take more than about 200 bytes. With a
segment size of 500MB (the default is 1GB), 10PB of tiered segments needs
only around 4GB of metadata. Generally, large internet-scale companies
have data lakes of around 100s of PBs.
RLMM is not accessed in hot paths but it is used by background threads
to copy/delete remote log segments.
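
To make the estimate above concrete, here is a quick back-of-the-envelope check (a sketch only; the 200-byte figure is the rough per-segment metadata size assumed in this thread):

// Back-of-the-envelope check of the estimate above (assumed figures from this thread:
// ~200 bytes of metadata per remote segment, 500MB segments, 10PB of tiered data).
public class MetadataSizeEstimate {
    public static void main(String[] args) {
        long segmentSizeBytes = 500L * 1024 * 1024;                  // 500 MB per segment
        long tieredBytes = 10L * 1024 * 1024 * 1024 * 1024 * 1024;   // 10 PB of tiered data
        long bytesPerSegmentMetadata = 200L;                         // rough per-segment metadata size

        long segments = tieredBytes / segmentSizeBytes;              // ~21 million segments
        long metadataBytes = segments * bytesPerSegmentMetadata;

        System.out.printf("segments=%d, metadata=%.1f GB%n",
                segments, metadataBytes / (1024.0 * 1024 * 1024));
        // Prints roughly: segments=21474836, metadata=4.0 GB
    }
}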

Our main focus in the initial version is to get the overall
architecture working well and to keep it extensible for future
improvements. We plan to improve this further in a future KIP by
storing this metadata in remote storage and fetching it when needed.

Thanks,
Satish.

On Sat, Jul 11, 2020 at 1:27 AM Colin McCabe  wrote:
>
> Hi all,
>
> Thanks for the KIP.
>
> I took a look and one thing that stood out to me is that the more metadata we 
> have, the more storage we will need on local disk for the rocksDB database.  
> This seems like it contradicts some of the goals of the project.  Ideally the 
> space we need on local disk should be related only to the size of the hot 
> set, not the size of the cold set.  It also seems like it could lead to 
> extremely long rocksdb rebuild times if we somehow lose a broker's local 
> storage and have to rebuild it.
>
> Instead, I think it would be more reasonable to store cold metadata in the 
> "remote" storage (HDFS, s3, etc.).  Not only does this free up space on the 
> local and avoid long rebuild times, but it also gives us more control over 
> the management of our cache.  With rocksDB we are delegating cache management 
> to an external library that doesn't really understand our use-case.
>
> To give a concrete example of how this is bad, imagine that we have 10 worker 
> threads and we get  10 requests for something that requires us to fetch cold 
> tiered storage metadata.  Now every worker thread is blocked inside rocksDB 
> and the broker can do nothing until it finishes fetching from disk.  When 
> accessing a remote service like HDFS or S3, in contrast, we would be able to 
> check if the data was in our local cache first.  If it wasn't, we could put 
> the request in a purgatory and activate a background thread to fetch the 
> needed data, and then release the worker thread to be used by some other 
> request.  Having control of our own caching strategy increases observability, 
> maintainability, and performance.
>
> I can anticipate a possible counter-argument here: the size of the metadata 
> should be small and usually fully resident in memory anyway.  While this is 
> true today, I don't think it will always be true.  The current low limit of a 
> few thousand partitions is not competitive in the long term and needs to be 
> lifted.  We'd like to get to at least a million partitions with KIP-500, and 
> much more later.  Also, when you give people the ability to have unlimited 
> retention, they will want to make use of it.  That means lots of historical 
> log segments to track.  This scenario is by no means hypothetical.  Even with 
> the current software, it's easy to think of cases where someone misconfigured 
> the log segment roll settings and overwhelmed the system with segments.  So 
> overall, I like to understand why we want to store metadata on local disk 
> rather than remote, and what the options are for the future.
>
> best,
> Colin
>
>
> On Thu, Jul 9, 2020, at 09:55, Harsha Chintalapani wrote:
> > Hi Jun,
> >   Thanks for the replies and feedback on design and giving input.
> > We are coming close to finish the implementation.
> > We also did several perf tests as well at our peak production loads and
> > with tiered storage we didn't see any degradation on write throughputs and
> > latencies.
> > Ying already added some of the perf tests results in the KIP itself.
> >  It will be great if we can get design and code reviews from you
> > and others in the community as we make progress.
> > Thanks,
> > Harsha
> >
> > On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:
> >
> > > Hi, Ying,
> > >
> > > Thanks for the update. It's good to see the progress on this. Please let
> > > us know when you are done updating the KIP wiki.
> > >
> > > Jun
> > >
> > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:
> > >
> > >> Hi Jun,
> > >>
> > >> Satish and I have added more design details in the KIP, including how to
> > >> keep consistency between replicas (especially when there is leadership
> > >> changes / log truncations) and new metrics. We also made some other minor
> > >> changes in the doc. We will finish the KIP changes in the next couple of
> > >> days. We will let you know when we are done. Most of the changes are
> > >> already updated to the wiki KIP. 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-11 Thread ignacio gioya
Hi, I want to unsubscribe from this list.
Can I do it?
Please :)

Thank you!
Regards!!

El sáb., 11 jul. 2020 a las 22:06, Adam Bellemare ()
escribió:

> My 2 cents -
>
> I agree with Colin. I think that it's important that the metadata not grow
> unbounded without being delegated to external storage. Indefinite long-term
> storage of entity data in Kafka can result in extremely large datasets
> where the vast majority of data is stored in the external tier. I would be
> very disappointed to have the metadata storage be a limiting factor to
> exactly how much data I can store in Kafka. Additionally, and for example,
> I think it's very reasonable that an AWS metadata store could be
> implemented with DynamoDB (key-value store) paired with S3 - faster
> random-access metadata lookup than plain S3, but without needing to rebuild
> rocksDB state locally.
>
>
>
> On Fri, Jul 10, 2020 at 3:57 PM Colin McCabe  wrote:
>
> > Hi all,
> >
> > Thanks for the KIP.
> >
> > I took a look and one thing that stood out to me is that the more
> metadata
> > we have, the more storage we will need on local disk for the rocksDB
> > database.  This seems like it contradicts some of the goals of the
> > project.  Ideally the space we need on local disk should be related only
> to
> > the size of the hot set, not the size of the cold set.  It also seems
> like
> > it could lead to extremely long rocksdb rebuild times if we somehow lose
> a
> > broker's local storage and have to rebuild it.
> >
> > Instead, I think it would be more reasonable to store cold metadata in
> the
> > "remote" storage (HDFS, s3, etc.).  Not only does this free up space on
> the
> > local and avoid long rebuild times, but it also gives us more control
> over
> > the management of our cache.  With rocksDB we are delegating cache
> > management to an external library that doesn't really understand our
> > use-case.
> >
> > To give a concrete example of how this is bad, imagine that we have 10
> > worker threads and we get  10 requests for something that requires us to
> > fetch cold tiered storage metadata.  Now every worker thread is blocked
> > inside rocksDB and the broker can do nothing until it finishes fetching
> > from disk.  When accessing a remote service like HDFS or S3, in contrast,
> > we would be able to check if the data was in our local cache first.  If
> it
> > wasn't, we could put the request in a purgatory and activate a background
> > thread to fetch the needed data, and then release the worker thread to be
> > used by some other request.  Having control of our own caching strategy
> > increases observability, maintainability, and performance.
> >
> > I can anticipate a possible counter-argument here: the size of the
> > metadata should be small and usually fully resident in memory anyway.
> > While this is true today, I don't think it will always be true.  The
> > current low limit of a few thousand partitions is not competitive in the
> > long term and needs to be lifted.  We'd like to get to at least a million
> > partitions with KIP-500, and much more later.  Also, when you give people
> > the ability to have unlimited retention, they will want to make use of
> it.
> > That means lots of historical log segments to track.  This scenario is by
> > no means hypothetical.  Even with the current software, it's easy to
> think
> > of cases where someone misconfigured the log segment roll settings and
> > overwhelmed the system with segments.  So overall, I like to understand
> why
> > we want to store metadata on local disk rather than remote, and what the
> > options are for the future.
> >
> > best,
> > Colin
> >
> >
> > On Thu, Jul 9, 2020, at 09:55, Harsha Chintalapani wrote:
> > > Hi Jun,
> > >   Thanks for the replies and feedback on design and giving
> input.
> > > We are coming close to finish the implementation.
> > > We also did several perf tests as well at our peak production loads and
> > > with tiered storage we didn't see any degradation on write throughputs
> > and
> > > latencies.
> > > Ying already added some of the perf tests results in the KIP itself.
> > >  It will be great if we can get design and code reviews from
> you
> > > and others in the community as we make progress.
> > > Thanks,
> > > Harsha
> > >
> > > On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:
> > >
> > > > Hi, Ying,
> > > >
> > > > Thanks for the update. It's good to see the progress on this. Please
> > let
> > > > us know when you are done updating the KIP wiki.
> > > >
> > > > Jun
> > > >
> > > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng 
> > wrote:
> > > >
> > > >> Hi Jun,
> > > >>
> > > >> Satish and I have added more design details in the KIP, including
> how
> > to
> > > >> keep consistency between replicas (especially when there is
> leadership
> > > >> changes / log truncations) and new metrics. We also made some other
> > minor
> > > >> changes in the doc. We will finish the KIP changes in the next
> couple
> > of
> > > >> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-11 Thread Adam Bellemare
My 2 cents -

I agree with Colin. I think that it's important that the metadata not grow
unbounded without being delegated to external storage. Indefinite long-term
storage of entity data in Kafka can result in extremely large datasets
where the vast majority of data is stored in the external tier. I would be
very disappointed to have the metadata storage be a limiting factor to
exactly how much data I can store in Kafka. Additionally, as an example,
I think it's very reasonable that an AWS metadata store could be
implemented with DynamoDB (a key-value store) paired with S3 - faster
random-access metadata lookup than plain S3, but without needing to rebuild
RocksDB state locally.
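
Purely as an illustration of that shape of design (the interfaces and names below are hypothetical and are not part of the KIP's API or any AWS SDK): a small key-value index gives fast random access, while the bulk payloads stay in object storage.

import java.util.Optional;

// Hypothetical interfaces standing in for a key-value index (e.g. DynamoDB) and an
// object store (e.g. S3); they are not part of the KIP's API.
interface KeyValueIndex {
    Optional<String> get(String key);            // returns an object-store location, if present
}

interface ObjectStore {
    byte[] fetch(String location);               // downloads the serialized metadata payload
}

// Two-step lookup: fast random-access index first, bulk payload second.
class TieredSegmentMetadataLookup {
    private final KeyValueIndex index;
    private final ObjectStore objectStore;

    TieredSegmentMetadataLookup(KeyValueIndex index, ObjectStore objectStore) {
        this.index = index;
        this.objectStore = objectStore;
    }

    Optional<byte[]> metadataFor(String topicPartition, long segmentBaseOffset) {
        String key = topicPartition + "-" + segmentBaseOffset;
        return index.get(key).map(objectStore::fetch);
    }
}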



On Fri, Jul 10, 2020 at 3:57 PM Colin McCabe  wrote:

> Hi all,
>
> Thanks for the KIP.
>
> I took a look and one thing that stood out to me is that the more metadata
> we have, the more storage we will need on local disk for the rocksDB
> database.  This seems like it contradicts some of the goals of the
> project.  Ideally the space we need on local disk should be related only to
> the size of the hot set, not the size of the cold set.  It also seems like
> it could lead to extremely long rocksdb rebuild times if we somehow lose a
> broker's local storage and have to rebuild it.
>
> Instead, I think it would be more reasonable to store cold metadata in the
> "remote" storage (HDFS, s3, etc.).  Not only does this free up space on the
> local and avoid long rebuild times, but it also gives us more control over
> the management of our cache.  With rocksDB we are delegating cache
> management to an external library that doesn't really understand our
> use-case.
>
> To give a concrete example of how this is bad, imagine that we have 10
> worker threads and we get  10 requests for something that requires us to
> fetch cold tiered storage metadata.  Now every worker thread is blocked
> inside rocksDB and the broker can do nothing until it finishes fetching
> from disk.  When accessing a remote service like HDFS or S3, in contrast,
> we would be able to check if the data was in our local cache first.  If it
> wasn't, we could put the request in a purgatory and activate a background
> thread to fetch the needed data, and then release the worker thread to be
> used by some other request.  Having control of our own caching strategy
> increases observability, maintainability, and performance.
>
> I can anticipate a possible counter-argument here: the size of the
> metadata should be small and usually fully resident in memory anyway.
> While this is true today, I don't think it will always be true.  The
> current low limit of a few thousand partitions is not competitive in the
> long term and needs to be lifted.  We'd like to get to at least a million
> partitions with KIP-500, and much more later.  Also, when you give people
> the ability to have unlimited retention, they will want to make use of it.
> That means lots of historical log segments to track.  This scenario is by
> no means hypothetical.  Even with the current software, it's easy to think
> of cases where someone misconfigured the log segment roll settings and
> overwhelmed the system with segments.  So overall, I like to understand why
> we want to store metadata on local disk rather than remote, and what the
> options are for the future.
>
> best,
> Colin
>
>
> On Thu, Jul 9, 2020, at 09:55, Harsha Chintalapani wrote:
> > Hi Jun,
> >   Thanks for the replies and feedback on design and giving input.
> > We are coming close to finish the implementation.
> > We also did several perf tests as well at our peak production loads and
> > with tiered storage we didn't see any degradation on write throughputs
> and
> > latencies.
> > Ying already added some of the perf tests results in the KIP itself.
> >  It will be great if we can get design and code reviews from you
> > and others in the community as we make progress.
> > Thanks,
> > Harsha
> >
> > On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:
> >
> > > Hi, Ying,
> > >
> > > Thanks for the update. It's good to see the progress on this. Please
> let
> > > us know when you are done updating the KIP wiki.
> > >
> > > Jun
> > >
> > > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng 
> wrote:
> > >
> > >> Hi Jun,
> > >>
> > >> Satish and I have added more design details in the KIP, including how
> to
> > >> keep consistency between replicas (especially when there is leadership
> > >> changes / log truncations) and new metrics. We also made some other
> minor
> > >> changes in the doc. We will finish the KIP changes in the next couple
> of
> > >> days. We will let you know when we are done. Most of the changes are
> > >> already updated to the wiki KIP. You can take a look. But it's not the
> > >> final version yet.
> > >>
> > >> As for the implementation, the code is mostly done and we already had
> some
> > >> feature tests / system tests. I have added the performance test
> results in
> > >> the KIP. However the recent 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-10 Thread Colin McCabe
Hi all,

Thanks for the KIP.

I took a look and one thing that stood out to me is that the more metadata we 
have, the more storage we will need on local disk for the rocksDB database.  
This seems like it contradicts some of the goals of the project.  Ideally the 
space we need on local disk should be related only to the size of the hot set, 
not the size of the cold set.  It also seems like it could lead to extremely 
long rocksdb rebuild times if we somehow lose a broker's local storage and have 
to rebuild it.

Instead, I think it would be more reasonable to store cold metadata in the 
"remote" storage (HDFS, s3, etc.).  Not only does this free up space on the 
local and avoid long rebuild times, but it also gives us more control over the 
management of our cache.  With rocksDB we are delegating cache management to an 
external library that doesn't really understand our use-case.

To give a concrete example of how this is bad, imagine that we have 10 worker 
threads and we get  10 requests for something that requires us to fetch cold 
tiered storage metadata.  Now every worker thread is blocked inside rocksDB and 
the broker can do nothing until it finishes fetching from disk.  When accessing 
a remote service like HDFS or S3, in contrast, we would be able to check if the 
data was in our local cache first.  If it wasn't, we could put the request in a 
purgatory and activate a background thread to fetch the needed data, and then 
release the worker thread to be used by some other request.  Having control of 
our own caching strategy increases observability, maintainability, and 
performance.
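
A minimal sketch of the non-blocking pattern described above, assuming illustrative names rather than the broker's actual request-handling classes: check a local cache first and, on a miss, hand the remote fetch to a background pool so the worker thread is released while the request waits.

import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Function;

// Illustrative only: not the broker's actual purgatory/request-handler classes.
class ColdMetadataCache<K, V> {
    private final Map<K, V> hot = new ConcurrentHashMap<>();
    private final ExecutorService remoteFetchPool = Executors.newFixedThreadPool(2);
    private final Function<K, V> remoteFetcher;   // e.g. a read from HDFS/S3

    ColdMetadataCache(Function<K, V> remoteFetcher) {
        this.remoteFetcher = remoteFetcher;
    }

    /**
     * Returns immediately. A hit completes the future on the caller's thread;
     * a miss schedules the remote read on the background pool, so the request
     * can sit in a "purgatory" until the future completes.
     */
    CompletableFuture<V> get(K key) {
        V cached = hot.get(key);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        return CompletableFuture.supplyAsync(() -> {
            V value = remoteFetcher.apply(key);
            hot.put(key, value);                  // populate the cache for later requests
            return value;
        }, remoteFetchPool);
    }
}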

I can anticipate a possible counter-argument here: the size of the metadata 
should be small and usually fully resident in memory anyway.  While this is 
true today, I don't think it will always be true.  The current low limit of a 
few thousand partitions is not competitive in the long term and needs to be 
lifted.  We'd like to get to at least a million partitions with KIP-500, and 
much more later.  Also, when you give people the ability to have unlimited 
retention, they will want to make use of it.  That means lots of historical log 
segments to track.  This scenario is by no means hypothetical.  Even with the 
current software, it's easy to think of cases where someone misconfigured the 
log segment roll settings and overwhelmed the system with segments.  So 
overall, I'd like to understand why we want to store metadata on local disk 
rather than remote, and what the options are for the future.

best,
Colin


On Thu, Jul 9, 2020, at 09:55, Harsha Chintalapani wrote:
> Hi Jun,
>   Thanks for the replies and feedback on design and giving input.
> We are coming close to finish the implementation.
> We also did several perf tests as well at our peak production loads and
> with tiered storage we didn't see any degradation on write throughputs and
> latencies.
> Ying already added some of the perf tests results in the KIP itself.
>  It will be great if we can get design and code reviews from you
> and others in the community as we make progress.
> Thanks,
> Harsha
> 
> On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:
> 
> > Hi, Ying,
> >
> > Thanks for the update. It's good to see the progress on this. Please let
> > us know when you are done updating the KIP wiki.
> >
> > Jun
> >
> > On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:
> >
> >> Hi Jun,
> >>
> >> Satish and I have added more design details in the KIP, including how to
> >> keep consistency between replicas (especially when there is leadership
> >> changes / log truncations) and new metrics. We also made some other minor
> >> changes in the doc. We will finish the KIP changes in the next couple of
> >> days. We will let you know when we are done. Most of the changes are
> >> already updated to the wiki KIP. You can take a look. But it's not the
> >> final version yet.
> >>
> >> As for the implementation, the code is mostly done and we already had some
> >> feature tests / system tests. I have added the performance test results in
> >> the KIP. However the recent design changes (e.g. leader epoch info
> >> management / log truncation / some of the new metrics) have not been
> >> implemented yet. It will take about 2 weeks for us to implement after you
> >> review and agree with those design changes.
> >>
> >>
> >>
> >> On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
> >>
> >> > Hi, Satish, Harsha,
> >> >
> >> > Any new updates on the KIP? This feature is one of the most important
> >> and
> >> > most requested features in Apache Kafka right now. It would be helpful
> >> if
> >> > we can make sustained progress on this. Could you share how far along is
> >> > the design/implementation right now? Is there anything that other people
> >> > can help to get it across the line?
> >> >
> >> > As for "transactional support" and "follower requests/replication", no
> >> > further comments from me as long as the producer state and leader epoch
> >> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-09 Thread Harsha Chintalapani
Hi Jun,
  Thanks for the replies and for your feedback and input on the design.
We are coming close to finishing the implementation.
We also ran several perf tests at our peak production loads, and
with tiered storage we didn't see any degradation in write throughputs or
latencies.
Ying has already added some of the perf test results to the KIP itself.
 It will be great if we can get design and code reviews from you
and others in the community as we make progress.
Thanks,
Harsha

On Tue, Jul 7, 2020 at 10:34 AM Jun Rao  wrote:

> Hi, Ying,
>
> Thanks for the update. It's good to see the progress on this. Please let
> us know when you are done updating the KIP wiki.
>
> Jun
>
> On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:
>
>> Hi Jun,
>>
>> Satish and I have added more design details in the KIP, including how to
>> keep consistency between replicas (especially when there is leadership
>> changes / log truncations) and new metrics. We also made some other minor
>> changes in the doc. We will finish the KIP changes in the next couple of
>> days. We will let you know when we are done. Most of the changes are
>> already updated to the wiki KIP. You can take a look. But it's not the
>> final version yet.
>>
>> As for the implementation, the code is mostly done and we already had some
>> feature tests / system tests. I have added the performance test results in
>> the KIP. However the recent design changes (e.g. leader epoch info
>> management / log truncation / some of the new metrics) have not been
>> implemented yet. It will take about 2 weeks for us to implement after you
>> review and agree with those design changes.
>>
>>
>>
>> On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
>>
>> > Hi, Satish, Harsha,
>> >
>> > Any new updates on the KIP? This feature is one of the most important
>> and
>> > most requested features in Apache Kafka right now. It would be helpful
>> if
>> > we can make sustained progress on this. Could you share how far along is
>> > the design/implementation right now? Is there anything that other people
>> > can help to get it across the line?
>> >
>> > As for "transactional support" and "follower requests/replication", no
>> > further comments from me as long as the producer state and leader epoch
>> can
>> > be restored properly from the object store when needed.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana > >
>> > wrote:
>> >
>> > > We did not want to add many implementation details in the KIP. But we
>> > > decided to add them in the KIP as appendix or sub-sections(including
>> > > follower fetch protocol) to describe the flow with the main cases.
>> > > That will answer most of the queries. I will update on this mail
>> > > thread when the respective sections are updated.
>> > >
>> > > Thanks,
>> > > Satish.
>> > >
>> > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
>> > >  wrote:
>> > > >
>> > > > Hi Satish,
>> > > >
>> > > > A couple of questions specific to the section "Follower
>> > > > Requests/Replication", pages 16:17 in the design document [1].
>> > > >
>> > > > 900. It is mentioned that followers fetch auxiliary states from the
>> > > > remote storage.
>> > > >
>> > > > 900.a Does the consistency model of the external storage impacts
>> reads
>> > > > of leader epochs and other auxiliary data?
>> > > >
>> > > > 900.b What are the benefits of using a mechanism to store and access
>> > > > the leader epochs which is different from other metadata associated
>> to
>> > > > tiered segments? What are the benefits of retrieving this
>> information
>> > > > on-demand from the follower rather than relying on propagation via
>> the
>> > > > topic __remote_log_metadata? What are the advantages over using a
>> > > > dedicated control structure (e.g. a new record type) propagated via
>> > > > this topic? Since in the document, different control paths are
>> > > > operating in the system, how are the metadata stored in
>> > > > __remote_log_metadata [which also include the epoch of the leader
>> > > > which offloaded a segment] and the remote auxiliary states, kept in
>> > > > sync?
>> > > >
>> > > > 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
>> > > > this in response to a Fetch or OffsetForLeaderEpoch request?
>> > > >
>> > > > 900.d What happens if, after a follower encountered an
>> > > > OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
>> > > > leader epochs fail (for instance, because the remote storage is
>> > > > temporarily unavailable)? Does the follower fallbacks to a mode
>> where
>> > > > it ignores tiered segments, and applies truncation using only
>> locally
>> > > > available information? What happens when access to the remote
>> storage
>> > > > is restored? How is the replica lineage inferred by the remote
>> leader
>> > > > epochs reconciled with the follower's replica lineage, which has
>> > > > evolved? Does the follower remember fetching auxiliary states failed

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-07 Thread Jun Rao
Hi, Ying,

Thanks for the update. It's good to see the progress on this. Please let us
know when you are done updating the KIP wiki.

Jun

On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng  wrote:

> Hi Jun,
>
> Satish and I have added more design details in the KIP, including how to
> keep consistency between replicas (especially when there is leadership
> changes / log truncations) and new metrics. We also made some other minor
> changes in the doc. We will finish the KIP changes in the next couple of
> days. We will let you know when we are done. Most of the changes are
> already updated to the wiki KIP. You can take a look. But it's not the
> final version yet.
>
> As for the implementation, the code is mostly done and we already had some
> feature tests / system tests. I have added the performance test results in
> the KIP. However the recent design changes (e.g. leader epoch info
> management / log truncation / some of the new metrics) have not been
> implemented yet. It will take about 2 weeks for us to implement after you
> review and agree with those design changes.
>
>
>
> On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:
>
> > Hi, Satish, Harsha,
> >
> > Any new updates on the KIP? This feature is one of the most important and
> > most requested features in Apache Kafka right now. It would be helpful if
> > we can make sustained progress on this. Could you share how far along is
> > the design/implementation right now? Is there anything that other people
> > can help to get it across the line?
> >
> > As for "transactional support" and "follower requests/replication", no
> > further comments from me as long as the producer state and leader epoch
> can
> > be restored properly from the object store when needed.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana 
> > wrote:
> >
> > > We did not want to add many implementation details in the KIP. But we
> > > decided to add them in the KIP as appendix or sub-sections(including
> > > follower fetch protocol) to describe the flow with the main cases.
> > > That will answer most of the queries. I will update on this mail
> > > thread when the respective sections are updated.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> > >  wrote:
> > > >
> > > > Hi Satish,
> > > >
> > > > A couple of questions specific to the section "Follower
> > > > Requests/Replication", pages 16:17 in the design document [1].
> > > >
> > > > 900. It is mentioned that followers fetch auxiliary states from the
> > > > remote storage.
> > > >
> > > > 900.a Does the consistency model of the external storage impacts
> reads
> > > > of leader epochs and other auxiliary data?
> > > >
> > > > 900.b What are the benefits of using a mechanism to store and access
> > > > the leader epochs which is different from other metadata associated
> to
> > > > tiered segments? What are the benefits of retrieving this information
> > > > on-demand from the follower rather than relying on propagation via
> the
> > > > topic __remote_log_metadata? What are the advantages over using a
> > > > dedicated control structure (e.g. a new record type) propagated via
> > > > this topic? Since in the document, different control paths are
> > > > operating in the system, how are the metadata stored in
> > > > __remote_log_metadata [which also include the epoch of the leader
> > > > which offloaded a segment] and the remote auxiliary states, kept in
> > > > sync?
> > > >
> > > > 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
> > > > this in response to a Fetch or OffsetForLeaderEpoch request?
> > > >
> > > > 900.d What happens if, after a follower encountered an
> > > > OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
> > > > leader epochs fail (for instance, because the remote storage is
> > > > temporarily unavailable)? Does the follower fallbacks to a mode where
> > > > it ignores tiered segments, and applies truncation using only locally
> > > > available information? What happens when access to the remote storage
> > > > is restored? How is the replica lineage inferred by the remote leader
> > > > epochs reconciled with the follower's replica lineage, which has
> > > > evolved? Does the follower remember fetching auxiliary states failed
> > > > in the past and attempt reconciliation? Is there a plan to offer
> > > > different strategies in this scenario, configurable via
> configuration?
> > > >
> > > > 900.e Is the leader epoch cache offloaded with every segment? Or when
> > > > a new checkpoint is detected? If that information is not always
> > > > offloaded to avoid duplicating data, how does the remote storage
> > > > satisfy the request to retrieve it?
> > > >
> > > > 900.f Since the leader epoch cache covers the entire replica lineage,
> > > > what happens if, after a leader epoch cache file is offloaded with a
> > > > given segment, the local epoch cache is truncated [not necessarily
> for
> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-07 Thread Ying Zheng
Hi Jun,

Satish and I have added more design details in the KIP, including how to
keep consistency between replicas (especially when there are leadership
changes / log truncations) and new metrics. We also made some other minor
changes in the doc. We will finish the KIP changes in the next couple of
days. We will let you know when we are done. Most of the changes are
already updated in the wiki KIP. You can take a look. But it's not the
final version yet.

As for the implementation, the code is mostly done and we already have some
feature tests / system tests. I have added the performance test results in
the KIP. However, the recent design changes (e.g. leader epoch info
management / log truncation / some of the new metrics) have not been
implemented yet. It will take about 2 weeks for us to implement them after you
review and agree with those design changes.



On Tue, Jul 7, 2020 at 9:23 AM Jun Rao  wrote:

> Hi, Satish, Harsha,
>
> Any new updates on the KIP? This feature is one of the most important and
> most requested features in Apache Kafka right now. It would be helpful if
> we can make sustained progress on this. Could you share how far along is
> the design/implementation right now? Is there anything that other people
> can help to get it across the line?
>
> As for "transactional support" and "follower requests/replication", no
> further comments from me as long as the producer state and leader epoch can
> be restored properly from the object store when needed.
>
> Thanks,
>
> Jun
>
> On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana 
> wrote:
>
> > We did not want to add many implementation details in the KIP. But we
> > decided to add them in the KIP as appendix or sub-sections(including
> > follower fetch protocol) to describe the flow with the main cases.
> > That will answer most of the queries. I will update on this mail
> > thread when the respective sections are updated.
> >
> > Thanks,
> > Satish.
> >
> > On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
> >  wrote:
> > >
> > > Hi Satish,
> > >
> > > A couple of questions specific to the section "Follower
> > > Requests/Replication", pages 16:17 in the design document [1].
> > >
> > > 900. It is mentioned that followers fetch auxiliary states from the
> > > remote storage.
> > >
> > > 900.a Does the consistency model of the external storage impacts reads
> > > of leader epochs and other auxiliary data?
> > >
> > > 900.b What are the benefits of using a mechanism to store and access
> > > the leader epochs which is different from other metadata associated to
> > > tiered segments? What are the benefits of retrieving this information
> > > on-demand from the follower rather than relying on propagation via the
> > > topic __remote_log_metadata? What are the advantages over using a
> > > dedicated control structure (e.g. a new record type) propagated via
> > > this topic? Since in the document, different control paths are
> > > operating in the system, how are the metadata stored in
> > > __remote_log_metadata [which also include the epoch of the leader
> > > which offloaded a segment] and the remote auxiliary states, kept in
> > > sync?
> > >
> > > 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
> > > this in response to a Fetch or OffsetForLeaderEpoch request?
> > >
> > > 900.d What happens if, after a follower encountered an
> > > OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
> > > leader epochs fail (for instance, because the remote storage is
> > > temporarily unavailable)? Does the follower fallbacks to a mode where
> > > it ignores tiered segments, and applies truncation using only locally
> > > available information? What happens when access to the remote storage
> > > is restored? How is the replica lineage inferred by the remote leader
> > > epochs reconciled with the follower's replica lineage, which has
> > > evolved? Does the follower remember fetching auxiliary states failed
> > > in the past and attempt reconciliation? Is there a plan to offer
> > > different strategies in this scenario, configurable via configuration?
> > >
> > > 900.e Is the leader epoch cache offloaded with every segment? Or when
> > > a new checkpoint is detected? If that information is not always
> > > offloaded to avoid duplicating data, how does the remote storage
> > > satisfy the request to retrieve it?
> > >
> > > 900.f Since the leader epoch cache covers the entire replica lineage,
> > > what happens if, after a leader epoch cache file is offloaded with a
> > > given segment, the local epoch cache is truncated [not necessarily for
> > > a range of offset included in tiered segments]? How are remote and
> > > local leader epoch caches kept consistent?
> > >
> > > 900.g Consumer can also use leader epochs (e.g. to enable fencing to
> > > protect against stale leaders). What differences would there be
> > > between consumer and follower fetches? Especially, would consumers
> > > also fetch leader epoch information from 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-07-07 Thread Jun Rao
Hi, Satish, Harsha,

Any new updates on the KIP? This feature is one of the most important and
most requested features in Apache Kafka right now. It would be helpful if
we can make sustained progress on this. Could you share how far along
the design/implementation is right now? Is there anything that other people
can help with to get it across the line?

As for "transactional support" and "follower requests/replication", no
further comments from me as long as the producer state and leader epoch can
be restored properly from the object store when needed.

Thanks,

Jun

On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana 
wrote:

> We did not want to add many implementation details in the KIP. But we
> decided to add them in the KIP as appendix or sub-sections(including
> follower fetch protocol) to describe the flow with the main cases.
> That will answer most of the queries. I will update on this mail
> thread when the respective sections are updated.
>
> Thanks,
> Satish.
>
> On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
>  wrote:
> >
> > Hi Satish,
> >
> > A couple of questions specific to the section "Follower
> > Requests/Replication", pages 16:17 in the design document [1].
> >
> > 900. It is mentioned that followers fetch auxiliary states from the
> > remote storage.
> >
> > 900.a Does the consistency model of the external storage impacts reads
> > of leader epochs and other auxiliary data?
> >
> > 900.b What are the benefits of using a mechanism to store and access
> > the leader epochs which is different from other metadata associated to
> > tiered segments? What are the benefits of retrieving this information
> > on-demand from the follower rather than relying on propagation via the
> > topic __remote_log_metadata? What are the advantages over using a
> > dedicated control structure (e.g. a new record type) propagated via
> > this topic? Since in the document, different control paths are
> > operating in the system, how are the metadata stored in
> > __remote_log_metadata [which also include the epoch of the leader
> > which offloaded a segment] and the remote auxiliary states, kept in
> > sync?
> >
> > 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
> > this in response to a Fetch or OffsetForLeaderEpoch request?
> >
> > 900.d What happens if, after a follower encountered an
> > OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
> > leader epochs fail (for instance, because the remote storage is
> > temporarily unavailable)? Does the follower fallbacks to a mode where
> > it ignores tiered segments, and applies truncation using only locally
> > available information? What happens when access to the remote storage
> > is restored? How is the replica lineage inferred by the remote leader
> > epochs reconciled with the follower's replica lineage, which has
> > evolved? Does the follower remember fetching auxiliary states failed
> > in the past and attempt reconciliation? Is there a plan to offer
> > different strategies in this scenario, configurable via configuration?
> >
> > 900.e Is the leader epoch cache offloaded with every segment? Or when
> > a new checkpoint is detected? If that information is not always
> > offloaded to avoid duplicating data, how does the remote storage
> > satisfy the request to retrieve it?
> >
> > 900.f Since the leader epoch cache covers the entire replica lineage,
> > what happens if, after a leader epoch cache file is offloaded with a
> > given segment, the local epoch cache is truncated [not necessarily for
> > a range of offset included in tiered segments]? How are remote and
> > local leader epoch caches kept consistent?
> >
> > 900.g Consumer can also use leader epochs (e.g. to enable fencing to
> > protect against stale leaders). What differences would there be
> > between consumer and follower fetches? Especially, would consumers
> > also fetch leader epoch information from the remote storage?
> >
> > 900.h Assume a newly elected leader of a topic-partition detects more
> > recent segments are available in the external storage, with epochs >
> > its local epoch. Does it ignore these segments and their associated
> > epoch-to-offset vectors? Or try to reconstruct its local replica
> > lineage based on the data remotely available?
> >
> > Thanks,
> > Alexandre
> >
> > [1]
> https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing
> >
> > Le jeu. 4 juin 2020 à 19:55, Satish Duggana 
> a écrit :
> > >
> > > Hi Jun,
> > > Please let us know if you have any comments on "transactional support"
> > > and "follower requests/replication" mentioned in the wiki.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana <
> satish.dugg...@gmail.com> wrote:
> > > >
> > > > Thanks Jun for your comments.
> > > >
> > > > >100. It would be useful to provide more details on how those apis
> are used. Otherwise, it's kind of hard to really assess whether the new
> apis are 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-06-09 Thread Satish Duggana
We did not want to add many implementation details in the KIP. But we
decided to add them to the KIP as an appendix or sub-sections (including the
follower fetch protocol) to describe the flow for the main cases.
That should answer most of the queries. I will update this mail
thread when the respective sections are updated.

Thanks,
Satish.

On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez
 wrote:
>
> Hi Satish,
>
> A couple of questions specific to the section "Follower
> Requests/Replication", pages 16:17 in the design document [1].
>
> 900. It is mentioned that followers fetch auxiliary states from the
> remote storage.
>
> 900.a Does the consistency model of the external storage impacts reads
> of leader epochs and other auxiliary data?
>
> 900.b What are the benefits of using a mechanism to store and access
> the leader epochs which is different from other metadata associated to
> tiered segments? What are the benefits of retrieving this information
> on-demand from the follower rather than relying on propagation via the
> topic __remote_log_metadata? What are the advantages over using a
> dedicated control structure (e.g. a new record type) propagated via
> this topic? Since in the document, different control paths are
> operating in the system, how are the metadata stored in
> __remote_log_metadata [which also include the epoch of the leader
> which offloaded a segment] and the remote auxiliary states, kept in
> sync?
>
> 900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
> this in response to a Fetch or OffsetForLeaderEpoch request?
>
> 900.d What happens if, after a follower encountered an
> OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
> leader epochs fail (for instance, because the remote storage is
> temporarily unavailable)? Does the follower fallbacks to a mode where
> it ignores tiered segments, and applies truncation using only locally
> available information? What happens when access to the remote storage
> is restored? How is the replica lineage inferred by the remote leader
> epochs reconciled with the follower's replica lineage, which has
> evolved? Does the follower remember fetching auxiliary states failed
> in the past and attempt reconciliation? Is there a plan to offer
> different strategies in this scenario, configurable via configuration?
>
> 900.e Is the leader epoch cache offloaded with every segment? Or when
> a new checkpoint is detected? If that information is not always
> offloaded to avoid duplicating data, how does the remote storage
> satisfy the request to retrieve it?
>
> 900.f Since the leader epoch cache covers the entire replica lineage,
> what happens if, after a leader epoch cache file is offloaded with a
> given segment, the local epoch cache is truncated [not necessarily for
> a range of offset included in tiered segments]? How are remote and
> local leader epoch caches kept consistent?
>
> 900.g Consumers can also use leader epochs (e.g. to enable fencing to
> protect against stale leaders). What differences would there be
> between consumer and follower fetches? Especially, would consumers
> also fetch leader epoch information from the remote storage?
>
> 900.h Assume a newly elected leader of a topic-partition detects more
> recent segments are available in the external storage, with epochs >
> its local epoch. Does it ignore these segments and their associated
> epoch-to-offset vectors? Or try to reconstruct its local replica
> lineage based on the data remotely available?
>
> Thanks,
> Alexandre
>
> [1] 
> https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing
>
> On Thu, Jun 4, 2020 at 19:55, Satish Duggana  wrote:
> >
> > Hi Jun,
> > Please let us know if you have any comments on "transactional support"
> > and "follower requests/replication" mentioned in the wiki.
> >
> > Thanks,
> > Satish.
> >
> > On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana  
> > wrote:
> > >
> > > Thanks Jun for your comments.
> > >
> > > >100. It would be useful to provide more details on how those apis are 
> > > >used. Otherwise, it's kind of hard to really assess whether the new apis 
> > > >are sufficient/redundant. A few examples below.
> > >
> > > We will update the wiki and let you know.
> > >
> > > >100.1 deleteRecords seems to only advance the logStartOffset in Log. How 
> > > >does that trigger the deletion of remote log segments?
> > >
> > > RLMTask for leader partition periodically checks whether there are
> > > remote log segments earlier to logStartOffset and the respective
> > > remote log segment metadata and data are deleted by using RLMM and
> > > RSM.
> > >
> > > >100.2 stopReplica with deletion is used in 2 cases (a) replica 
> > > >reassignment; (b) topic deletion. We only want to delete the tiered 
> > > >metadata in the second case. Also, in the second case, who initiates the 
> > > >deletion of the remote segment since the leader may not exist?
> > >
> > > Right, it is deleted 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-06-06 Thread Alexandre Dupriez
Hi Satish,

A couple of questions specific to the section "Follower
Requests/Replication", pages 16:17 in the design document [1].

900. It is mentioned that followers fetch auxiliary states from the
remote storage.

900.a Does the consistency model of the external storage impact reads
of leader epochs and other auxiliary data?

900.b What are the benefits of using a mechanism to store and access
the leader epochs which is different from other metadata associated with
tiered segments? What are the benefits of retrieving this information
on-demand from the follower rather than relying on propagation via the
topic __remote_log_metadata? What are the advantages over using a
dedicated control structure (e.g. a new record type) propagated via
this topic? Since in the document, different control paths are
operating in the system, how are the metadata stored in
__remote_log_metadata [which also include the epoch of the leader
which offloaded a segment] and the remote auxiliary states, kept in
sync?

900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is
this in response to a Fetch or OffsetForLeaderEpoch request?

900.d What happens if, after a follower encountered an
OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve
leader epochs fail (for instance, because the remote storage is
temporarily unavailable)? Does the follower fall back to a mode where
it ignores tiered segments, and applies truncation using only locally
available information? What happens when access to the remote storage
is restored? How is the replica lineage inferred from the remote leader
epochs reconciled with the follower's replica lineage, which has
evolved? Does the follower remember that fetching auxiliary states failed
in the past and attempt reconciliation? Is there a plan to offer
different strategies in this scenario, selectable via configuration?

900.e Is the leader epoch cache offloaded with every segment? Or when
a new checkpoint is detected? If that information is not always
offloaded to avoid duplicating data, how does the remote storage
satisfy the request to retrieve it?

900.f Since the leader epoch cache covers the entire replica lineage,
what happens if, after a leader epoch cache file is offloaded with a
given segment, the local epoch cache is truncated [not necessarily for
a range of offset included in tiered segments]? How are remote and
local leader epoch caches kept consistent?

900.g Consumers can also use leader epochs (e.g. to enable fencing to
protect against stale leaders). What differences would there be
between consumer and follower fetches? Especially, would consumers
also fetch leader epoch information from the remote storage?

900.h Assume a newly elected leader of a topic-partition detects more
recent segments are available in the external storage, with epochs >
its local epoch. Does it ignore these segments and their associated
epoch-to-offset vectors? Or try to reconstruct its local replica
lineage based on the data remotely available?

Thanks,
Alexandre

[1] 
https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing

On Thu, Jun 4, 2020 at 19:55, Satish Duggana  wrote:
>
> Hi Jun,
> Please let us know if you have any comments on "transactional support"
> and "follower requests/replication" mentioned in the wiki.
>
> Thanks,
> Satish.
>
> On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana  
> wrote:
> >
> > Thanks Jun for your comments.
> >
> > >100. It would be useful to provide more details on how those apis are 
> > >used. Otherwise, it's kind of hard to really assess whether the new apis 
> > >are sufficient/redundant. A few examples below.
> >
> > We will update the wiki and let you know.
> >
> > >100.1 deleteRecords seems to only advance the logStartOffset in Log. How 
> > >does that trigger the deletion of remote log segments?
> >
> > RLMTask for leader partition periodically checks whether there are
> > remote log segments earlier to logStartOffset and the respective
> > remote log segment metadata and data are deleted by using RLMM and
> > RSM.
> >
> > >100.2 stopReplica with deletion is used in 2 cases (a) replica 
> > >reassignment; (b) topic deletion. We only want to delete the tiered 
> > >metadata in the second case. Also, in the second case, who initiates the 
> > >deletion of the remote segment since the leader may not exist?
> >
> > Right, it is deleted only in the case of topic deletion. We will cover
> > the details in the KIP.
> >
> > >100.3 "LogStartOffset of a topic can be either in local or in remote 
> > >storage." If LogStartOffset exists in both places, which one is the source 
> > >of truth?
> >
> > I meant the logStartOffset can point to either of local segment or
> > remote segment but it is initialised and maintained in the Log class
> > like now.
> >
> > >100.4 List listRemoteLogSegments(TopicPartition 
> > >topicPartition, long minOffset): How is minOffset supposed to be used?
> >
> > Returns list of remote segments, 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-06-04 Thread Satish Duggana
Hi Jun,
Please let us know if you have any comments on "transactional support"
and "follower requests/replication" mentioned in the wiki.

Thanks,
Satish.

On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana  wrote:
>
> Thanks Jun for your comments.
>
> >100. It would be useful to provide more details on how those apis are used. 
> >Otherwise, it's kind of hard to really assess whether the new apis are 
> >sufficient/redundant. A few examples below.
>
> We will update the wiki and let you know.
>
> >100.1 deleteRecords seems to only advance the logStartOffset in Log. How 
> >does that trigger the deletion of remote log segments?
>
> RLMTask for leader partition periodically checks whether there are
> remote log segments earlier to logStartOffset and the respective
> remote log segment metadata and data are deleted by using RLMM and
> RSM.
>
> >100.2 stopReplica with deletion is used in 2 cases (a) replica reassignment; 
> >(b) topic deletion. We only want to delete the tiered metadata in the second 
> >case. Also, in the second case, who initiates the deletion of the remote 
> >segment since the leader may not exist?
>
> Right, it is deleted only in the case of topic deletion. We will cover
> the details in the KIP.
>
> >100.3 "LogStartOffset of a topic can be either in local or in remote 
> >storage." If LogStartOffset exists in both places, which one is the source 
> >of truth?
>
> I meant the logStartOffset can point to either of local segment or
> remote segment but it is initialised and maintained in the Log class
> like now.
>
> >100.4 List listRemoteLogSegments(TopicPartition 
> >topicPartition, long minOffset): How is minOffset supposed to be used?
>
> Returns list of remote segments, sorted by baseOffset in ascending
> order that have baseOffset >= the given min Offset.
>
> >100.5 When copying a segment to remote storage, it seems we are calling the 
> >same RLMM.putRemoteLogSegmentData() twice before and after copyLogSegment(). 
> >Could you explain why?
>
> This is more about prepare/commit/rollback as you suggested.  We will
> update the wiki with the new APIs.
>
> >100.6 LogSegmentData includes leaderEpochCache, but there is no api in 
> >RemoteStorageManager to retrieve it.
>
> Nice catch, copy/paste issue. There is an API to retrieve it.
>
> >101. If the __remote_log_metadata is for production usage, could you provide 
> >more details? For example, what is the schema of the data (both key and 
> >value)? How is the topic maintained,delete or compact?
>
> It is with delete config and its retention period is suggested to be
> more than the remote retention period.
>
> >110. Is the cache implementation in RemoteLogMetadataManager meant for 
> >production usage? If so, could you provide more details on the schema and 
> >how/where the data is stored?
>
> The proposal is to have a cache (with default implementation backed by
> rocksdb) but it will be added in later versions. We will add this to
> future work items.
>
> >111. "Committed offsets can be stored in a local file". Could you describe 
> >the format of the file and where it's stored?
>
> We will cover this in the KIP.
>
> >112. Truncation of remote segments under unclean leader election: I am not 
> >sure who figures out the truncated remote segments and how that information 
> >is propagated to all replicas?
>
> We will add this in detail in the KIP.
>
> >113. "If there are any failures in removing remote log segments then those 
> >are stored in a specific topic (default as 
> >__remote_segments_to_be_deleted)". Is it necessary to add yet another 
> >internal topic? Could we just keep retrying?
>
> This is not really an internal topic, it will be exposed as a user
> configurable topic. After a few retries, we want user to know about
> the failure so that they can take an action later by consuming from
> this topic. We want to keep this simple instead of retrying
> continuously and maintaining the deletion state etc.
>
> >114. "We may not need to copy producer-id-snapshot as we are copying only 
> >segments earlier to last-stable-offset." Hmm, not sure about that. The 
> >producer snapshot includes things like the last timestamp of each open 
> >producer id and can affect when those producer ids are expired.
>
> Sure, this will be added as part of the LogSegmentData.
>
> Thanks,
> Satish.
>
>
> On Fri, May 29, 2020 at 6:39 AM Jun Rao  wrote:
> >
> > Hi, Satish,
> >
> > Made another pass on the wiki. A few more comments below.
> >
> > 100. It would be useful to provide more details on how those apis are used. 
> > Otherwise, it's kind of hard to really assess whether the new apis are 
> > sufficient/redundant. A few examples below.
> > 100.1 deleteRecords seems to only advance the logStartOffset in Log. How 
> > does that trigger the deletion of remote log segments?
> > 100.2 stopReplica with deletion is used in 2 cases (a) replica 
> > reassignment; (b) topic deletion. We only want to delete the tiered 
> > metadata in the second case. 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-06-02 Thread Satish Duggana
Thanks Jun for your comments.

>100. It would be useful to provide more details on how those apis are used. 
>Otherwise, it's kind of hard to really assess whether the new apis are 
>sufficient/redundant. A few examples below.

We will update the wiki and let you know.

>100.1 deleteRecords seems to only advance the logStartOffset in Log. How does 
>that trigger the deletion of remote log segments?

The RLMTask for a leader partition periodically checks whether there are
remote log segments with offsets earlier than logStartOffset; the respective
remote log segment metadata and data are then deleted by using RLMM and
RSM.
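
To illustrate the flow, here is a minimal sketch of that periodic check. The
type and method names (RemoteLogMetadataManager, RemoteStorageManager,
RemoteLogSegmentMetadata, listRemoteLogSegments, deleteLogSegmentData,
deleteRemoteLogSegmentMetadata) are illustrative assumptions for this mail,
not the final KIP APIs:

    // Hypothetical sketch of the retention pass run by RLMTask on the leader.
    class RemoteLogRetentionTask {
        private final RemoteLogMetadataManager rlmm;
        private final RemoteStorageManager rsm;

        RemoteLogRetentionTask(RemoteLogMetadataManager rlmm, RemoteStorageManager rsm) {
            this.rlmm = rlmm;
            this.rsm = rsm;
        }

        // Deletes remote segments that fall entirely before the current logStartOffset.
        void cleanupRemoteSegmentsBefore(TopicPartition tp, long logStartOffset) throws Exception {
            for (RemoteLogSegmentMetadata segment : rlmm.listRemoteLogSegments(tp, 0L)) {
                if (segment.endOffset() < logStartOffset) {
                    rsm.deleteLogSegmentData(segment);            // remove the data from remote storage
                    rlmm.deleteRemoteLogSegmentMetadata(segment); // then remove its metadata
                }
            }
        }
    }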

>100.2 stopReplica with deletion is used in 2 cases (a) replica reassignment; 
>(b) topic deletion. We only want to delete the tiered metadata in the second 
>case. Also, in the second case, who initiates the deletion of the remote 
>segment since the leader may not exist?

Right, it is deleted only in the case of topic deletion. We will cover
the details in the KIP.

>100.3 "LogStartOffset of a topic can be either in local or in remote storage." 
>If LogStartOffset exists in both places, which one is the source of truth?

I meant that logStartOffset can point to either a local segment or a
remote segment, but it is initialised and maintained in the Log class
as it is now.

>100.4 List listRemoteLogSegments(TopicPartition 
>topicPartition, long minOffset): How is minOffset supposed to be used?

It returns the list of remote segments that have baseOffset >= the given
minOffset, sorted by baseOffset in ascending order.
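
For example (purely illustrative), if a partition has remote segments with
baseOffsets 100, 200 and 300, listRemoteLogSegments(tp, 150) would be expected
to return the segments with baseOffsets 200 and 300, in that order:

    // Hypothetical usage of the contract described above.
    List<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(tp, 150L);
    // segments -> [segment(baseOffset=200), segment(baseOffset=300)]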

>100.5 When copying a segment to remote storage, it seems we are calling the 
>same RLMM.putRemoteLogSegmentData() twice before and after copyLogSegment(). 
>Could you explain why?

This is more about prepare/commit/rollback as you suggested.  We will
update the wiki with the new APIs.

>100.6 LogSegmentData includes leaderEpochCache, but there is no api in 
>RemoteStorageManager to retrieve it.

Nice catch, copy/paste issue. There is an API to retrieve it.

>101. If the __remote_log_metadata is for production usage, could you provide 
>more details? For example, what is the schema of the data (both key and 
>value)? How is the topic maintained,delete or compact?

It is configured with the delete cleanup policy, and its retention period is
suggested to be longer than the remote retention period.
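
As a rough illustration, creating such a topic with the AdminClient could look
like the sketch below. The partition count, replication factor and retention
value are example assumptions, not values defined by the KIP:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateRemoteLogMetadataTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Delete cleanup policy; retention set longer than the remote retention period.
                Map<String, String> configs = Map.of(
                        TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                        TopicConfig.RETENTION_MS_CONFIG, String.valueOf(TimeUnit.DAYS.toMillis(365)));
                NewTopic topic = new NewTopic("__remote_log_metadata", 50, (short) 3).configs(configs);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }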

>110. Is the cache implementation in RemoteLogMetadataManager meant for 
>production usage? If so, could you provide more details on the schema and 
>how/where the data is stored?

The proposal is to have a cache (with a default implementation backed by
RocksDB), but it will be added in later versions. We will add this to the
future work items.

>111. "Committed offsets can be stored in a local file". Could you describe the 
>format of the file and where it's stored?

We will cover this in the KIP.

>112. Truncation of remote segments under unclean leader election: I am not 
>sure who figures out the truncated remote segments and how that information is 
>propagated to all replicas?

We will add this in detail in the KIP.

>113. "If there are any failures in removing remote log segments then those are 
>stored in a specific topic (default as __remote_segments_to_be_deleted)". Is 
>it necessary to add yet another internal topic? Could we just keep retrying?

This is not really an internal topic; it will be exposed as a user-configurable
topic. After a few retries, we want the user to know about
the failure so that they can take action later by consuming from
this topic. We want to keep this simple instead of retrying
continuously and maintaining the deletion state, etc.
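
A rough sketch of that behaviour, bounded retries followed by publishing the
failed segment to the configurable topic, is shown below; the method and the
retry policy are illustrative assumptions only:

    // Hypothetical sketch: retry the delete a few times, then hand off to the failure topic.
    static void deleteRemoteSegmentWithRetries(RemoteStorageManager rsm,
                                               Producer<String, String> producer,
                                               RemoteLogSegmentMetadata segment,
                                               String failureTopic,
                                               int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                rsm.deleteLogSegmentData(segment); // assumed RSM delete call
                return;                            // deleted successfully, nothing more to do
            } catch (Exception e) {
                // swallow and retry a bounded number of times
            }
        }
        // Give up: record the failed segment so that users can act on it later by
        // consuming from the configurable failure topic.
        producer.send(new ProducerRecord<>(failureTopic, segment.toString()));
    }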

>114. "We may not need to copy producer-id-snapshot as we are copying only 
>segments earlier to last-stable-offset." Hmm, not sure about that. The 
>producer snapshot includes things like the last timestamp of each open 
>producer id and can affect when those producer ids are expired.

Sure, this will be added as part of the LogSegmentData.

Thanks,
Satish.


On Fri, May 29, 2020 at 6:39 AM Jun Rao  wrote:
>
> Hi, Satish,
>
> Made another pass on the wiki. A few more comments below.
>
> 100. It would be useful to provide more details on how those apis are used. 
> Otherwise, it's kind of hard to really assess whether the new apis are 
> sufficient/redundant. A few examples below.
> 100.1 deleteRecords seems to only advance the logStartOffset in Log. How does 
> that trigger the deletion of remote log segments?
> 100.2 stopReplica with deletion is used in 2 cases (a) replica reassignment; 
> (b) topic deletion. We only want to delete the tiered metadata in the second 
> case. Also, in the second case, who initiates the deletion of the remote 
> segment since the leader may not exist?
> 100.3 "LogStartOffset of a topic can be either in local or in remote 
> storage." If LogStartOffset exists in both places, which one is the source of 
> truth?
> 100.4 List listRemoteLogSegments(TopicPartition 
> topicPartition, long minOffset): How is minOffset supposed to be used?
> 100.5 When 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-28 Thread Jun Rao
Hi, Satish,

Made another pass on the wiki. A few more comments below.

100. It would be useful to provide more details on how those apis are used.
Otherwise, it's kind of hard to really assess whether the new apis are
sufficient/redundant. A few examples below.
100.1 deleteRecords seems to only advance the logStartOffset in Log. How
does that trigger the deletion of remote log segments?
100.2 stopReplica with deletion is used in 2 cases (a) replica
reassignment; (b) topic deletion. We only want to delete the tiered
metadata in the second case. Also, in the second case, who initiates the
deletion of the remote segment since the leader may not exist?
100.3 "LogStartOffset of a topic can be either in local or in remote
storage." If LogStartOffset exists in both places, which one is the source
of truth?
100.4 List listRemoteLogSegments(TopicPartition
topicPartition, long minOffset): How is minOffset supposed to be used?
100.5 When copying a segment to remote storage, it seems we are calling the
same RLMM.putRemoteLogSegmentData() twice before and
after copyLogSegment(). Could you explain why?
100.6 LogSegmentData includes leaderEpochCache, but there is no api
in RemoteStorageManager to retrieve it.

101. If the __remote_log_metadata is for production usage, could you
provide more details? For example, what is the schema of the data (both
key and value)? How is the topic maintained, delete or compact?

110. Is the cache implementation in RemoteLogMetadataManager meant for
production usage? If so, could you provide more details on the schema and
how/where the data is stored?

111. "Committed offsets can be stored in a local file". Could you describe
the format of the file and where it's stored?

112. Truncation of remote segments under unclean leader election: I am not
sure who figures out the truncated remote segments and how that information
is propagated to all replicas?

113. "If there are any failures in removing remote log segments then those
are stored in a specific topic (default as
__remote_segments_to_be_deleted)". Is it necessary to add yet another
internal topic? Could we just keep retrying?

114. "We may not need to copy producer-id-snapshot as we are copying only
segments earlier to last-stable-offset." Hmm, not sure about that. The
producer snapshot includes things like the last timestamp of each open
producer id and can affect when those producer ids are expired.

Thanks,

Jun

On Thu, May 28, 2020 at 5:38 AM Satish Duggana 
wrote:

> Hi Jun,
> Gentle reminder. Please go through the updated wiki and let us know your
> comments.
>
> Thanks,
> Satish.
>
> On Tue, May 19, 2020 at 3:50 PM Satish Duggana 
> wrote:
>
>> Hi Jun,
>> Please go through the wiki which has the latest updates. Google doc is
>> updated frequently to be in sync with wiki.
>>
>> Thanks,
>> Satish.
>>
>> On Tue, May 19, 2020 at 12:30 AM Jun Rao  wrote:
>>
>>> Hi, Satish,
>>>
>>> Thanks for the update. Just to clarify. Which doc has the latest
>>> updates, the wiki or the google doc?
>>>
>>> Jun
>>>
>>> On Thu, May 14, 2020 at 10:38 AM Satish Duggana <
>>> satish.dugg...@gmail.com> wrote:
>>>
 Hi Jun,
 Thanks for your comments.  We updated the KIP with more details.

 >100. For each of the operations related to tiering, it would be useful
 to provide a description on how it works with the new API. These include
 things like consumer fetch, replica fetch, offsetForTimestamp, retention
 (remote and local) by size, time and logStartOffset, topic deletion, etc.
 This will tell us if the proposed APIs are sufficient.

 We addressed most of these APIs in the KIP. We can add more details if
 needed.

 >101. For the default implementation based on internal topic, is it
 meant as a proof of concept or for production usage? I assume that it's the
 former. However, if it's the latter, then the KIP needs to describe the
 design in more detail.

 It is production usage as was mentioned in an earlier mail. We plan to
 update this section in the next few days.

 >102. When tiering a segment, the segment is first written to the
 object store and then its metadata is written to RLMM using the api "void
 putRemoteLogSegmentData()". One potential issue with this approach is
 that if the system fails after the first operation, it leaves a garbage in
 the object store that's never reclaimed. One way to improve this is to have
 two separate APIs, sth like preparePutRemoteLogSegmentData() and
 commitPutRemoteLogSegmentData().

 That is a good point. We currently have a different way using markers
 in the segment but your suggestion is much better.

 >103. It seems that the transactional support and the ability to read
 from follower are missing.

 KIP is updated with transactional support, follower fetch semantics,
 and reading from a follower.

 >104. It would be useful to provide a testing plan for 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-28 Thread Satish Duggana
Hi Jun,
Gentle reminder. Please go through the updated wiki and let us know your
comments.

Thanks,
Satish.

On Tue, May 19, 2020 at 3:50 PM Satish Duggana 
wrote:

> Hi Jun,
> Please go through the wiki which has the latest updates. Google doc is
> updated frequently to be in sync with wiki.
>
> Thanks,
> Satish.
>
> On Tue, May 19, 2020 at 12:30 AM Jun Rao  wrote:
>
>> Hi, Satish,
>>
>> Thanks for the update. Just to clarify. Which doc has the latest updates,
>> the wiki or the google doc?
>>
>> Jun
>>
>> On Thu, May 14, 2020 at 10:38 AM Satish Duggana 
>> wrote:
>>
>>> Hi Jun,
>>> Thanks for your comments.  We updated the KIP with more details.
>>>
>>> >100. For each of the operations related to tiering, it would be useful
>>> to provide a description on how it works with the new API. These include
>>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>>> (remote and local) by size, time and logStartOffset, topic deletion, etc.
>>> This will tell us if the proposed APIs are sufficient.
>>>
>>> We addressed most of these APIs in the KIP. We can add more details if
>>> needed.
>>>
>>> >101. For the default implementation based on internal topic, is it
>>> meant as a proof of concept or for production usage? I assume that it's the
>>> former. However, if it's the latter, then the KIP needs to describe the
>>> design in more detail.
>>>
>>> It is production usage as was mentioned in an earlier mail. We plan to
>>> update this section in the next few days.
>>>
>>> >102. When tiering a segment, the segment is first written to the object
>>> store and then its metadata is written to RLMM using the api "void 
>>> putRemoteLogSegmentData()".
>>> One potential issue with this approach is that if the system fails after
>>> the first operation, it leaves a garbage in the object store that's never
>>> reclaimed. One way to improve this is to have two separate APIs, sth like
>>> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>
>>> That is a good point. We currently have a different way using markers in
>>> the segment but your suggestion is much better.
>>>
>>> >103. It seems that the transactional support and the ability to read
>>> from follower are missing.
>>>
>>> KIP is updated with transactional support, follower fetch semantics, and
>>> reading from a follower.
>>>
>>> >104. It would be useful to provide a testing plan for this KIP.
>>>
>>> We added a few tests by introducing test util for tiered storage in the
>>> PR. We will provide the testing plan in the next few days.
>>>
>>> Thanks,
>>> Satish.
>>>
>>>
>>> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani 
>>> wrote:
>>>




 On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:

> Hi, Satish,
>
> Thanks for the updated doc. The new API seems to be an improvement
> overall. A few more comments below.
>
> 100. For each of the operations related to tiering, it would be useful
> to provide a description on how it works with the new API. These include
> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> (remote and local) by size, time and logStartOffset, topic deletion,
> etc. This will tell us if the proposed APIs are sufficient.
>

 Thanks for the feedback Jun. We will add more details around this.


> 101. For the default implementation based on internal topic, is it
> meant as a proof of concept or for production usage? I assume that it's 
> the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.
>

 Yes it meant to be for production use.  Ideally it would be good to
 merge this in as the default implementation for metadata service. We can
 add more details around design and testing.

 102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the api "void
> putRemoteLogSegmentData()".
> One potential issue with this approach is that if the system fails
> after the first operation, it leaves a garbage in the object store that's
> never reclaimed. One way to improve this is to have two separate APIs, sth
> like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>
> 103. It seems that the transactional support and the ability to read
> from follower are missing.
>
> 104. It would be useful to provide a testing plan for this KIP.
>

 We are working on adding more details around transactional support and
 coming up with test plan.
 Add system tests and integration tests.

 Thanks,
>
> Jun
>
> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <
> satish.dugg...@gmail.com> wrote:
>
> Hi Jun,
> Please look at the earlier reply and let us know your comments.
>
> Thanks,
> Satish.
>
> On Wed, Feb 12, 2020 at 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-19 Thread Satish Duggana
Hi Jun,
Please go through the wiki which has the latest updates. Google doc is
updated frequently to be in sync with wiki.

Thanks,
Satish.

On Tue, May 19, 2020 at 12:30 AM Jun Rao  wrote:

> Hi, Satish,
>
> Thanks for the update. Just to clarify. Which doc has the latest updates,
> the wiki or the google doc?
>
> Jun
>
> On Thu, May 14, 2020 at 10:38 AM Satish Duggana 
> wrote:
>
>> Hi Jun,
>> Thanks for your comments.  We updated the KIP with more details.
>>
>> >100. For each of the operations related to tiering, it would be useful
>> to provide a description on how it works with the new API. These include
>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>> (remote and local) by size, time and logStartOffset, topic deletion, etc.
>> This will tell us if the proposed APIs are sufficient.
>>
>> We addressed most of these APIs in the KIP. We can add more details if
>> needed.
>>
>> >101. For the default implementation based on internal topic, is it meant
>> as a proof of concept or for production usage? I assume that it's the
>> former. However, if it's the latter, then the KIP needs to describe the
>> design in more detail.
>>
>> It is production usage as was mentioned in an earlier mail. We plan to
>> update this section in the next few days.
>>
>> >102. When tiering a segment, the segment is first written to the object
>> store and then its metadata is written to RLMM using the api "void 
>> putRemoteLogSegmentData()".
>> One potential issue with this approach is that if the system fails after
>> the first operation, it leaves a garbage in the object store that's never
>> reclaimed. One way to improve this is to have two separate APIs, sth like
>> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>
>> That is a good point. We currently have a different way using markers in
>> the segment but your suggestion is much better.
>>
>> >103. It seems that the transactional support and the ability to read
>> from follower are missing.
>>
>> KIP is updated with transactional support, follower fetch semantics, and
>> reading from a follower.
>>
>> >104. It would be useful to provide a testing plan for this KIP.
>>
>> We added a few tests by introducing test util for tiered storage in the
>> PR. We will provide the testing plan in the next few days.
>>
>> Thanks,
>> Satish.
>>
>>
>> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani 
>> wrote:
>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:
>>>
 Hi, Satish,

 Thanks for the updated doc. The new API seems to be an improvement
 overall. A few more comments below.

 100. For each of the operations related to tiering, it would be useful
 to provide a description on how it works with the new API. These include
 things like consumer fetch, replica fetch, offsetForTimestamp, retention
 (remote and local) by size, time and logStartOffset, topic deletion,
 etc. This will tell us if the proposed APIs are sufficient.

>>>
>>> Thanks for the feedback Jun. We will add more details around this.
>>>
>>>
 101. For the default implementation based on internal topic, is it
 meant as a proof of concept or for production usage? I assume that it's the
 former. However, if it's the latter, then the KIP needs to describe the
 design in more detail.

>>>
>>> Yes it meant to be for production use.  Ideally it would be good to
>>> merge this in as the default implementation for metadata service. We can
>>> add more details around design and testing.
>>>
>>> 102. When tiering a segment, the segment is first written to the object
 store and then its metadata is written to RLMM using the api "void
 putRemoteLogSegmentData()".
 One potential issue with this approach is that if the system fails
 after the first operation, it leaves a garbage in the object store that's
 never reclaimed. One way to improve this is to have two separate APIs, sth
 like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().

 103. It seems that the transactional support and the ability to read
 from follower are missing.

 104. It would be useful to provide a testing plan for this KIP.

>>>
>>> We are working on adding more details around transactional support and
>>> coming up with test plan.
>>> Add system tests and integration tests.
>>>
>>> Thanks,

 Jun

 On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <
 satish.dugg...@gmail.com> wrote:

 Hi Jun,
 Please look at the earlier reply and let us know your comments.

 Thanks,
 Satish.

 On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <
 satish.dugg...@gmail.com> wrote:

 Hi Jun,
 Thanks for your comments on the separation of remote log metadata
 storage and remote log storage.
 We had a few discussions since early Jan on how to support eventually
 consistent stores like S3 by uncoupling 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-18 Thread Jun Rao
Hi, Satish,

Thanks for the update. Just to clarify. Which doc has the latest updates,
the wiki or the google doc?

Jun

On Thu, May 14, 2020 at 10:38 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for your comments.  We updated the KIP with more details.
>
> >100. For each of the operations related to tiering, it would be useful to
> provide a description on how it works with the new API. These include
> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> (remote and local) by size, time and logStartOffset, topic deletion, etc.
> This will tell us if the proposed APIs are sufficient.
>
> We addressed most of these APIs in the KIP. We can add more details if
> needed.
>
> >101. For the default implementation based on internal topic, is it meant
> as a proof of concept or for production usage? I assume that it's the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.
>
> It is production usage as was mentioned in an earlier mail. We plan to
> update this section in the next few days.
>
> >102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the api "void 
> putRemoteLogSegmentData()".
> One potential issue with this approach is that if the system fails after
> the first operation, it leaves a garbage in the object store that's never
> reclaimed. One way to improve this is to have two separate APIs, sth like
> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>
> That is a good point. We currently have a different way using markers in
> the segment but your suggestion is much better.
>
> >103. It seems that the transactional support and the ability to read
> from follower are missing.
>
> KIP is updated with transactional support, follower fetch semantics, and
> reading from a follower.
>
> >104. It would be useful to provide a testing plan for this KIP.
>
> We added a few tests by introducing test util for tiered storage in the
> PR. We will provide the testing plan in the next few days.
>
> Thanks,
> Satish.
>
>
> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani 
> wrote:
>
>>
>>
>>
>>
>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:
>>
>>> Hi, Satish,
>>>
>>> Thanks for the updated doc. The new API seems to be an improvement
>>> overall. A few more comments below.
>>>
>>> 100. For each of the operations related to tiering, it would be useful
>>> to provide a description on how it works with the new API. These include
>>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>>> (remote and local) by size, time and logStartOffset, topic deletion,
>>> etc. This will tell us if the proposed APIs are sufficient.
>>>
>>
>> Thanks for the feedback Jun. We will add more details around this.
>>
>>
>>> 101. For the default implementation based on internal topic, is it meant
>>> as a proof of concept or for production usage? I assume that it's the
>>> former. However, if it's the latter, then the KIP needs to describe the
>>> design in more detail.
>>>
>>
>> Yes it meant to be for production use.  Ideally it would be good to merge
>> this in as the default implementation for metadata service. We can add more
>> details around design and testing.
>>
>> 102. When tiering a segment, the segment is first written to the object
>>> store and then its metadata is written to RLMM using the api "void
>>> putRemoteLogSegmentData()".
>>> One potential issue with this approach is that if the system fails after
>>> the first operation, it leaves a garbage in the object store that's never
>>> reclaimed. One way to improve this is to have two separate APIs, sth like
>>> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>
>>> 103. It seems that the transactional support and the ability to read
>>> from follower are missing.
>>>
>>> 104. It would be useful to provide a testing plan for this KIP.
>>>
>>
>> We are working on adding more details around transactional support and
>> coming up with test plan.
>> Add system tests and integration tests.
>>
>> Thanks,
>>>
>>> Jun
>>>
>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana 
>>> wrote:
>>>
>>> Hi Jun,
>>> Please look at the earlier reply and let us know your comments.
>>>
>>> Thanks,
>>> Satish.
>>>
>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana 
>>> wrote:
>>>
>>> Hi Jun,
>>> Thanks for your comments on the separation of remote log metadata
>>> storage and remote log storage.
>>> We had a few discussions since early Jan on how to support eventually
>>> consistent stores like S3 by uncoupling remote log segment metadata and
>>> remote log storage. It is written with details in the doc here(1). Below is
>>> the brief summary of the discussion from that doc.
>>>
>>> The current approach consists of pulling the remote log segment metadata
>>> from remote log storage APIs. It worked fine for storages like HDFS. But
>>> one of the problems of relying on the remote storage 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-17 Thread Alexandre Dupriez
Hi Satish,

Thank you for your updates.

I have some questions around potential use cases when unclean leader
election is enabled.

It is possible that a range of offsets of a segment which is already
offloaded to tiered storage is included in the range of offsets to be
truncated. A follower, which is far behind its leader and is uncleanly
becoming the topic-partition leader replica, may even require new
followers to discard entire segments which were previously offloaded.

It seldom happens in practice, yet it is a valid case which is
handled by the log truncation semantics (as described in KIP-279), and
is required to prevent divergence of log lineage between replicas
after clean and unclean election of new leaders.

The log truncation algorithm is currently implemented for local
(non-tiered) log segments. How should remote segments, which would be
truncated if they were local, be processed?

Applying the same type of transformation to tiered segments (that is,
deletion and truncation of files) would generate additional traffic,
make leader migration more resource-intensive and time-consuming, and
introduce significant delays which are not incurred with local file
system operations.

Another approach could be to design the remote segment metadata such that
it can handle truncation use cases. It would need to support
segments which are part of a discarded log lineage, and support
segment truncation.
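
To make the idea concrete, here is a purely hypothetical shape for such
metadata (not a proposal from the KIP): each remote segment carries a lifecycle
state and the leader epoch lineage under which it was offloaded, so segments of
a discarded lineage can be marked rather than physically rewritten:

    // Hypothetical illustration only.
    enum RemoteSegmentState { COPY_IN_PROGRESS, COPY_FINISHED, MARKED_FOR_DELETION, DELETED }

    final class RemoteSegmentMetadataSketch {
        final String segmentId;
        final long baseOffset;
        final long endOffset;
        // Leader epoch -> start offset lineage under which this segment was offloaded.
        // A newly elected leader can compare this with its own epoch cache to decide
        // whether the segment belongs to its lineage or to a discarded one.
        final java.util.Map<Integer, Long> segmentLeaderEpochs;
        RemoteSegmentState state;

        RemoteSegmentMetadataSketch(String segmentId, long baseOffset, long endOffset,
                                    java.util.Map<Integer, Long> segmentLeaderEpochs,
                                    RemoteSegmentState state) {
            this.segmentId = segmentId;
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
            this.segmentLeaderEpochs = segmentLeaderEpochs;
            this.state = state;
        }
    }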

In the case of local segments, all modifications to a log are protected
by a log-wide lock. The same determinism and associated semantics
need to be preserved for remote segments, which makes the update of
metadata an integral part of log modifications. We need to think about
how such updates are propagated to other brokers such that replication
semantics are not violated. For instance, when a replica is being
truncated, it should not be possible for consumers to access the
replica under truncation from another broker via their common storage
tier.

This assumes the same guarantees on log lineage apply to remote
segments as they do for local segments, including under unclean leader
election.

I think these guarantees are too fundamental to tolerate a diverging
behaviour depending on whether a segment is “local” or “remote”.

What do you think?

Thanks,
Alexandre

On Thu, May 14, 2020 at 18:38, Satish Duggana  wrote:
>
> Hi Jun,
> Thanks for your comments.  We updated the KIP with more details.
>
> >100. For each of the operations related to tiering, it would be useful to
> provide a description on how it works with the new API. These include
> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> (remote and local) by size, time and logStartOffset, topic deletion, etc.
> This will tell us if the proposed APIs are sufficient.
>
> We addressed most of these APIs in the KIP. We can add more details if
> needed.
>
> >101. For the default implementation based on internal topic, is it meant
> as a proof of concept or for production usage? I assume that it's the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.
>
> It is production usage as was mentioned in an earlier mail. We plan to
> update this section in the next few days.
>
> >102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the api "void
> putRemoteLogSegmentData()".
> One potential issue with this approach is that if the system fails after
> the first operation, it leaves a garbage in the object store that's never
> reclaimed. One way to improve this is to have two separate APIs, sth like
> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>
> That is a good point. We currently have a different way using markers in
> the segment but your suggestion is much better.
>
> >103. It seems that the transactional support and the ability to read from
> follower are missing.
>
> KIP is updated with transactional support, follower fetch semantics, and
> reading from a follower.
>
> >104. It would be useful to provide a testing plan for this KIP.
>
> We added a few tests by introducing test util for tiered storage in the PR.
> We will provide the testing plan in the next few days.
>
> Thanks,
> Satish.
>
>
> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani  wrote:
>
> >
> >
> >
> >
> > On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:
> >
> >> Hi, Satish,
> >>
> >> Thanks for the updated doc. The new API seems to be an improvement
> >> overall. A few more comments below.
> >>
> >> 100. For each of the operations related to tiering, it would be useful to
> >> provide a description on how it works with the new API. These include
> >> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> >> (remote and local) by size, time and logStartOffset, topic deletion, etc.
> >> This will tell us if the proposed APIs are sufficient.
> >>
> >
> > Thanks for the feedback Jun. We will add more details 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-05-14 Thread Satish Duggana
Hi Jun,
Thanks for your comments.  We updated the KIP with more details.

>100. For each of the operations related to tiering, it would be useful to
provide a description on how it works with the new API. These include
things like consumer fetch, replica fetch, offsetForTimestamp, retention
(remote and local) by size, time and logStartOffset, topic deletion, etc.
This will tell us if the proposed APIs are sufficient.

We addressed most of these APIs in the KIP. We can add more details if
needed.

>101. For the default implementation based on internal topic, is it meant
as a proof of concept or for production usage? I assume that it's the
former. However, if it's the latter, then the KIP needs to describe the
design in more detail.

It is meant for production usage, as mentioned in an earlier mail. We plan to
update this section in the next few days.

>102. When tiering a segment, the segment is first written to the object
store and then its metadata is written to RLMM using the api "void
putRemoteLogSegmentData()".
One potential issue with this approach is that if the system fails after
the first operation, it leaves a garbage in the object store that's never
reclaimed. One way to improve this is to have two separate APIs, sth like
preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().

That is a good point. We currently have a different way using markers in
the segment but your suggestion is much better.

>103. It seems that the transactional support and the ability to read from
follower are missing.

KIP is updated with transactional support, follower fetch semantics, and
reading from a follower.

>104. It would be useful to provide a testing plan for this KIP.

We added a few tests by introducing test util for tiered storage in the PR.
We will provide the testing plan in the next few days.

Thanks,
Satish.


On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani  wrote:

>
>
>
>
> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:
>
>> Hi, Satish,
>>
>> Thanks for the updated doc. The new API seems to be an improvement
>> overall. A few more comments below.
>>
>> 100. For each of the operations related to tiering, it would be useful to
>> provide a description on how it works with the new API. These include
>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>> (remote and local) by size, time and logStartOffset, topic deletion, etc.
>> This will tell us if the proposed APIs are sufficient.
>>
>
> Thanks for the feedback Jun. We will add more details around this.
>
>
>> 101. For the default implementation based on internal topic, is it meant
>> as a proof of concept or for production usage? I assume that it's the
>> former. However, if it's the latter, then the KIP needs to describe the
>> design in more detail.
>>
>
> Yes it meant to be for production use.  Ideally it would be good to merge
> this in as the default implementation for metadata service. We can add more
> details around design and testing.
>
> 102. When tiering a segment, the segment is first written to the object
>> store and then its metadata is written to RLMM using the api "void
>> putRemoteLogSegmentData()".
>> One potential issue with this approach is that if the system fails after
>> the first operation, it leaves a garbage in the object store that's never
>> reclaimed. One way to improve this is to have two separate APIs, sth like
>> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>
>> 103. It seems that the transactional support and the ability to read from
>> follower are missing.
>>
>> 104. It would be useful to provide a testing plan for this KIP.
>>
>
> We are working on adding more details around transactional support and
> coming up with test plan.
> Add system tests and integration tests.
>
> Thanks,
>>
>> Jun
>>
>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana 
>> wrote:
>>
>> Hi Jun,
>> Please look at the earlier reply and let us know your comments.
>>
>> Thanks,
>> Satish.
>>
>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana 
>> wrote:
>>
>> Hi Jun,
>> Thanks for your comments on the separation of remote log metadata storage
>> and remote log storage.
>> We had a few discussions since early Jan on how to support eventually
>> consistent stores like S3 by uncoupling remote log segment metadata and
>> remote log storage. It is written with details in the doc here(1). Below is
>> the brief summary of the discussion from that doc.
>>
>> The current approach consists of pulling the remote log segment metadata
>> from remote log storage APIs. It worked fine for storages like HDFS. But
>> one of the problems of relying on the remote storage to maintain metadata
>> is that tiered-storage needs to be strongly consistent, with an impact not
>> only on the metadata(e.g. LIST in S3) but also on the segment data(e.g. GET
>> after a DELETE in S3). The cost of maintaining metadata in remote storage
>> needs to be factored in. This is true in the case of S3, LIST APIs incur
>> huge 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-02-26 Thread Harsha Chintalapani
On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao  wrote:

> Hi, Satish,
>
> Thanks for the updated doc. The new API seems to be an improvement
> overall. A few more comments below.
>
> 100. For each of the operations related to tiering, it would be useful to
> provide a description on how it works with the new API. These include
> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> (remote and local) by size, time and logStartOffset, topic deletion, etc.
> This will tell us if the proposed APIs are sufficient.
>

Thanks for the feedback Jun. We will add more details around this.


> 101. For the default implementation based on internal topic, is it meant
> as a proof of concept or for production usage? I assume that it's the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.
>

Yes, it is meant to be for production use. Ideally, it would be good to merge
this in as the default implementation for the metadata service. We can add more
details around design and testing.

102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the api "void
> putRemoteLogSegmentData()".
> One potential issue with this approach is that if the system fails after
> the first operation, it leaves a garbage in the object store that's never
> reclaimed. One way to improve this is to have two separate APIs, sth like
> preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>
> 103. It seems that the transactional support and the ability to read from
> follower are missing.
>
> 104. It would be useful to provide a testing plan for this KIP.
>

We are working on adding more details around transactional support and
coming up with a test plan. We will add system tests and integration tests.

Thanks,
>
> Jun
>
> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana 
> wrote:
>
> Hi Jun,
> Please look at the earlier reply and let us know your comments.
>
> Thanks,
> Satish.
>
> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana 
> wrote:
>
> Hi Jun,
> Thanks for your comments on the separation of remote log metadata storage
> and remote log storage.
> We had a few discussions since early Jan on how to support eventually
> consistent stores like S3 by uncoupling remote log segment metadata and
> remote log storage. It is written with details in the doc here(1). Below is
> the brief summary of the discussion from that doc.
>
> The current approach consists of pulling the remote log segment metadata
> from remote log storage APIs. It worked fine for storages like HDFS. But
> one of the problems of relying on the remote storage to maintain metadata
> is that tiered-storage needs to be strongly consistent, with an impact not
> only on the metadata(e.g. LIST in S3) but also on the segment data(e.g. GET
> after a DELETE in S3). The cost of maintaining metadata in remote storage
> needs to be factored in. This is true in the case of S3, LIST APIs incur
> huge costs as you raised earlier.
> So, it is good to separate the remote storage from the remote log metadata
> store. We refactored the existing RemoteStorageManager and introduced
> RemoteLogMetadataManager. Remote log metadata store should give strong
> consistency semantics but remote log storage can be eventually consistent.
> We can have a default implementation for RemoteLogMetadataManager which
> uses an internal topic(as mentioned in one of our earlier emails) as
> storage. But users can always plugin their own RemoteLogMetadataManager
> implementation based on their environment.
>
> Please go through the updated KIP and let us know your comments. We have
> started refactoring for the changes mentioned in the KIP and there may be a
> few more updates to the APIs.
>
> [1]
>
> https://docs.google.com/document/d/
> 1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
>
> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko 
>
> wrote:
>
> Hi all,
>
> Jun:
>
> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> have 100,000 partitions and want to pull the metadata for each partition at
> the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
>
> I want to note here, that no reasonably durable storage will be cheap at
> 100k RPS. For example, DynamoDB might give the same ballpark figures.
> If we want to keep the pull-based approach, we can try to reduce this number
> in several ways: doing listings less frequently (as Satish mentioned, with
> the current defaults it's ~3.33k RPS for your example), batching listing
> operations in some way (depending on the storage; it might require the
> change of RSM's interface).
>
> There are different ways for doing push based metadata propagation. Some
> object stores may support that already. For example, S3 supports events
> notification
>
> This sounds interesting. However, I see a couple of issues using it:
> 1. As I understand the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-02-25 Thread Jun Rao
Hi, Satish,

Thanks for the updated doc. The new API seems to be an improvement overall.
A few more comments below.

100. For each of the operations related to tiering, it would be useful to
provide a description on how it works with the new API. These include
things like consumer fetch, replica fetch, offsetForTimestamp, retention
(remote and local) by size, time and logStartOffset, topic deletion, etc.
This will tell us if the proposed APIs are sufficient.

101. For the default implementation based on internal topic, is it meant as
a proof of concept or for production usage? I assume that it's the former.
However, if it's the latter, then the KIP needs to describe the design in
more detail.

102. When tiering a segment, the segment is first written to the object
store and then its metadata is written to RLMM using the api "void
putRemoteLogSegmentData()".
One potential issue with this approach is that if the system fails after
the first operation, it leaves garbage in the object store that's never
reclaimed. One way to improve this is to have two separate APIs, something like
preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
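
As a sketch of what such a split could look like (type and method names here
are illustrative only, assuming a RemoteLogSegmentMetadata value object):

    // Hypothetical two-phase metadata API, as suggested above.
    public interface TwoPhaseRemoteLogMetadataManager {
        // Phase 1: record the intent to copy a segment before any data is written to
        // the object store, so an orphaned copy can later be identified and reclaimed.
        void preparePutRemoteLogSegmentData(RemoteLogSegmentMetadata metadata) throws java.io.IOException;

        // Phase 2: mark the copy as committed once the segment data is safely stored,
        // which makes the segment visible to readers; anything prepared but never
        // committed is eligible for cleanup.
        void commitPutRemoteLogSegmentData(RemoteLogSegmentMetadata metadata) throws java.io.IOException;
    }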

103. It seems that the transactional support and the ability to read from
follower are missing.

104. It would be useful to provide a testing plan for this KIP.

Thanks,

Jun

On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana 
wrote:

> Hi Jun,
> Please look at the earlier reply and let us know your comments.
>
> Thanks,
> Satish.
>
> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana 
> wrote:
> >
> > Hi Jun,
> > Thanks for your comments on the separation of remote log metadata
> > storage and remote log storage.
> > We had a few discussions since early Jan on how to support eventually
> > consistent stores like S3 by uncoupling remote log segment metadata
> > and remote log storage. It is written with details in the doc here(1).
> > Below is the brief summary of the discussion from that doc.
> >
> > The current approach consists of pulling the remote log segment
> > metadata from remote log storage APIs. It worked fine for storages
> > like HDFS. But one of the problems of relying on the remote storage to
> > maintain metadata is that tiered-storage needs to be strongly
> > consistent, with an impact not only on the metadata(e.g. LIST in S3)
> > but also on the segment data(e.g. GET after a DELETE in S3). The cost
> > of maintaining metadata in remote storage needs to be factored in.
> > This is true in the case of S3, LIST APIs incur huge costs as you
> > raised earlier.
> > So, it is good to separate the remote storage from the remote log
> > metadata store. We refactored the existing RemoteStorageManager and
> > introduced RemoteLogMetadataManager. Remote log metadata store should
> > give strong consistency semantics but remote log storage can be
> > eventually consistent.
> > We can have a default implementation for RemoteLogMetadataManager
> > which uses an internal topic(as mentioned in one of our earlier
> > emails) as storage. But users can always plugin their own
> > RemoteLogMetadataManager implementation based on their environment.
> >
> > Please go through the updated KIP and let us know your comments. We
> > have started refactoring for the changes mentioned in the KIP and
> > there may be a few more updates to the APIs.
> >
> > [1]
> https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
> >
> > On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko 
> wrote:
> > >
> > > Hi all,
> > >
> > >
> > > Jun:
> > > > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > > > have 100,000 partitions and want to pull the metadata for each partition
> > > > at the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> > >
> > > I want to note here, that no reasonably durable storage will be cheap
> > > at 100k RPS. For example, DynamoDB might give the same ballpark
> figures.
> > > If we want to keep the pull-based approach, we can try to reduce this number
> > > in several ways: doing listings less frequently (as Satish mentioned,
> > > with the current defaults it's ~3.33k RPS for your example),
> > > batching listing operations in some way (depending on the storage;
> > > it might require the change of RSM's interface).
> > >
> > >
> > > > There are different ways for doing push based metadata propagation.
> Some
> > > > object stores may support that already. For example, S3 supports
> events
> > > > notification
> > > This sounds interesting. However, I see a couple of issues using it:
> > >   1. As I understand the documentation, notification delivery is not
> > > guaranteed
> > > and it's recommended to periodically do LIST to fill the gaps.
> > > Which brings us back to the same LIST consistency guarantees issue.
> > >   2. The same goes for the broker start: to get the current state, we
> need
> > > to LIST.
> > >   3. The dynamic set of multiple consumers (RSMs): 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-02-24 Thread Satish Duggana
Hi Jun,
Please look at the earlier reply and let us know your comments.

Thanks,
Satish.

On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana  wrote:
>
> Hi Jun,
> Thanks for your comments on the separation of remote log metadata
> storage and remote log storage.
> We had a few discussions since early Jan on how to support eventually
> consistent stores like S3 by uncoupling remote log segment metadata
> and remote log storage. It is written with details in the doc here(1).
> Below is the brief summary of the discussion from that doc.
>
> The current approach consists of pulling the remote log segment
> metadata from remote log storage APIs. It worked fine for storages
> like HDFS. But one of the problems of relying on the remote storage to
> maintain metadata is that tiered-storage needs to be strongly
> consistent, with an impact not only on the metadata(e.g. LIST in S3)
> but also on the segment data(e.g. GET after a DELETE in S3). The cost
> of maintaining metadata in remote storage needs to be factored in.
> This is true in the case of S3, LIST APIs incur huge costs as you
> raised earlier.
> So, it is good to separate the remote storage from the remote log
> metadata store. We refactored the existing RemoteStorageManager and
> introduced RemoteLogMetadataManager. Remote log metadata store should
> give strong consistency semantics but remote log storage can be
> eventually consistent.
> We can have a default implementation for RemoteLogMetadataManager
> which uses an internal topic(as mentioned in one of our earlier
> emails) as storage. But users can always plugin their own
> RemoteLogMetadataManager implementation based on their environment.
>
> Please go through the updated KIP and let us know your comments. We
> have started refactoring for the changes mentioned in the KIP and
> there may be a few more updates to the APIs.
>
> [1] 
> https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
>
> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko  
> wrote:
> >
> > Hi all,
> >
> >
> > Jun:
> > > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > > have 100,000 partitions and want to pull the metadata for each partition
> > at
> > > the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> >
> > I want to note here, that no reasonably durable storage will be cheap
> > at 100k RPS. For example, DynamoDB might give the same ballpark figures.
> > If we want to keep the pull-based approach, we can try to reduce this number
> > in several ways: doing listings less frequently (as Satish mentioned,
> > with the current defaults it's ~3.33k RPS for your example),
> > batching listing operations in some way (depending on the storage;
> > it might require the change of RSM's interface).
> >
> >
> > > There are different ways for doing push based metadata propagation. Some
> > > object stores may support that already. For example, S3 supports events
> > > notification
> > This sounds interesting. However, I see a couple of issues using it:
> >   1. As I understand the documentation, notification delivery is not
> > guaranteed
> > and it's recommended to periodically do LIST to fill the gaps.
> > Which brings us back to the same LIST consistency guarantees issue.
> >   2. The same goes for the broker start: to get the current state, we need
> > to LIST.
> >   3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't
> > designed for such a case.
> >
> >
> > Alexandre:
> > > A.1 As commented on PR 7561, S3 consistency model [1][2] implies RSM
> > cannot
> > > relies solely on S3 APIs to guarantee the expected strong consistency. The
> > > proposed implementation [3] would need to be updated to take this into
> > > account. Let’s talk more about this.
> >
> > Thank you for the feedback. I clearly see the need for changing the S3
> > implementation
> > to provide stronger consistency guarantees. As it see from this thread,
> > there are
> > several possible approaches to this. Let's discuss RemoteLogManager's
> > contract and
> > behavior (like pull vs push model) further before picking one (or several -
> > ?) of them.
> > I'm going to do some evaluation of DynamoDB for the pull-based approach,
> > if it's possible to apply it paying a reasonable bill. Also, of the
> > push-based approach
> > with a Kafka topic as the medium.
> >
> >
> > > A.2.3 Atomicity – what does an implementation of RSM need to provide with
> > > respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
> > > deleteTopicPartition? If a partial failure happens in any of those (e.g.
> > in
> > > the S3 implementation, if one of the multiple uploads fails [4]),
> >
> > The S3 implementation is going to change, but it's worth clarifying anyway.
> > The segment log file is being uploaded after S3 has acked uploading of
> > all other files associated with the segment and only after this the whole
> > segment file set becomes visible 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2020-02-12 Thread Satish Duggana
Hi Jun,
Thanks for your comments on the separation of remote log metadata
storage and remote log storage.
We had a few discussions since early Jan on how to support eventually
consistent stores like S3 by decoupling remote log segment metadata
from remote log storage. The details are written in the doc here [1].
Below is a brief summary of the discussion from that doc.

The current approach consists of pulling the remote log segment
metadata from remote log storage APIs. It worked fine for storages
like HDFS. But one of the problems of relying on the remote storage to
maintain metadata is that tiered storage then needs the remote storage
to be strongly consistent, with an impact not only on the metadata
(e.g. LIST in S3) but also on the segment data (e.g. GET after a DELETE
in S3). The cost of maintaining metadata in remote storage also needs
to be factored in; this is especially true for S3, where LIST APIs
incur huge costs, as you raised earlier.
So, it is good to separate the remote storage from the remote log
metadata store. We refactored the existing RemoteStorageManager and
introduced RemoteLogMetadataManager. The remote log metadata store
should give strong consistency semantics, but the remote log storage
can be eventually consistent.
We can have a default implementation of RemoteLogMetadataManager
which uses an internal topic (as mentioned in one of our earlier
emails) as storage. But users can always plug in their own
RemoteLogMetadataManager implementation based on their environment.
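For concreteness, the rough shape we have in mind is something like the
following; we are still refactoring, so the method names and signatures
below are only a sketch and may change:

import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.Optional;

import org.apache.kafka.common.Configurable;
import org.apache.kafka.common.TopicPartition;

// Sketch only; RemoteLogSegmentMetadata stands for the per-segment metadata
// described in the KIP, and the exact methods are still being refactored.
public interface RemoteLogMetadataManager extends Configurable, Closeable {

    // Store metadata for a segment copied to remote storage. The backing store
    // (e.g. an internal Kafka topic in the default implementation) must be
    // strongly consistent.
    void putRemoteLogSegmentData(RemoteLogSegmentMetadata metadata) throws IOException;

    // Return metadata for the remote segment containing the given offset, if any.
    Optional<RemoteLogSegmentMetadata> remoteLogSegmentMetadata(TopicPartition topicPartition,
                                                                long offset) throws IOException;

    // List metadata for all remote segments of the given topic-partition.
    List<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition topicPartition) throws IOException;
}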

Please go through the updated KIP and let us know your comments. We
have started refactoring for the changes mentioned in the KIP and
there may be a few more updates to the APIs.

[1] 
https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#

On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko  wrote:
>
> Hi all,
>
>
> Jun:
> > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > have 100,000 partitions and want to pull the metadata for each partition
> at
> > the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
>
> I want to note here, that no reasonably durable storage will be cheap
> at 100k RPS. For example, DynamoDB might give the same ballpark figures.
> If we want to keep the pull-based approach, we can try to reduce this number
> in several ways: doing listings less frequently (as Satish mentioned,
> with the current defaults it's ~3.33k RPS for your example),
> batching listing operations in some way (depending on the storage;
> it might require the change of RSM's interface).
>
>
> > There are different ways for doing push based metadata propagation. Some
> > object stores may support that already. For example, S3 supports events
> > notification
> This sounds interesting. However, I see a couple of issues using it:
>   1. As I understand the documentation, notification delivery is not
> guaranteed
> and it's recommended to periodically do LIST to fill the gaps.
> Which brings us back to the same LIST consistency guarantees issue.
>   2. The same goes for the broker start: to get the current state, we need
> to LIST.
>   3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't
> designed for such a case.
>
>
> Alexandre:
> > A.1 As commented on PR 7561, S3 consistency model [1][2] implies RSM
> cannot
> > relies solely on S3 APIs to guarantee the expected strong consistency. The
> > proposed implementation [3] would need to be updated to take this into
> > account. Let’s talk more about this.
>
> Thank you for the feedback. I clearly see the need for changing the S3
> implementation
> to provide stronger consistency guarantees. As it see from this thread,
> there are
> several possible approaches to this. Let's discuss RemoteLogManager's
> contract and
> behavior (like pull vs push model) further before picking one (or several -
> ?) of them.
> I'm going to do some evaluation of DynamoDB for the pull-based approach,
> if it's possible to apply it paying a reasonable bill. Also, of the
> push-based approach
> with a Kafka topic as the medium.
>
>
> > A.2.3 Atomicity – what does an implementation of RSM need to provide with
> > respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
> > deleteTopicPartition? If a partial failure happens in any of those (e.g.
> in
> > the S3 implementation, if one of the multiple uploads fails [4]),
>
> The S3 implementation is going to change, but it's worth clarifying anyway.
> The segment log file is being uploaded after S3 has acked uploading of
> all other files associated with the segment and only after this the whole
> segment file set becomes visible remotely for operations like
> listRemoteSegments [1].
> In case of upload failure, the files that has been successfully uploaded
> stays
> as invisible garbage that is collected by cleanupLogUntil (or overwritten
> successfully later).
> And the opposite happens during the deletion: log files are deleted first.
> This approach should generally 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-27 Thread Ivan Yurchenko
Hi all,


Jun:
> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> have 100,000 partitions and want to pull the metadata for each partition
at
> the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.

I want to note here that no reasonably durable storage will be cheap
at 100k RPS. For example, DynamoDB might give the same ballpark figures.
If we want to keep the pull-based approach, we can try to reduce this number
in several ways: doing listings less frequently (as Satish mentioned,
with the current defaults it's ~3.33k RPS for your example), or
batching listing operations in some way (depending on the storage;
it might require changing RSM's interface).
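To spell out the arithmetic (taking the $0.005 per 1000 LIST requests price
quoted above): 100,000 requests/sec * $0.005 / 1000 = $0.5/sec, i.e. about
$43K per day, while 100,000 partitions polled every 30 seconds is
~3,333 requests/sec, i.e. about $0.017/sec or roughly $1.4K per day.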


> There are different ways for doing push based metadata propagation. Some
> object stores may support that already. For example, S3 supports events
> notification
This sounds interesting. However, I see a couple of issues with using it:
  1. As I understand the documentation, notification delivery is not
guaranteed, and it's recommended to periodically do a LIST to fill the gaps,
which brings us back to the same LIST consistency guarantees issue.
  2. The same goes for broker start: to get the current state, we need
to LIST.
  3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't
designed for such a case.


Alexandre:
> A.1 As commented on PR 7561, S3 consistency model [1][2] implies RSM
cannot
> relies solely on S3 APIs to guarantee the expected strong consistency. The
> proposed implementation [3] would need to be updated to take this into
> account. Let’s talk more about this.

Thank you for the feedback. I clearly see the need for changing the S3
implementation to provide stronger consistency guarantees. As I see from
this thread, there are several possible approaches to this. Let's discuss
RemoteLogManager's contract and behavior (like pull vs push model) further
before picking one (or several?) of them.
I'm going to do some evaluation of DynamoDB for the pull-based approach,
to see whether it can be applied at a reasonable cost, and also of the
push-based approach with a Kafka topic as the medium.


> A.2.3 Atomicity – what does an implementation of RSM need to provide with
> respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
> deleteTopicPartition? If a partial failure happens in any of those (e.g.
in
> the S3 implementation, if one of the multiple uploads fails [4]),

The S3 implementation is going to change, but it's worth clarifying anyway.
The segment log file is uploaded only after S3 has acknowledged the upload
of all other files associated with the segment, and only after this does the
whole segment file set become visible remotely for operations like
listRemoteSegments [1].
In case of an upload failure, the files that have been successfully uploaded
stay as invisible garbage that is collected by cleanupLogUntil (or overwritten
successfully later).
The opposite happens during deletion: the log file is deleted first.
This approach should generally work when we solve consistency issues
by adding a strongly consistent storage: a segment's uploaded files remain
invisible garbage until some metadata about them is written.
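To illustrate the ordering (only the ordering; the bucket/key layout and the
set of per-segment files below are simplified and not the actual PR code):

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;

// Sketch of the upload/delete ordering only; the key layout is illustrative.
class SegmentCopier {
    private final AmazonS3 s3;
    private final String bucket;

    SegmentCopier(AmazonS3 s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    void copySegment(String keyPrefix, File offsetIndex, File timeIndex, File log) {
        s3.putObject(bucket, keyPrefix + "/offset-index", offsetIndex); // auxiliary files first
        s3.putObject(bucket, keyPrefix + "/time-index", timeIndex);
        s3.putObject(bucket, keyPrefix + "/log", log);                  // log file last: only now does the
                                                                        // segment become "visible" to listings
    }

    void deleteSegment(String keyPrefix) {
        s3.deleteObject(bucket, keyPrefix + "/log");                    // reverse order on deletion: log first,
        s3.deleteObject(bucket, keyPrefix + "/time-index");             // so a partially deleted segment never
        s3.deleteObject(bucket, keyPrefix + "/offset-index");           // looks complete
    }
}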


> A.3 Caching – storing locally the segments retrieved from the remote
> storage is excluded as it does not align with the original intent and even
> defeat some of its purposes (save disk space etc.). That said, could there
> be other types of use cases where the pattern of access to the remotely
> stored segments would benefit from local caching (and potentially
> read-ahead)? Consider the use case of a large pool of consumers which
start
> a backfill at the same time for one day worth of data from one year ago
> stored remotely. Caching the segments locally would allow to uncouple the
> load on the remote storage from the load on the Kafka cluster. Maybe the
> RLM could expose a configuration parameter to switch that feature on/off?

I tend to agree here; caching remote segments locally and making
this configurable sounds pretty practical to me. We should implement this,
though maybe not in the first iteration.


Br,
Ivan

[1]
https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152

On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez 
wrote:

> Hi Jun,
>
> Thank you for the feedback. I am trying to understand how a push-based
> approach would work.
> In order for the metadata to be propagated (under the assumption you
> stated), would you plan to add a new API in Kafka to allow the
> metadata store to send them directly to the brokers?
>
> Thanks,
> Alexandre
>
>
> Le mer. 18 déc. 2019 à 20:14, Jun Rao  a écrit :
> >
> > Hi, Satish and Ivan,
> >
> > There are different ways for doing push based metadata propagation. Some
> > object stores may support that already. For example, S3 supports events
> > notification (
> > https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
> > Otherwise one could 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-19 Thread Alexandre Dupriez
Hi Jun,

Thank you for the feedback. I am trying to understand how a push-based
approach would work.
In order for the metadata to be propagated (under the assumption you
stated), would you plan to add a new API in Kafka to allow the
metadata store to send it directly to the brokers?

Thanks,
Alexandre


Le mer. 18 déc. 2019 à 20:14, Jun Rao  a écrit :
>
> Hi, Satish and Ivan,
>
> There are different ways for doing push based metadata propagation. Some
> object stores may support that already. For example, S3 supports events
> notification (
> https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
> Otherwise one could use a separate metadata store that supports push-based
> change propagation. Other people have mentioned using a Kafka topic. The
> best approach may depend on the object store and the operational
> environment (e.g. whether an external metadata store is already available).
>
> The above discussion is based on the assumption that we need to cache the
> object metadata locally in every broker. I mentioned earlier that an
> alternative is to just store/retrieve those metadata in an external
> metadata store. That may simplify the implementation in some cases.
>
> Thanks,
>
> Jun
>
> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> > Thanks for your reply.
> >
> > Currently, `listRemoteSegments` is called at the configured
> > interval(not every second, defaults to 30secs). Storing remote log
> > metadata in a strongly consistent store for S3 RSM is raised in
> > PR-comment[1].
> > RLM invokes RSM at regular intervals and RSM can give remote segment
> > metadata if it is available. RSM is responsible for maintaining and
> > fetching those entries. It should be based on whatever mechanism is
> > consistent and efficient with the respective remote storage.
> >
> > Can you give more details about push based mechanism from RSM?
> >
> > 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
> >
> > Thanks,
> > Satish.
> >
> > On Thu, Dec 5, 2019 at 4:23 AM Jun Rao  wrote:
> > >
> > > Hi, Harsha,
> > >
> > > Thanks for the reply.
> > >
> > > 40/41. I am curious which block storages you have tested. S3 seems to be
> > > one of the popular block stores. The concerns that I have with pull based
> > > approach are the following.
> > > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > > have 100,000 partitions and want to pull the metadata for each partition
> > at
> > > the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> > > (b) Semantics: S3 list objects are eventually consistent. So, when you
> > do a
> > > list object request, there is no guarantee that you can see all uploaded
> > > objects. This could impact the correctness of subsequent logics.
> > > (c) Efficiency: Blindly pulling metadata when there is no change adds
> > > unnecessary overhead in the broker as well as in the block store.
> > >
> > > So, have you guys tested S3? If so, could you share your experience in
> > > terms of cost, semantics and efficiency?
> > >
> > > Jun
> > >
> > >
> > > On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani 
> > wrote:
> > >
> > > > Hi Jun,
> > > >   Thanks for the reply.
> > > >
> > > >
> > > >
> > > > On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:
> > > >
> > > > > Hi, Satish and Ying,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 40/41. There are two different ways that we can approach this. One is
> > > > what
> > > > > you said. We can have an opinionated way of storing and populating
> > the
> > > > tier
> > > > > metadata that we think is good enough for everyone. I am not sure if
> > this
> > > > > is the case based on what's currently proposed in the KIP. For
> > example, I
> > > > > am not sure that (1) everyone always needs local metadata; (2) the
> > > > current
> > > > > local storage format is general enough and (3) everyone wants to use
> > the
> > > > > pull based approach to propagate the metadata. Another approach is to
> > > > make
> > > > > this pluggable and let the implementor implements the best approach
> > for a
> > > > > particular block storage. I haven't seen any comments from
> > Slack/AirBnb
> > > > in
> > > > > the mailing list on this topic. It would be great if they can provide
> > > > > feedback directly here.
> > > > >
> > > >
> > > > The current interfaces are designed with most popular block storages
> > > > available today  and we did 2 implementations with these interfaces and
> > > > they both are yielding good results as we going through the testing of
> > it.
> > > > If there is ever a need for pull based approach  we can definitely
> > evolve
> > > > the interface.
> > > > In the past we did mark interfaces to be evolving to make room for
> > unknowns
> > > > in the future.
> > > > If you have any suggestions around the current interfaces please
> > propose we
> > > > are happy to see if we can work them into it.
> > > >
> > > >
> > > > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-18 Thread Jun Rao
Hi, Satish and Ivan,

There are different ways for doing push based metadata propagation. Some
object stores may support that already. For example, S3 supports events
notification (
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
Otherwise one could use a separate metadata store that supports push-based
change propagation. Other people have mentioned using a Kafka topic. The
best approach may depend on the object store and the operational
environment (e.g. whether an external metadata store is already available).

The above discussion is based on the assumption that we need to cache the
object metadata locally in every broker. I mentioned earlier that an
alternative is to just store/retrieve those metadata in an external
metadata store. That may simplify the implementation in some cases.
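As a rough illustration of the Kafka topic option (the topic name,
serialization and cache structure below are made up for the sketch, not a
proposal): each broker could run a small consumer that applies pushed
metadata changes to a local view, instead of polling the object store.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Sketch only: a per-broker cache of remote segment metadata fed by a metadata topic.
public class PushedSegmentMetadataCache {
    private final Map<String, byte[]> segmentMetadata = new ConcurrentHashMap<>();

    public void run(String bootstrapServers, String metadataTopic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            // Every broker reads the whole topic from the beginning to build its local view.
            List<TopicPartition> partitions = consumer.partitionsFor(metadataTopic).stream()
                    .map(p -> new TopicPartition(metadataTopic, p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        segmentMetadata.remove(record.key());              // tombstone: segment deleted
                    } else {
                        segmentMetadata.put(record.key(), record.value()); // segment added or updated
                    }
                }
            }
        }
    }
}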

Thanks,

Jun

On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for your reply.
>
> Currently, `listRemoteSegments` is called at the configured
> interval(not every second, defaults to 30secs). Storing remote log
> metadata in a strongly consistent store for S3 RSM is raised in
> PR-comment[1].
> RLM invokes RSM at regular intervals and RSM can give remote segment
> metadata if it is available. RSM is responsible for maintaining and
> fetching those entries. It should be based on whatever mechanism is
> consistent and efficient with the respective remote storage.
>
> Can you give more details about push based mechanism from RSM?
>
> 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
>
> Thanks,
> Satish.
>
> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao  wrote:
> >
> > Hi, Harsha,
> >
> > Thanks for the reply.
> >
> > 40/41. I am curious which block storages you have tested. S3 seems to be
> > one of the popular block stores. The concerns that I have with pull based
> > approach are the following.
> > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > have 100,000 partitions and want to pull the metadata for each partition
> at
> > the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> > (b) Semantics: S3 list objects are eventually consistent. So, when you
> do a
> > list object request, there is no guarantee that you can see all uploaded
> > objects. This could impact the correctness of subsequent logics.
> > (c) Efficiency: Blindly pulling metadata when there is no change adds
> > unnecessary overhead in the broker as well as in the block store.
> >
> > So, have you guys tested S3? If so, could you share your experience in
> > terms of cost, semantics and efficiency?
> >
> > Jun
> >
> >
> > On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani 
> wrote:
> >
> > > Hi Jun,
> > >   Thanks for the reply.
> > >
> > >
> > >
> > > On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:
> > >
> > > > Hi, Satish and Ying,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 40/41. There are two different ways that we can approach this. One is
> > > what
> > > > you said. We can have an opinionated way of storing and populating
> the
> > > tier
> > > > metadata that we think is good enough for everyone. I am not sure if
> this
> > > > is the case based on what's currently proposed in the KIP. For
> example, I
> > > > am not sure that (1) everyone always needs local metadata; (2) the
> > > current
> > > > local storage format is general enough and (3) everyone wants to use
> the
> > > > pull based approach to propagate the metadata. Another approach is to
> > > make
> > > > this pluggable and let the implementor implements the best approach
> for a
> > > > particular block storage. I haven't seen any comments from
> Slack/AirBnb
> > > in
> > > > the mailing list on this topic. It would be great if they can provide
> > > > feedback directly here.
> > > >
> > >
> > > The current interfaces are designed with most popular block storages
> > > available today  and we did 2 implementations with these interfaces and
> > > they both are yielding good results as we going through the testing of
> it.
> > > If there is ever a need for pull based approach  we can definitely
> evolve
> > > the interface.
> > > In the past we did mark interfaces to be evolving to make room for
> unknowns
> > > in the future.
> > > If you have any suggestions around the current interfaces please
> propose we
> > > are happy to see if we can work them into it.
> > >
> > >
> > > 43. To offer tier storage as a general feature, ideally all existing
> > > > capabilities should still be supported. It's fine if the uber
> > > > implementation doesn't support all capabilities for internal usage.
> > > > However, the framework should be general enough.
> > > >
> > >
> > > We agree on that as a principle. But all of these major features mostly
> > > coming right now and to have a new big feature such as tiered storage
> to
> > > support all the new features will be a big ask. We can document on how
> do
> > > we approach solving these in future iterations.
> > > Our goal is 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-18 Thread Alexandre Dupriez
Hi all,

A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM cannot
rely solely on S3 APIs to guarantee the expected strong consistency. The
proposed implementation [3] would need to be updated to take this into
account. Let’s talk more about this.

A.2 Contract for the RSM API (an API call is loosely defined as an
“operation” here):

A.2.1 The KIP mentions “*If the process of a topic-partition is failed due
to remote storage error, its scheduled processing time is set to ( now() +
rlm_retry_interval_ms ). rlm_retry_interval_ms can be configured in broker
config file.*”. Do you still plan to implement such retries?

A.2.2 Idempotency – Are operations exposed by the RSM idempotent? What is
the risk of an operation being retried with the same input (irrespective of
the state of the remote storage)? If the same successful operation is
retried with the same input, should an error be propagated? How would this
error be discriminated from I/O or other types of failures?

A.2.3 Atomicity – what does an implementation of RSM need to provide with
respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
deleteTopicPartition? If a partial failure happens in any of those (e.g. in
the S3 implementation, if one of the multiple uploads fails [4]), what
guarantees are provided to the RLM on the state of the remote storage, and
what if it is left in an inconsistent state? If the operation is meant to be
retried from the RLM, does that mean the RSM is expected to recover from
partial failures? What if an unrecoverable failure affects the RSM? In the
RLMTask an exception is logged [5] but it seems the task continues to be
scheduled -> is there a mode where a topic partition stops being transferred
to the remote storage?

A.2.4 Consistency – already discussed.

A.2.5 Failure modes – currently the RSM propagates failures as IOExceptions.
Wouldn’t we need a slightly different contract for the RSM? As opposed to
the I/O errors which Kafka handles in its innermost layers when accessing
the file system, should implementations of the RSM deal with low-level
errors and retries (wherever they can), and not expose them via the API?
Since the RLM is agnostic of the implementation behind the RSM, it is
virtually impossible to know how to deal with an I/O type of exception
without prior assumptions about the RSM implementation in use. (For the
kind of internal retry handling I have in mind, see the small sketch after
A.3 below.)

A.3 Caching – storing locally the segments retrieved from the remote
storage is excluded as it does not align with the original intent and would
even defeat some of its purposes (saving disk space etc.). That said, could
there be other types of use cases where the pattern of access to the remotely
stored segments would benefit from local caching (and potentially
read-ahead)? Consider the use case of a large pool of consumers which start
a backfill at the same time for one day's worth of data from one year ago
stored remotely. Caching the segments locally would make it possible to
decouple the load on the remote storage from the load on the Kafka cluster.
Maybe the RLM could expose a configuration parameter to switch that feature
on/off?
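To make the A.2.5 question concrete, here is the kind of internal retry
handling I have in mind; the policy, attempt count and the notion of which
failures are retriable are purely illustrative, not something from the KIP:

import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative only: an RSM implementation could absorb transient storage errors
// internally instead of surfacing every low-level failure to the RLM.
// Assumes maxAttempts >= 1.
final class Retries {
    static <T> T withRetries(int maxAttempts, long backoffMs, Callable<T> op) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                        // e.g. one PUT of a segment file
            } catch (Exception e) {
                last = (e instanceof IOException) ? (IOException) e : new IOException(e);
                try {
                    Thread.sleep(backoffMs * attempt);   // simple linear backoff between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last;                                      // only exhausted failures reach the RLM
    }
}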

[1]
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
[2]
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Warning_.231:_S3_Consistency_model
[3] https://github.com/harshach/kafka/pull/18
[4]
https://github.com/harshach/kafka/pull/18/files#diff-39e2143514ed06d5d066708309263424R124
[5]
https://github.com/apache/kafka/pull/7561/files#diff-a597bd0c7d627789e73d1fa38eb1abfaR278

Le jeu. 5 déc. 2019 à 15:01, Satish Duggana  a
écrit :

> Hi Jun,
> Thanks for your reply.
>
> Currently, `listRemoteSegments` is called at the configured
> interval(not every second, defaults to 30secs). Storing remote log
> metadata in a strongly consistent store for S3 RSM is raised in
> PR-comment[1].
> RLM invokes RSM at regular intervals and RSM can give remote segment
> metadata if it is available. RSM is responsible for maintaining and
> fetching those entries. It should be based on whatever mechanism is
> consistent and efficient with the respective remote storage.
>
> Can you give more details about push based mechanism from RSM?
>
> 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
>
> Thanks,
> Satish.
>
> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao  wrote:
> >
> > Hi, Harsha,
> >
> > Thanks for the reply.
> >
> > 40/41. I am curious which block storages you have tested. S3 seems to be
> > one of the popular block stores. The concerns that I have with pull based
> > approach are the following.
> > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> > have 100,000 partitions and want to pull the metadata for each partition
> at
> > the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> > (b) Semantics: S3 list objects are eventually consistent. So, when you
> do a
> > list object request, there is no guarantee that you can see all uploaded
> > objects. This could impact the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-05 Thread Satish Duggana
Hi Jun,
Thanks for your reply.

Currently, `listRemoteSegments` is called at the configured
interval (not every second; it defaults to 30 secs). Storing remote log
metadata in a strongly consistent store for the S3 RSM is raised in a
PR comment [1].
The RLM invokes the RSM at regular intervals, and the RSM can give remote
segment metadata if it is available. The RSM is responsible for maintaining
and fetching those entries. It should be based on whatever mechanism is
consistent and efficient for the respective remote storage.
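For context, the pull model is essentially a periodic task per partition,
something along these lines (the scheduling details and the simplified
listRemoteSegments signature here are illustrative, not the actual
implementation):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.common.TopicPartition;

// Sketch of the pull model only; the RLM polls the RSM at the configured interval
// (30 secs by default, as mentioned above). Error handling and result processing
// are omitted, and RemoteStorageManager stands for the RSM interface in the KIP.
class RemoteLogPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void schedule(RemoteStorageManager rsm, TopicPartition tp, long intervalSecs) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                rsm.listRemoteSegments(tp);   // refresh this partition's remote segment view
            } catch (Exception e) {
                // log and try again at the next interval
            }
        }, 0, intervalSecs, TimeUnit.SECONDS);
    }
}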

Can you give more details about the push-based mechanism from the RSM?

1. https://github.com/apache/kafka/pull/7561#discussion_r344576223

Thanks,
Satish.

On Thu, Dec 5, 2019 at 4:23 AM Jun Rao  wrote:
>
> Hi, Harsha,
>
> Thanks for the reply.
>
> 40/41. I am curious which block storages you have tested. S3 seems to be
> one of the popular block stores. The concerns that I have with pull based
> approach are the following.
> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> have 100,000 partitions and want to pull the metadata for each partition at
> the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> (b) Semantics: S3 list objects are eventually consistent. So, when you do a
> list object request, there is no guarantee that you can see all uploaded
> objects. This could impact the correctness of subsequent logics.
> (c) Efficiency: Blindly pulling metadata when there is no change adds
> unnecessary overhead in the broker as well as in the block store.
>
> So, have you guys tested S3? If so, could you share your experience in
> terms of cost, semantics and efficiency?
>
> Jun
>
>
> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani  wrote:
>
> > Hi Jun,
> >   Thanks for the reply.
> >
> >
> >
> > On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:
> >
> > > Hi, Satish and Ying,
> > >
> > > Thanks for the reply.
> > >
> > > 40/41. There are two different ways that we can approach this. One is
> > what
> > > you said. We can have an opinionated way of storing and populating the
> > tier
> > > metadata that we think is good enough for everyone. I am not sure if this
> > > is the case based on what's currently proposed in the KIP. For example, I
> > > am not sure that (1) everyone always needs local metadata; (2) the
> > current
> > > local storage format is general enough and (3) everyone wants to use the
> > > pull based approach to propagate the metadata. Another approach is to
> > make
> > > this pluggable and let the implementor implements the best approach for a
> > > particular block storage. I haven't seen any comments from Slack/AirBnb
> > in
> > > the mailing list on this topic. It would be great if they can provide
> > > feedback directly here.
> > >
> >
> > The current interfaces are designed with most popular block storages
> > available today  and we did 2 implementations with these interfaces and
> > they both are yielding good results as we going through the testing of it.
> > If there is ever a need for pull based approach  we can definitely evolve
> > the interface.
> > In the past we did mark interfaces to be evolving to make room for unknowns
> > in the future.
> > If you have any suggestions around the current interfaces please propose we
> > are happy to see if we can work them into it.
> >
> >
> > 43. To offer tier storage as a general feature, ideally all existing
> > > capabilities should still be supported. It's fine if the uber
> > > implementation doesn't support all capabilities for internal usage.
> > > However, the framework should be general enough.
> > >
> >
> > We agree on that as a principle. But all of these major features mostly
> > coming right now and to have a new big feature such as tiered storage to
> > support all the new features will be a big ask. We can document on how do
> > we approach solving these in future iterations.
> > Our goal is to make this tiered storage feature work for everyone.
> >
> > 43.3 This is more than just serving the tier-ed data from block storage.
> > > With KIP-392, the consumer now can resolve the conflicts with the replica
> > > based on leader epoch. So, we need to make sure that leader epoch can be
> > > recovered properly from tier storage.
> > >
> >
> > We are working on testing our approach and we will update the KIP with
> > design details.
> >
> > 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
> > > support moving such metadata across disk directories since JBOD supports
> > > moving data across disks.
> > >
> >
> > KIP is updated with JBOD details. Having said that JBOD tooling needs to
> > evolve to support production loads. Most of the users will be interested in
> > using tiered storage without JBOD support support on day 1.
> >
> > Thanks,
> > Harsha
> >
> > As for meeting, we could have a KIP e-meeting on this if needed, but it
> > > will be open to everyone and will be recorded and shared. Often, the
> > > details are still resolved through the mailing list.

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-05 Thread Ivan Yurchenko
Hi Jun and others.

Jun,
All these are really valid concerns.
Probably we should think about backing implementations like S3 with a
metadata storage whose consistency model and pricing are better than pure
S3, maybe even a Kafka topic (I guess this might be what you refer to
as the push-based approach?)

Br,
Ivan

On Thu, 5 Dec 2019 at 00:53, Jun Rao  wrote:

> Hi, Harsha,
>
> Thanks for the reply.
>
> 40/41. I am curious which block storages you have tested. S3 seems to be
> one of the popular block stores. The concerns that I have with pull based
> approach are the following.
> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
> have 100,000 partitions and want to pull the metadata for each partition at
> the rate of 1/sec. It can cost $0.5/sec, which is roughly $40K per day.
> (b) Semantics: S3 list objects are eventually consistent. So, when you do a
> list object request, there is no guarantee that you can see all uploaded
> objects. This could impact the correctness of subsequent logics.
> (c) Efficiency: Blindly pulling metadata when there is no change adds
> unnecessary overhead in the broker as well as in the block store.
>
> So, have you guys tested S3? If so, could you share your experience in
> terms of cost, semantics and efficiency?
>
> Jun
>
>
> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani 
> wrote:
>
> > Hi Jun,
> >   Thanks for the reply.
> >
> >
> >
> > On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:
> >
> > > Hi, Satish and Ying,
> > >
> > > Thanks for the reply.
> > >
> > > 40/41. There are two different ways that we can approach this. One is
> > what
> > > you said. We can have an opinionated way of storing and populating the
> > tier
> > > metadata that we think is good enough for everyone. I am not sure if
> this
> > > is the case based on what's currently proposed in the KIP. For
> example, I
> > > am not sure that (1) everyone always needs local metadata; (2) the
> > current
> > > local storage format is general enough and (3) everyone wants to use
> the
> > > pull based approach to propagate the metadata. Another approach is to
> > make
> > > this pluggable and let the implementor implements the best approach
> for a
> > > particular block storage. I haven't seen any comments from Slack/AirBnb
> > in
> > > the mailing list on this topic. It would be great if they can provide
> > > feedback directly here.
> > >
> >
> > The current interfaces are designed with most popular block storages
> > available today  and we did 2 implementations with these interfaces and
> > they both are yielding good results as we going through the testing of
> it.
> > If there is ever a need for pull based approach  we can definitely evolve
> > the interface.
> > In the past we did mark interfaces to be evolving to make room for
> unknowns
> > in the future.
> > If you have any suggestions around the current interfaces please propose
> we
> > are happy to see if we can work them into it.
> >
> >
> > 43. To offer tier storage as a general feature, ideally all existing
> > > capabilities should still be supported. It's fine if the uber
> > > implementation doesn't support all capabilities for internal usage.
> > > However, the framework should be general enough.
> > >
> >
> > We agree on that as a principle. But all of these major features mostly
> > coming right now and to have a new big feature such as tiered storage to
> > support all the new features will be a big ask. We can document on how do
> > we approach solving these in future iterations.
> > Our goal is to make this tiered storage feature work for everyone.
> >
> > 43.3 This is more than just serving the tier-ed data from block storage.
> > > With KIP-392, the consumer now can resolve the conflicts with the
> replica
> > > based on leader epoch. So, we need to make sure that leader epoch can
> be
> > > recovered properly from tier storage.
> > >
> >
> > We are working on testing our approach and we will update the KIP with
> > design details.
> >
> > 43.4 For JBOD, if tier storage stores the tier metadata locally, we need
> to
> > > support moving such metadata across disk directories since JBOD
> supports
> > > moving data across disks.
> > >
> >
> > KIP is updated with JBOD details. Having said that JBOD tooling needs to
> > evolve to support production loads. Most of the users will be interested
> in
> > using tiered storage without JBOD support support on day 1.
> >
> > Thanks,
> > Harsha
> >
> > As for meeting, we could have a KIP e-meeting on this if needed, but it
> > > will be open to everyone and will be recorded and shared. Often, the
> > > details are still resolved through the mailing list.
> > >
> > > Jun
> > >
> > > On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng 
> > > wrote:
> > >
> > >
> > > Please ignore my previous email
> > > I didn't know Apache requires all the discussions to be "open"
> > >
> > >
> > > On Tue, Nov 19, 2019, 5:40 PM Ying Zheng  wrote:
> > >
> > > Hi Jun,
> > >
> > > Thank you 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-04 Thread Jun Rao
Hi, Harsha,

Thanks for the reply.

40/41. I am curious which block storages you have tested. S3 seems to be
one of the popular block stores. The concerns that I have with pull based
approach are the following.
(a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
have 100,000 partitions and want to pull the metadata for each partition at
the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
(b) Semantics: S3 list objects are eventually consistent. So, when you do a
list object request, there is no guarantee that you can see all uploaded
objects. This could impact the correctness of subsequent logic.
(c) Efficiency: Blindly pulling metadata when there is no change adds
unnecessary overhead in the broker as well as in the block store.
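(Spelling out the arithmetic in (a): 100,000 requests/sec * $0.005 / 1000 =
$0.5/sec, and $0.5/sec * 86,400 sec/day is about $43,200 per day.)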

So, have you guys tested S3? If so, could you share your experience in
terms of cost, semantics and efficiency?

Jun


On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani  wrote:

> Hi Jun,
>   Thanks for the reply.
>
>
>
> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:
>
> > Hi, Satish and Ying,
> >
> > Thanks for the reply.
> >
> > 40/41. There are two different ways that we can approach this. One is
> what
> > you said. We can have an opinionated way of storing and populating the
> tier
> > metadata that we think is good enough for everyone. I am not sure if this
> > is the case based on what's currently proposed in the KIP. For example, I
> > am not sure that (1) everyone always needs local metadata; (2) the
> current
> > local storage format is general enough and (3) everyone wants to use the
> > pull based approach to propagate the metadata. Another approach is to
> make
> > this pluggable and let the implementor implements the best approach for a
> > particular block storage. I haven't seen any comments from Slack/AirBnb
> in
> > the mailing list on this topic. It would be great if they can provide
> > feedback directly here.
> >
>
> The current interfaces are designed with most popular block storages
> available today  and we did 2 implementations with these interfaces and
> they both are yielding good results as we going through the testing of it.
> If there is ever a need for pull based approach  we can definitely evolve
> the interface.
> In the past we did mark interfaces to be evolving to make room for unknowns
> in the future.
> If you have any suggestions around the current interfaces please propose we
> are happy to see if we can work them into it.
>
>
> 43. To offer tier storage as a general feature, ideally all existing
> > capabilities should still be supported. It's fine if the uber
> > implementation doesn't support all capabilities for internal usage.
> > However, the framework should be general enough.
> >
>
> We agree on that as a principle. But all of these major features mostly
> coming right now and to have a new big feature such as tiered storage to
> support all the new features will be a big ask. We can document on how do
> we approach solving these in future iterations.
> Our goal is to make this tiered storage feature work for everyone.
>
> 43.3 This is more than just serving the tier-ed data from block storage.
> > With KIP-392, the consumer now can resolve the conflicts with the replica
> > based on leader epoch. So, we need to make sure that leader epoch can be
> > recovered properly from tier storage.
> >
>
> We are working on testing our approach and we will update the KIP with
> design details.
>
> 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
> > support moving such metadata across disk directories since JBOD supports
> > moving data across disks.
> >
>
> KIP is updated with JBOD details. Having said that JBOD tooling needs to
> evolve to support production loads. Most of the users will be interested in
> using tiered storage without JBOD support support on day 1.
>
> Thanks,
> Harsha
>
> As for meeting, we could have a KIP e-meeting on this if needed, but it
> > will be open to everyone and will be recorded and shared. Often, the
> > details are still resolved through the mailing list.
> >
> > Jun
> >
> > On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng 
> > wrote:
> >
> >
> > Please ignore my previous email
> > I didn't know Apache requires all the discussions to be "open"
> >
> >
> > On Tue, Nov 19, 2019, 5:40 PM Ying Zheng  wrote:
> >
> > Hi Jun,
> >
> > Thank you very much for your feedback!
> >
> > Can we schedule a meeting in your Palo Alto office in December? I think a
> > face to face discussion is much more efficient than emails. Both Harsha
> >
> > and
> >
> > I can visit you. Satish may be able to join us remotely.
> >
> > On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:
> >
> > Hi, Satish and Harsha,
> >
> > The following is a more detailed high level feedback for the KIP.
> >
> > Overall,
> >
> > the KIP seems useful. The challenge is how to design it such that it’s
> > general enough to support different ways of implementing this feature
> >
> > and
> >
> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-12-03 Thread Harsha Chintalapani
Hi Jun,
  Thanks for the reply.



On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao  wrote:

> Hi, Satish and Ying,
>
> Thanks for the reply.
>
> 40/41. There are two different ways that we can approach this. One is what
> you said. We can have an opinionated way of storing and populating the tier
> metadata that we think is good enough for everyone. I am not sure if this
> is the case based on what's currently proposed in the KIP. For example, I
> am not sure that (1) everyone always needs local metadata; (2) the current
> local storage format is general enough and (3) everyone wants to use the
> pull based approach to propagate the metadata. Another approach is to make
> this pluggable and let the implementor implements the best approach for a
> particular block storage. I haven't seen any comments from Slack/AirBnb in
> the mailing list on this topic. It would be great if they can provide
> feedback directly here.
>

The current interfaces are designed with the most popular block storages
available today, and we did 2 implementations with these interfaces; both
are yielding good results as we go through the testing.
If there is ever a need for a pull-based approach, we can definitely evolve
the interface.
In the past we did mark interfaces as evolving to make room for unknowns
in the future.
If you have any suggestions around the current interfaces, please propose
them; we are happy to see if we can work them in.


43. To offer tier storage as a general feature, ideally all existing
> capabilities should still be supported. It's fine if the uber
> implementation doesn't support all capabilities for internal usage.
> However, the framework should be general enough.
>

We agree on that as a principle. But all of these major features are mostly
landing right now, and asking a new big feature such as tiered storage to
support all of them from day one is a big ask. We can document how we will
approach solving these in future iterations.
Our goal is to make this tiered storage feature work for everyone.

43.3 This is more than just serving the tier-ed data from block storage.
> With KIP-392, the consumer now can resolve the conflicts with the replica
> based on leader epoch. So, we need to make sure that leader epoch can be
> recovered properly from tier storage.
>

We are working on testing our approach and we will update the KIP with
design details.

43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
> support moving such metadata across disk directories since JBOD supports
> moving data across disks.
>

The KIP is updated with JBOD details. Having said that, JBOD tooling needs to
evolve to support production loads. Most users will be interested in
using tiered storage without JBOD support on day 1.

Thanks,
Harsha

As for meeting, we could have a KIP e-meeting on this if needed, but it
> will be open to everyone and will be recorded and shared. Often, the
> details are still resolved through the mailing list.
>
> Jun
>
> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng 
> wrote:
>
>
> Please ignore my previous email
> I didn't know Apache requires all the discussions to be "open"
>
>
> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng  wrote:
>
> Hi Jun,
>
> Thank you very much for your feedback!
>
> Can we schedule a meeting in your Palo Alto office in December? I think a
> face to face discussion is much more efficient than emails. Both Harsha
>
> and
>
> I can visit you. Satish may be able to join us remotely.
>
> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:
>
> Hi, Satish and Harsha,
>
> The following is a more detailed high level feedback for the KIP.
>
> Overall,
>
> the KIP seems useful. The challenge is how to design it such that it’s
> general enough to support different ways of implementing this feature
>
> and
>
> support existing features.
>
> 40. Local segment metadata storage: The KIP makes the assumption that
>
> the
>
> metadata for the archived log segments are cached locally in every
>
> broker
>
> and provides a specific implementation for the local storage in the
> framework. We probably should discuss this more. For example, some tier
> storage providers may not want to cache the metadata locally and just
>
> rely
>
>
> upon a remote key/value store if such a store is already present. If a
> local store is used, there could be different ways of implementing it
> (e.g., based on customized local files, an embedded local store like
> RocksDB, etc). An alternative of designing this is to just provide an
> interface for retrieving the tier segment metadata and leave the details of
> how to get the metadata outside of the framework.
>
>
> 41. RemoteStorageManager interface and the usage of the interface in the
> framework: I am not sure if the interface is general enough. For
>
> example,
>
> it seems that RemoteLogIndexEntry is tied to a specific way of storing
>
> the
>
> metadata in remote storage. The framework uses listRemoteSegments() api
>
> in
>
>
> a 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-26 Thread Jun Rao
Hi, Satish and Ying,

Thanks for the reply.

40/41. There are two different ways that we can approach this. One is what
you said. We can have an opinionated way of storing and populating the tier
metadata that we think is good enough for everyone. I am not sure if this
is the case based on what's currently proposed in the KIP. For example, I
am not sure that (1) everyone always needs local metadata; (2) the current
local storage format is general enough; and (3) everyone wants to use the
pull-based approach to propagate the metadata. Another approach is to make
this pluggable and let the implementor implement the best approach for a
particular block storage. I haven't seen any comments from Slack/AirBnb in
the mailing list on this topic. It would be great if they can provide
feedback directly here.

43. To offer tier storage as a general feature, ideally all existing
capabilities should still be supported. It's fine if the uber
implementation doesn't support all capabilities for internal usage.
However, the framework should be general enough.
43.3 This is more than just serving the tiered data from block storage.
With KIP-392, the consumer can now resolve conflicts with the replica
based on leader epoch. So, we need to make sure that the leader epoch can be
recovered properly from tier storage.
43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
support moving such metadata across disk directories since JBOD supports
moving data across disks.

As for meeting, we could have a KIP e-meeting on this if needed, but it
will be open to everyone and will be recorded and shared. Often, the
details are still resolved through the mailing list.

Jun



On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng  wrote:

> Please ignore my previous email
> I didn't know Apache requires all the discussions to be "open"
>
> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng  wrote:
>
> > Hi Jun,
> >
> > Thank you very much for your feedback!
> >
> > Can we schedule a meeting in your Palo Alto office in December? I think a
> > face to face discussion is much more efficient than emails. Both Harsha
> and
> > I can visit you. Satish may be able to join us remotely.
> >
> > On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:
> >
> >> Hi, Satish and Harsha,
> >>
> >> The following is a more detailed high level feedback for the KIP.
> Overall,
> >> the KIP seems useful. The challenge is how to design it such that it’s
> >> general enough to support different ways of implementing this feature
> and
> >> support existing features.
> >>
> >> 40. Local segment metadata storage: The KIP makes the assumption that
> the
> >> metadata for the archived log segments are cached locally in every
> broker
> >> and provides a specific implementation for the local storage in the
> >> framework. We probably should discuss this more. For example, some tier
> >> storage providers may not want to cache the metadata locally and just
> rely
> >> upon a remote key/value store if such a store is already present. If a
> >> local store is used, there could be different ways of implementing it
> >> (e.g., based on customized local files, an embedded local store like
> >> RocksDB, etc). An alternative of designing this is to just provide an
> >> interface for retrieving the tier segment metadata and leave the details
> >> of
> >> how to get the metadata outside of the framework.
> >>
> >> 41. RemoteStorageManager interface and the usage of the interface in the
> >> framework: I am not sure if the interface is general enough.  For
> example,
> >> it seems that RemoteLogIndexEntry is tied to a specific way of storing
> the
> >> metadata in remote storage. The framework uses listRemoteSegments() api
> in
> >> a pull based approach. However, in some other implementations, a push
> >> based
> >> approach may be more preferred. I don’t have a concrete proposal yet.
> But,
> >> it would be useful to give this area some more thoughts and see if we
> can
> >> make the interface more general.
> >>
> >> 42. In the diagram, the RemoteLogManager is side by side with
> LogManager.
> >> This KIP only discussed how the fetch request is handled between the two
> >> layer. However, we should also consider how other requests that touch
> the
> >> log can be handled. e.g., list offsets by timestamp, delete records,
> etc.
> >> Also, in this model, it's not clear which component is responsible for
> >> managing the log start offset. It seems that the log start offset could
> be
> >> changed by both RemoteLogManager and LogManager.
> >>
> >> 43. There are quite a few existing features not covered by the KIP. It
> >> would be useful to discuss each of those.
> >> 43.1 I won’t say that compacted topics are rarely used and always small.
> >> For example, KStreams uses compacted topics for storing the states and
> >> sometimes the size of the topic could be large. While it might be ok to
> >> not
> >> support compacted topics initially, it would be useful to have a high
> >> level

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-19 Thread Ying Zheng
Please ignore my previous email
I didn't know Apache requires all the discussions to be "open"

On Tue, Nov 19, 2019, 5:40 PM Ying Zheng  wrote:

> Hi Jun,
>
> Thank you very much for your feedback!
>
> Can we schedule a meeting in your Palo Alto office in December? I think a
> face to face discussion is much more efficient than emails. Both Harsha and
> I can visit you. Satish may be able to join us remotely.
>
> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:
>
>> Hi, Satish and Harsha,
>>
>> The following is a more detailed high level feedback for the KIP. Overall,
>> the KIP seems useful. The challenge is how to design it such that it’s
>> general enough to support different ways of implementing this feature and
>> support existing features.
>>
>> 40. Local segment metadata storage: The KIP makes the assumption that the
>> metadata for the archived log segments are cached locally in every broker
>> and provides a specific implementation for the local storage in the
>> framework. We probably should discuss this more. For example, some tier
>> storage providers may not want to cache the metadata locally and just rely
>> upon a remote key/value store if such a store is already present. If a
>> local store is used, there could be different ways of implementing it
>> (e.g., based on customized local files, an embedded local store like
>> RocksDB, etc). An alternative of designing this is to just provide an
>> interface for retrieving the tier segment metadata and leave the details
>> of
>> how to get the metadata outside of the framework.
>>
>> 41. RemoteStorageManager interface and the usage of the interface in the
>> framework: I am not sure if the interface is general enough.  For example,
>> it seems that RemoteLogIndexEntry is tied to a specific way of storing the
>> metadata in remote storage. The framework uses listRemoteSegments() api in
>> a pull based approach. However, in some other implementations, a push
>> based
>> approach may be more preferred. I don’t have a concrete proposal yet. But,
>> it would be useful to give this area some more thoughts and see if we can
>> make the interface more general.
>>
>> 42. In the diagram, the RemoteLogManager is side by side with LogManager.
>> This KIP only discussed how the fetch request is handled between the two
>> layer. However, we should also consider how other requests that touch the
>> log can be handled. e.g., list offsets by timestamp, delete records, etc.
>> Also, in this model, it's not clear which component is responsible for
>> managing the log start offset. It seems that the log start offset could be
>> changed by both RemoteLogManager and LogManager.
>>
>> 43. There are quite a few existing features not covered by the KIP. It
>> would be useful to discuss each of those.
>> 43.1 I won’t say that compacted topics are rarely used and always small.
>> For example, KStreams uses compacted topics for storing the states and
>> sometimes the size of the topic could be large. While it might be ok to
>> not
>> support compacted topics initially, it would be useful to have a high
>> level
>> idea on how this might be supported down the road so that we don’t have to
>> make incompatible API changes in the future.
>> 43.2 We need to discuss how EOS is supported. In particular, how is the
>> producer state integrated with the remote storage.
>> 43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
>> implemented, we need to discuss how reading from a follower replica is
>> supported with tier storage.
>> 43.4 We need to discuss how JBOD is supported with tier storage.
>>
>> Thanks,
>>
>> Jun
>>
>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley  wrote:
>>
>> > Thanks for those insights Ying.
>> >
>> > On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng 
>> wrote:
>> >
>> > > >
>> > > >
>> > > >
>> > > > Thanks, I missed that point. However, there's still a point at which
>> > the
>> > > > consumer fetches start getting served from remote storage (even if
>> that
>> > > > point isn't as soon as the local log retention time/size). This
>> > > represents
>> > > > a kind of performance cliff edge and what I'm really interested in
>> is
>> > how
>> > > > easy it is for a consumer which falls off that cliff to catch up
>> and so
>> > > its
>> > > > fetches again come from local storage. Obviously this can depend on
>> all
>> > > > sorts of factors (like production rate, consumption rate), so it's
>> not
>> > > > guaranteed (just like it's not guaranteed for Kafka today), but this
>> > > would
>> > > > represent a new failure mode.
>> > > >
>> > >
>> > >  As I have explained in the last mail, it's a very rare case that a
>> > > consumer
>> > > need to read remote data. With our experience at Uber, this only
>> happens
>> > > when the consumer service had an outage for several hours.
>> > >
>> > > There is not a "performance cliff" as you assume. The remote storage
>> is
>> > > even faster than local disks in terms of bandwidth. Reading from
>> remote
>> 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-19 Thread Ying Zheng
Hi Jun,

Thank you very much for your feedback!

Can we schedule a meeting in your Palo Alto office in December? I think a
face to face discussion is much more efficient than emails. Both Harsha and
I can visit you. Satish may be able to join us remotely.

On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:

> Hi, Satish and Harsha,
>
> The following is a more detailed high level feedback for the KIP. Overall,
> the KIP seems useful. The challenge is how to design it such that it’s
> general enough to support different ways of implementing this feature and
> support existing features.
>
> 40. Local segment metadata storage: The KIP makes the assumption that the
> metadata for the archived log segments are cached locally in every broker
> and provides a specific implementation for the local storage in the
> framework. We probably should discuss this more. For example, some tier
> storage providers may not want to cache the metadata locally and just rely
> upon a remote key/value store if such a store is already present. If a
> local store is used, there could be different ways of implementing it
> (e.g., based on customized local files, an embedded local store like
> RocksDB, etc). An alternative of designing this is to just provide an
> interface for retrieving the tier segment metadata and leave the details of
> how to get the metadata outside of the framework.
>
> 41. RemoteStorageManager interface and the usage of the interface in the
> framework: I am not sure if the interface is general enough.  For example,
> it seems that RemoteLogIndexEntry is tied to a specific way of storing the
> metadata in remote storage. The framework uses listRemoteSegments() api in
> a pull based approach. However, in some other implementations, a push based
> approach may be more preferred. I don’t have a concrete proposal yet. But,
> it would be useful to give this area some more thoughts and see if we can
> make the interface more general.
>
> 42. In the diagram, the RemoteLogManager is side by side with LogManager.
> This KIP only discussed how the fetch request is handled between the two
> layer. However, we should also consider how other requests that touch the
> log can be handled. e.g., list offsets by timestamp, delete records, etc.
> Also, in this model, it's not clear which component is responsible for
> managing the log start offset. It seems that the log start offset could be
> changed by both RemoteLogManager and LogManager.
>
> 43. There are quite a few existing features not covered by the KIP. It
> would be useful to discuss each of those.
> 43.1 I won’t say that compacted topics are rarely used and always small.
> For example, KStreams uses compacted topics for storing the states and
> sometimes the size of the topic could be large. While it might be ok to not
> support compacted topics initially, it would be useful to have a high level
> idea on how this might be supported down the road so that we don’t have to
> make incompatible API changes in the future.
> 43.2 We need to discuss how EOS is supported. In particular, how is the
> producer state integrated with the remote storage.
> 43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
> implemented, we need to discuss how reading from a follower replica is
> supported with tier storage.
> 43.4 We need to discuss how JBOD is supported with tier storage.
>
> Thanks,
>
> Jun
>
> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley  wrote:
>
> > Thanks for those insights Ying.
> >
> > On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng 
> wrote:
> >
> > > >
> > > >
> > > >
> > > > Thanks, I missed that point. However, there's still a point at which
> > the
> > > > consumer fetches start getting served from remote storage (even if
> that
> > > > point isn't as soon as the local log retention time/size). This
> > > represents
> > > > a kind of performance cliff edge and what I'm really interested in is
> > how
> > > > easy it is for a consumer which falls off that cliff to catch up and
> so
> > > its
> > > > fetches again come from local storage. Obviously this can depend on
> all
> > > > sorts of factors (like production rate, consumption rate), so it's
> not
> > > > guaranteed (just like it's not guaranteed for Kafka today), but this
> > > would
> > > > represent a new failure mode.
> > > >
> > >
> > >  As I have explained in the last mail, it's a very rare case that a
> > > consumer
> > > need to read remote data. With our experience at Uber, this only
> happens
> > > when the consumer service had an outage for several hours.
> > >
> > > There is not a "performance cliff" as you assume. The remote storage is
> > > even faster than local disks in terms of bandwidth. Reading from remote
> > > storage is going to have higher latency than local disk. But since the
> > > consumer
> > > is catching up several hours data, it's not sensitive to the sub-second
> > > level
> > > latency, and each remote read request will read a large amount of data
> to
> > > make the 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-19 Thread Ying Zheng
On Fri, Nov 15, 2019 at 11:04 AM Jun Rao  wrote:

> Hi, Satish and Harsha,
>
> The following is a more detailed high level feedback for the KIP. Overall,
> the KIP seems useful. The challenge is how to design it such that it’s
> general enough to support different ways of implementing this feature and
> support existing features.
>
> 40. Local segment metadata storage: The KIP makes the assumption that the
> metadata for the archived log segments are cached locally in every broker
> and provides a specific implementation for the local storage in the
> framework. We probably should discuss this more. For example, some tier
> storage providers may not want to cache the metadata locally and just rely
> upon a remote key/value store if such a store is already present. If a
> local store is used, there could be different ways of implementing it
> (e.g., based on customized local files, an embedded local store like
> RocksDB, etc). An alternative of designing this is to just provide an
> interface for retrieving the tier segment metadata and leave the details of
> how to get the metadata outside of the framework.
>

[Ying]
Early this year, when we first started designing tiered storage, we did plan
to make RemoteLogManager a Kafka plugin, so that there could be totally
different implementations of tiered storage.

However, one piece of feedback we received from the community is that
developing RemoteLogManager implementations is too hard for most potential
users. People actually prefer one standard implementation that can satisfy
most of the requirements.

We accepted that feedback, and decided to trade some of the flexibility for
simplicity in the 1st version. It's still possible to allow users to provide
different implementations in the future.

We had discussions with different companies (e.g. Slack, Airbnb) that are
interested in tiered storage. Our conclusion is that the current design (a
standard RemoteLogManager that caches remote metadata locally + HDFS and S3
RemoteStorageManagers) is good enough for all of the companies we have
talked with.

We don't have much knowledge about use cases outside the Internet industry.
Do any consumers of Confluent need to manage the metadata in different ways?





> 43. There are quite a few existing features not covered by the KIP. It
> would be useful to discuss each of those.
> 43.1 I won’t say that compacted topics are rarely used and always small.
> For example, KStreams uses compacted topics for storing the states and
> sometimes the size of the topic could be large. While it might be ok to not
> support compacted topics initially, it would be useful to have a high level
> idea on how this might be supported down the road so that we don’t have to
> make incompatible API changes in the future.
> 43.2 We need to discuss how EOS is supported. In particular, how is the
> producer state integrated with the remote storage.
> 43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
> implemented, we need to discuss how reading from a follower replica is
> supported with tier storage.
> 43.4 We need to discuss how JBOD is supported with tier storage.
>
[Ying]
Support for compacted topics and EOS is definitely possible. We will
discuss the possible design in the KIP.

But for the 1st version, we prefer to focus on a relatively small scope and
develop a simple, just-enough solution for most users. Most features
will be gradually added in future releases.

For compacted topics, we can save a new version of the remote segment files
after each compaction. The old remote version will be deleted after the new
version is available on remote storage.

For EOS, the snapshots can also be shipped to remote storage.

KIP-392 will be supported in the 1st version of tiered storage. We will add
the design details in the KIP.

JBOD in remote storage is provided by the remote storage system
(e.g. HDFS, S3). This should be totally transparent to Kafka.

Tiered storage will make Kafka's local storage much smaller, and make
JBOD for local storage less needed. We should be able to support JBOD
in local storage in the future. This shouldn't require any changes in RSM,
because only Kafka and RemoteLogManager talk to local storage.
So, there shouldn't be any compatibility issues when we support local
storage JBOD in the next version.



> Thanks,
>
> Jun
>
>
>


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-18 Thread Satish Duggana
Hi Jun,
Thanks for your detailed review and comments.

>40. Local segment metadata storage: The KIP makes the assumption that the
metadata for the archived log segments are cached locally in every broker
and provides a specific implementation for the local storage in the
framework. We probably should discuss this more. For example, some tier
storage providers may not want to cache the metadata locally and just rely
upon a remote key/value store if such a store is already present. If a
local store is used, there could be different ways of implementing it
(e.g., based on customized local files, an embedded local store like
RocksDB, etc). An alternative of designing this is to just provide an
interface for retrieving the tier segment metadata and leave the details of
how to get the metadata outside of the framework.

I am fine with giving a way for the RSM implementor to handle remote log
metadata. But we should provide a default implementation that RSM
implementors can reuse. The default implementation can store the metadata
locally, as mentioned in the KIP.

>41. RemoteStorageManager interface and the usage of the interface in the
framework: I am not sure if the interface is general enough.  For example,
it seems that RemoteLogIndexEntry is tied to a specific way of storing the
metadata in remote storage. The framework uses listRemoteSegments() api in
a pull-based approach. However, in some other implementations, a push-based
approach may be more preferred. I don’t have a concrete proposal yet. But,
it would be useful to give this area some more thoughts and see if we can
make the interface more general.

RemoteLogIndexEntry is aligned with the record batch representation, and it
also gives a generalized representation, through the RDI, of the location
of that batch in remote storage. If there are use cases to represent them
in a different way, then we can introduce an interface and refactor the
current RemoteLogIndexEntry into the default implementation.
The listRemoteSegments() API is for getting metadata about a specific topic
partition's remote log storage. We thought RemoteLogManager should do as
much of the heavy lifting as it can, and it should use RemoteStorageManager
whenever it needs to retrieve remote log metadata/data. We can start with
this approach in the initial version. If there are valid use cases for a
push-based mechanism, we can add it later.


>42. In the diagram, the RemoteLogManager is side by side with LogManager.
This KIP only discussed how the fetch request is handled between the two
layer. However, we should also consider how other requests that touch the
log can be handled. e.g., list offsets by timestamp, delete records, etc.
Also, in this model, it's not clear which component is responsible for
managing the log start offset. It seems that the log start offset could be
changed by both RemoteLogManager and LogManager.

Sure, we will add more details in the KIP about how the different request
APIs that touch the log are handled.
With tiered storage, the log will have local-log-start-offset,
remote-log-start-offset and effective-log-start-offset.
The existing log-start-offset field is the effective-log-start-offset of the
Log:
effective-log-start-offset = if (remote-log exists) remote-log-start-offset
else local-log-start-offset.
The Log still manages log-start-offset, but it can be updated by
RemoteLogManager if tiering is enabled.
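
A minimal sketch of that rule, with illustrative names (not the actual Log
internals):

// Sketch only: effective-log-start-offset resolution as described above.
// Field and method names are illustrative, not the actual kafka.log.Log code.
class StartOffsets {
    private volatile Long remoteLogStartOffset; // null until a remote log exists
    private volatile long localLogStartOffset;

    // effective-log-start-offset = remote-log-start-offset if a remote log
    // exists, otherwise local-log-start-offset.
    long effectiveLogStartOffset() {
        Long remote = remoteLogStartOffset;
        return remote != null ? remote : localLogStartOffset;
    }

    // RemoteLogManager may update this when tiering is enabled.
    void updateRemoteLogStartOffset(long offset) {
        this.remoteLogStartOffset = offset;
    }
}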


>43. There are quite a few existing features not covered by the KIP. It
would be useful to discuss each of those.
>43.1 I won’t say that compacted topics are rarely used and always small.
For example, KStreams uses compacted topics for storing the states and
sometimes the size of the topic could be large. While it might be ok to not
support compacted topics initially, it would be useful to have a high level
idea on how this might be supported down the road so that we don’t have to
make incompatible API changes in the future.

As you know, any new APIs will evolve over the next couple of versions; they
may even be incompatible until stabilized. But we will design the new APIs
with the possible use cases thought through.
We can discuss a high-level idea of how compacted topics can be supported,
but this is a lower priority for now.

>43.2 We need to discuss how EOS is supported. In particular, how is the
producer state integrated with the remote storage.

Right, EOS needs producer state snapshots of the log segments. These
snapshots can be maintained in remote storage, like the offset and time
indexes. I will update the KIP with the details.

>43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
implemented, we need to discuss how reading from a follower replica is
supported with tier storage.

We plan to support consumer fetch requests on follower replicas with
remote log segments. The remote log contains only the committed
records (up to the log-stable-offset); this constraint allows us to support
the ask here. I will update the KIP to make it clear that this is
supported.

>43.4 We 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-15 Thread Jun Rao
Hi, Satish and Harsha,

The following is a more detailed high level feedback for the KIP. Overall,
the KIP seems useful. The challenge is how to design it such that it’s
general enough to support different ways of implementing this feature and
support existing features.

40. Local segment metadata storage: The KIP makes the assumption that the
metadata for the archived log segments are cached locally in every broker
and provides a specific implementation for the local storage in the
framework. We probably should discuss this more. For example, some tier
storage providers may not want to cache the metadata locally and just rely
upon a remote key/value store if such a store is already present. If a
local store is used, there could be different ways of implementing it
(e.g., based on customized local files, an embedded local store like
RocksDB, etc). An alternative of designing this is to just provide an
interface for retrieving the tier segment metadata and leave the details of
how to get the metadata outside of the framework.

41. RemoteStorageManager interface and the usage of the interface in the
framework: I am not sure if the interface is general enough.  For example,
it seems that RemoteLogIndexEntry is tied to a specific way of storing the
metadata in remote storage. The framework uses listRemoteSegments() api in
a pull based approach. However, in some other implementations, a push based
approach may be more preferred. I don’t have a concrete proposal yet. But,
it would be useful to give this area some more thoughts and see if we can
make the interface more general.

42. In the diagram, the RemoteLogManager is side by side with LogManager.
This KIP only discusses how the fetch request is handled between the two
layers. However, we should also consider how other requests that touch the
log can be handled. e.g., list offsets by timestamp, delete records, etc.
Also, in this model, it's not clear which component is responsible for
managing the log start offset. It seems that the log start offset could be
changed by both RemoteLogManager and LogManager.

43. There are quite a few existing features not covered by the KIP. It
would be useful to discuss each of those.
43.1 I won’t say that compacted topics are rarely used and always small.
For example, KStreams uses compacted topics for storing the states and
sometimes the size of the topic could be large. While it might be ok to not
support compacted topics initially, it would be useful to have a high level
idea on how this might be supported down the road so that we don’t have to
make incompatible API changes in the future.
43.2 We need to discuss how EOS is supported. In particular, how is the
producer state integrated with the remote storage.
43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
implemented, we need to discuss how reading from a follower replica is
supported with tier storage.
43.4 We need to discuss how JBOD is supported with tier storage.

Thanks,

Jun

On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley  wrote:

> Thanks for those insights Ying.
>
> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng  wrote:
>
> > >
> > >
> > >
> > > Thanks, I missed that point. However, there's still a point at which
> the
> > > consumer fetches start getting served from remote storage (even if that
> > > point isn't as soon as the local log retention time/size). This
> > represents
> > > a kind of performance cliff edge and what I'm really interested in is
> how
> > > easy it is for a consumer which falls off that cliff to catch up and so
> > its
> > > fetches again come from local storage. Obviously this can depend on all
> > > sorts of factors (like production rate, consumption rate), so it's not
> > > guaranteed (just like it's not guaranteed for Kafka today), but this
> > would
> > > represent a new failure mode.
> > >
> >
> >  As I have explained in the last mail, it's a very rare case that a
> > consumer
> > need to read remote data. With our experience at Uber, this only happens
> > when the consumer service had an outage for several hours.
> >
> > There is not a "performance cliff" as you assume. The remote storage is
> > even faster than local disks in terms of bandwidth. Reading from remote
> > storage is going to have higher latency than local disk. But since the
> > consumer
> > is catching up several hours data, it's not sensitive to the sub-second
> > level
> > latency, and each remote read request will read a large amount of data to
> > make the overall performance better than reading from local disks.
> >
> >
> >
> > > Another aspect I'd like to understand better is the effect of serving
> > fetch
> > > request from remote storage has on the broker's network utilization. If
> > > we're just trimming the amount of data held locally (without increasing
> > the
> > > overall local+remote retention), then we're effectively trading disk
> > > bandwidth for network bandwidth when serving fetch requests from remote
> > > storage (which I 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-08 Thread Tom Bentley
Thanks for those insights Ying.

On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng  wrote:

> >
> >
> >
> > Thanks, I missed that point. However, there's still a point at which the
> > consumer fetches start getting served from remote storage (even if that
> > point isn't as soon as the local log retention time/size). This
> represents
> > a kind of performance cliff edge and what I'm really interested in is how
> > easy it is for a consumer which falls off that cliff to catch up and so
> its
> > fetches again come from local storage. Obviously this can depend on all
> > sorts of factors (like production rate, consumption rate), so it's not
> > guaranteed (just like it's not guaranteed for Kafka today), but this
> would
> > represent a new failure mode.
> >
>
>  As I have explained in the last mail, it's a very rare case that a
> consumer
> need to read remote data. With our experience at Uber, this only happens
> when the consumer service had an outage for several hours.
>
> There is not a "performance cliff" as you assume. The remote storage is
> even faster than local disks in terms of bandwidth. Reading from remote
> storage is going to have higher latency than local disk. But since the
> consumer
> is catching up several hours data, it's not sensitive to the sub-second
> level
> latency, and each remote read request will read a large amount of data to
> make the overall performance better than reading from local disks.
>
>
>
> > Another aspect I'd like to understand better is the effect of serving
> fetch
> > request from remote storage has on the broker's network utilization. If
> > we're just trimming the amount of data held locally (without increasing
> the
> > overall local+remote retention), then we're effectively trading disk
> > bandwidth for network bandwidth when serving fetch requests from remote
> > storage (which I understand to be a good thing, since brokers are
> > often/usually disk bound). But if we're increasing the overall
> local+remote
> > retention then it's more likely that network itself becomes the
> bottleneck.
> > I appreciate this is all rather hand wavy, I'm just trying to understand
> > how this would affect broker performance, so I'd be grateful for any
> > insights you can offer.
> >
> >
> Network bandwidth is a function of produce speed, it has nothing to do with
> remote retention. As long as the data is shipped to remote storage, you can
> keep the data there for 1 day or 1 year or 100 years, it doesn't consume
> any
> network resources.
>


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-07 Thread Ying Zheng
>
>
>
> Thanks, I missed that point. However, there's still a point at which the
> consumer fetches start getting served from remote storage (even if that
> point isn't as soon as the local log retention time/size). This represents
> a kind of performance cliff edge and what I'm really interested in is how
> easy it is for a consumer which falls off that cliff to catch up and so its
> fetches again come from local storage. Obviously this can depend on all
> sorts of factors (like production rate, consumption rate), so it's not
> guaranteed (just like it's not guaranteed for Kafka today), but this would
> represent a new failure mode.
>

As I have explained in the last mail, it's a very rare case that a
consumer needs to read remote data. In our experience at Uber, this only
happens when the consumer service had an outage for several hours.

There is not a "performance cliff" as you assume. The remote storage is
even faster than local disks in terms of bandwidth. Reading from remote
storage is going to have higher latency than local disk. But since the
consumer is catching up on several hours of data, it's not sensitive to the
sub-second-level latency, and each remote read request will read a large
amount of data to make the overall performance better than reading from
local disks.



> Another aspect I'd like to understand better is the effect of serving fetch
> request from remote storage has on the broker's network utilization. If
> we're just trimming the amount of data held locally (without increasing the
> overall local+remote retention), then we're effectively trading disk
> bandwidth for network bandwidth when serving fetch requests from remote
> storage (which I understand to be a good thing, since brokers are
> often/usually disk bound). But if we're increasing the overall local+remote
> retention then it's more likely that network itself becomes the bottleneck.
> I appreciate this is all rather hand wavy, I'm just trying to understand
> how this would affect broker performance, so I'd be grateful for any
> insights you can offer.
>
>
Network bandwidth is a function of produce speed; it has nothing to do with
remote retention. As long as the data is shipped to remote storage, you can
keep it there for 1 day, 1 year or 100 years without consuming any
network resources.


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-07 Thread Ying Zheng
On Wed, Nov 6, 2019 at 6:28 PM Tom Bentley  wrote:

> Hi Ying,
>
> Because only inactive segments can be shipped to remote storage, to be able
> > to ship log data as soon
> > as possible, we will roll log segment very fast (e.g. every half hour).
> >
>
> So that means a consumer which gets behind by half an hour will find its
> reads being served from remote storage.


No, the segments are shipped to remote storage as soon as possible. But
the local segment is not deleted until a configurable time (e.g. 6 hours).
The consumer request is served from local storage as long as the local
copy is still available. After 6 hours or longer, the consumer request will
be served by remote storage.
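
A rough, self-contained sketch of that routing decision; the interfaces and
names below are hypothetical placeholders, not the KIP's classes:

// Sketch: serve a fetch from local storage while the local copy is still
// retained, and fall back to remote storage only for older offsets.
interface LocalLog {
    long localLogStartOffset();
    byte[] read(long offset, int maxBytes);
}

interface RemoteLog {
    byte[] read(long offset, int maxBytes);
}

class FetchRouter {
    private final LocalLog localLog;
    private final RemoteLog remoteLog;

    FetchRouter(LocalLog localLog, RemoteLog remoteLog) {
        this.localLog = localLog;
        this.remoteLog = remoteLog;
    }

    byte[] read(long fetchOffset, int maxBytes) {
        // Local segments are kept for the configured local retention, so most
        // fetches stay local; only consumers lagging past local retention
        // (e.g. several hours behind) hit remote storage.
        if (fetchOffset >= localLog.localLogStartOffset())
            return localLog.read(fetchOffset, maxBytes);
        return remoteLog.read(fetchOffset, maxBytes);
    }
}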


> And, if I understand the proposed
> algorithm, each such consumer fetch request could result in a separate
> fetch request from the remote storage. I.e. there's no mechanism to
> amortize the cost of the fetching between multiple consumers fetching
> similar ranges?
>
>
We can have a small in-memory cache on the broker. But this is not a high
priority right now. In any normal case, a Kafka consumer should not lag by
more than several hours. Only in some very extreme cases does a Kafka
consumer have to read from remote storage. It's very rare that 2 or more
consumers read the same piece of remote data at about the same time.

(Actually the doc for RemoteStorageManager.read() says "It will read at
> least one batch, if the 1st batch size is larger than maxBytes.". Does that
> mean the broker might have to retry with increased maxBytes if the first
> request fails to read a batch? If so, how does it know how much to increase
> maxBytes by?)
>
>
No, there is no retry; it just keeps reading until a full batch is
received.
The logic is exactly the same as the existing local segment read.


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Satish Duggana
>>Depends on the implementation, the data of one segment may not necessary be
stored in a single file.
There could be a maximum object / chunk / file size restriction on the
remote storage. So, one Kafka
segment could be saved in multiple chunks in remote storage.

>Having one local segment can be stored in multiple files and each file
can have a base position as part of the metadata(like name) of file or
object etc.
File/object name can be <startOffset>-<endOffset>-<basePosition>. So
any read request for a position within that segment can be found by
computing the relative position, viz. `fetchPosition - basePosition`.

Let me elaborate further on how to address a single local segment file
being copied to multiple files/blocks in remote storage without the
need to map local segment positions to remote segment positions.
Let us say a local segment file has offsets from 1000-95000. This may
be copied to remote storage in multiple files/blocks. Each file or
block can be created with a name or other metadata containing
<startOffset>-<endOffset>-<basePosition>. This does not require
recomputing positions for the remote segments.

local segment file has offsets: 1000 - 95000

remote segment file suffix format can be:
<startOffset>-<endOffset>-<basePosition>
remote-segment-file-1: 1000-20200-0
remote-segment-file-2: 20201-45003-942346
remote-segment-file-3: 45004-78008-6001235
remote-segment-file-4: 78009-95000-20024761

If a read comes for offset 52340 at position 7321236, the relative
position in remote-segment-file-3 is: 7321236 - 6001235 = 1320001
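
A small self-contained sketch of that lookup, using the chunk naming and the
numbers from the example above (class and method names are illustrative only):

// Sketch: find the remote chunk covering a fetch offset and compute the
// relative position, per the <startOffset>-<endOffset>-<basePosition> naming.
import java.util.List;

class RemoteChunk {
    final long startOffset;
    final long endOffset;
    final long basePosition; // position of this chunk's first byte in the local segment

    RemoteChunk(long startOffset, long endOffset, long basePosition) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.basePosition = basePosition;
    }
}

class RemoteChunkLookup {
    // Chunks for a local segment covering offsets 1000-95000, as in the mail above.
    static final List<RemoteChunk> CHUNKS = List.of(
            new RemoteChunk(1000, 20200, 0),
            new RemoteChunk(20201, 45003, 942346),
            new RemoteChunk(45004, 78008, 6001235),
            new RemoteChunk(78009, 95000, 20024761));

    static long relativePosition(long fetchOffset, long fetchPosition) {
        for (RemoteChunk c : CHUNKS) {
            if (fetchOffset >= c.startOffset && fetchOffset <= c.endOffset)
                return fetchPosition - c.basePosition; // e.g. 7321236 - 6001235 = 1320001
        }
        throw new IllegalArgumentException("offset not in this segment: " + fetchOffset);
    }
}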

Thanks,
Satish.

On Thu, Nov 7, 2019 at 7:55 AM Satish Duggana  wrote:
>
> >Depends on the implementation, the data of one segment may not necessary be
> stored in a single file.
> There could be a maximum object / chunk / file size restriction on the
> remote storage. So, one Kafka
> segment could be saved in multiple chunks in remote storage.
>
> Having one local segment can be stored in multiple files and each file
> can have a base position as part of the metadata(like name) of file or
> object etc.
> File/object name can be <startOffset>-<endOffset>-<basePosition>. So
> any read request for a position within that segment can be found by
> computing the relative position, viz. `fetchPosition - basePosition`.
>
>
>
> On Thu, Nov 7, 2019 at 6:04 AM Ying Zheng  wrote:
> >
> > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > its relationship with RemoteLogSegmentInfo. It seems
> > that RemoteLogIndexEntry are offset index entries pointing to record
> > batches inside a segment. That seems to be the same as the .index file?
> >
> > We do not assume the how the data is stored in the remote storage.
> > Depends on the implementation, the data of one segment may not necessary be
> > stored in a single file.
> > There could be a maximum object / chunk / file size restriction on the
> > remote storage. So, one Kafka
> > segment could be saved in multiple chunks in remote storage.
> >
> > The remote log index also have a larger index interval. The default
> > interval of the local .index file
> > (log.index.interval.bytes) is 4KB. In the current HDFS RSM implementation,
> > the default remote
> > index interval (hdfs.remote.index.interval.bytes) is 256KB. The
> > coarse-grained remote index saves
> > some local disk space. The smaller size also makes it more likely to be
> > cached in physical memory.
> >
> >
> >
> >
> > On Thu, Oct 31, 2019 at 1:58 PM Jun Rao  wrote:
> >
> > > Hi, Harsha,
> > >
> > > I am still looking at the KIP and the PR. A couple of quick
> > > comments/questions.
> > >
> > > 20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> > > to remove it before it's merged to trunk. As Victor mentioned, we can
> > > provide a reference implementation based on a mocked version of remote
> > > storage.
> > >
> > > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > > its relationship with RemoteLogSegmentInfo. It seems
> > > that RemoteLogIndexEntry are offset index entries pointing to record
> > > batches inside a segment. That seems to be the same as the .index file?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana 
> > > wrote:
> > >
> > > > Hi Viktor,
> > > > >1. Can we allow RLM Followers to serve read requests? After all 
> > > > >segments
> > > > on
> > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > KIP-392 (
> > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > > )
> > > > would introduce follower fetching too, so I think it would be nice to
> > > > prepare RLM for this as well.
> > > >
> > > > That is a good point. We plan to support fetching remote storage from
> > > > followers too. Current code in the PR work fine for this scenario
> > > > though there may be some edge cases to 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Tom Bentley
Hi Satish,

>So that means a consumer which gets behind by half an hour will find its
> reads being served from remote storage. And, if I understand the proposed
> algorithm, each such consumer fetch request could result in a separate
> fetch request from the remote storage. I.e. there's no mechanism to
> amortize the cost of the fetching between multiple consumers fetching
> similar ranges?
>
> local log segments are deleted according to the local
> log.retention.time/.size settings though they may have been already
> copied to remote storage. Consumers would still be able to fetch the
> messages from local storage if they are not yet deleted based on the
> retention. They will be served from remote storage only when they are
> not locally available.
>

Thanks, I missed that point. However, there's still a point at which the
consumer fetches start getting served from remote storage (even if that
point isn't as soon as the local log retention time/size). This represents
a kind of performance cliff edge and what I'm really interested in is how
easy it is for a consumer which falls off that cliff to catch up and so its
fetches again come from local storage. Obviously this can depend on all
sorts of factors (like production rate, consumption rate), so it's not
guaranteed (just like it's not guaranteed for Kafka today), but this would
represent a new failure mode.

Another aspect I'd like to understand better is the effect of serving fetch
request from remote storage has on the broker's network utilization. If
we're just trimming the amount of data held locally (without increasing the
overall local+remote retention), then we're effectively trading disk
bandwidth for network bandwidth when serving fetch requests from remote
storage (which I understand to be a good thing, since brokers are
often/usually disk bound). But if we're increasing the overall local+remote
retention then it's more likely that network itself becomes the bottleneck.
I appreciate this is all rather hand wavy, I'm just trying to understand
how this would affect broker performance, so I'd be grateful for any
insights you can offer.

Cheers,

Tom


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Satish Duggana
Hi Tom,
Sorry, I missed the other question.

>(Actually the doc for RemoteStorageManager.read() says "It will read at
least one batch, if the 1st batch size is larger than maxBytes.". Does that
mean the broker might have to retry with increased maxBytes if the first
request fails to read a batch? If so, how does it know how much to increase
maxBytes by?)

The broker or RemoteLogManager does not need to retry here.
RemoteStorageManager can return `Records` larger than maxBytes if the
first batch containing startOffset is itself larger than maxBytes.
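
A minimal sketch of that contract, with hypothetical types (not the KIP's
actual RemoteStorageManager API): accumulate complete batches up to maxBytes,
but always include the first batch even if it alone exceeds maxBytes.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class Batch {
    final int sizeInBytes;
    Batch(int sizeInBytes) { this.sizeInBytes = sizeInBytes; }
}

class BatchReader {
    // The first batch is always included even if it exceeds maxBytes, so the
    // caller never has to retry with a larger maxBytes.
    static List<Batch> read(Iterator<Batch> batchesFromStartOffset, int maxBytes) {
        List<Batch> result = new ArrayList<>();
        int bytes = 0;
        while (batchesFromStartOffset.hasNext()) {
            Batch next = batchesFromStartOffset.next();
            if (!result.isEmpty() && bytes + next.sizeInBytes > maxBytes)
                break; // stop once maxBytes would be exceeded
            result.add(next);
            bytes += next.sizeInBytes;
        }
        return result;
    }
}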

Thanks,
Satish.

On Thu, Nov 7, 2019 at 8:33 AM Satish Duggana  wrote:
>
> >So that means a consumer which gets behind by half an hour will find its
> reads being served from remote storage. And, if I understand the proposed
> algorithm, each such consumer fetch request could result in a separate
> fetch request from the remote storage. I.e. there's no mechanism to
> amortize the cost of the fetching between multiple consumers fetching
> similar ranges?
>
> local log segments are deleted according to the local
> log.retention.time/.size settings though they may have been already
> copied to remote storage. Consumers would still be able to fetch the
> messages from local storage if they are not yet deleted based on the
> retention. They will be served from remote storage only when they are
> not locally available.
>
> Thanks,
> Satish.
>
> On Thu, Nov 7, 2019 at 7:58 AM Tom Bentley  wrote:
> >
> > Hi Ying,
> >
> > Because only inactive segments can be shipped to remote storage, to be able
> > > to ship log data as soon
> > > as possible, we will roll log segment very fast (e.g. every half hour).
> > >
> >
> > So that means a consumer which gets behind by half an hour will find its
> > reads being served from remote storage. And, if I understand the proposed
> > algorithm, each such consumer fetch request could result in a separate
> > fetch request from the remote storage. I.e. there's no mechanism to
> > amortize the cost of the fetching between multiple consumers fetching
> > similar ranges?
> >
> > (Actually the doc for RemoteStorageManager.read() says "It will read at
> > least one batch, if the 1st batch size is larger than maxBytes.". Does that
> > mean the broker might have to retry with increased maxBytes if the first
> > request fails to read a batch? If so, how does it know how much to increase
> > maxBytes by?)
> >
> > Thanks,
> >
> > Tom


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Satish Duggana
>So that means a consumer which gets behind by half an hour will find its
reads being served from remote storage. And, if I understand the proposed
algorithm, each such consumer fetch request could result in a separate
fetch request from the remote storage. I.e. there's no mechanism to
amortize the cost of the fetching between multiple consumers fetching
similar ranges?

local log segments are deleted according to the local
log.retention.time/.size settings though they may have been already
copied to remote storage. Consumers would still be able to fetch the
messages from local storage if they are not yet deleted based on the
retention. They will be served from remote storage only when they are
not locally available.

Thanks,
Satish.

On Thu, Nov 7, 2019 at 7:58 AM Tom Bentley  wrote:
>
> Hi Ying,
>
> Because only inactive segments can be shipped to remote storage, to be able
> > to ship log data as soon
> > as possible, we will roll log segment very fast (e.g. every half hour).
> >
>
> So that means a consumer which gets behind by half an hour will find its
> reads being served from remote storage. And, if I understand the proposed
> algorithm, each such consumer fetch request could result in a separate
> fetch request from the remote storage. I.e. there's no mechanism to
> amortize the cost of the fetching between multiple consumers fetching
> similar ranges?
>
> (Actually the doc for RemoteStorageManager.read() says "It will read at
> least one batch, if the 1st batch size is larger than maxBytes.". Does that
> mean the broker might have to retry with increased maxBytes if the first
> request fails to read a batch? If so, how does it know how much to increase
> maxBytes by?)
>
> Thanks,
>
> Tom


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Tom Bentley
Hi Ying,

Because only inactive segments can be shipped to remote storage, to be able
> to ship log data as soon
> as possible, we will roll log segment very fast (e.g. every half hour).
>

So that means a consumer which gets behind by half an hour will find its
reads being served from remote storage. And, if I understand the proposed
algorithm, each such consumer fetch request could result in a separate
fetch request from the remote storage. I.e. there's no mechanism to
amortize the cost of the fetching between multiple consumers fetching
similar ranges?

(Actually the doc for RemoteStorageManager.read() says "It will read at
least one batch, if the 1st batch size is larger than maxBytes.". Does that
mean the broker might have to retry with increased maxBytes if the first
request fails to read a batch? If so, how does it know how much to increase
maxBytes by?)

Thanks,

Tom


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Satish Duggana
>Depends on the implementation, the data of one segment may not necessary be
stored in a single file.
There could be a maximum object / chunk / file size restriction on the
remote storage. So, one Kafka
segment could be saved in multiple chunks in remote storage.

A single local segment can be stored in multiple files, and each file
can have a base position as part of the metadata (like the name) of the
file or object.
File/object name can be <startOffset>-<endOffset>-<basePosition>. So
any read request for a position within that segment can be found by
computing the relative position, viz. `fetchPosition - basePosition`.



On Thu, Nov 7, 2019 at 6:04 AM Ying Zheng  wrote:
>
> 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> We do not assume the how the data is stored in the remote storage.
> Depends on the implementation, the data of one segment may not necessary be
> stored in a single file.
> There could be a maximum object / chunk / file size restriction on the
> remote storage. So, one Kafka
> segment could be saved in multiple chunks in remote storage.
>
> The remote log index also have a larger index interval. The default
> interval of the local .index file
> (log.index.interval.bytes) is 4KB. In the current HDFS RSM implementation,
> the default remote
> index interval (hdfs.remote.index.interval.bytes) is 256KB. The
> coarse-grained remote index saves
> some local disk space. The smaller size also makes it more likely to be
> cached in physical memory.
>
>
>
>
> On Thu, Oct 31, 2019 at 1:58 PM Jun Rao  wrote:
>
> > Hi, Harsha,
> >
> > I am still looking at the KIP and the PR. A couple of quick
> > comments/questions.
> >
> > 20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> > to remove it before it's merged to trunk. As Victor mentioned, we can
> > provide a reference implementation based on a mocked version of remote
> > storage.
> >
> > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > its relationship with RemoteLogSegmentInfo. It seems
> > that RemoteLogIndexEntry are offset index entries pointing to record
> > batches inside a segment. That seems to be the same as the .index file?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana 
> > wrote:
> >
> > > Hi Viktor,
> > > >1. Can we allow RLM Followers to serve read requests? After all segments
> > > on
> > > the cold storage are closed ones, no modification is allowed. Besides
> > > KIP-392 (
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > would introduce follower fetching too, so I think it would be nice to
> > > prepare RLM for this as well.
> > >
> > > That is a good point. We plan to support fetching remote storage from
> > > followers too. Current code in the PR work fine for this scenario
> > > though there may be some edge cases to be handled. We have not yet
> > > tested this scenario.
> > >
> > > >2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > >
> > remote.log.storage.manager.class.name
> > one already declares that they want
> > > to use remote storage. Would it make sense to remove
> > > the remote.log.storage.enable config?
> > >
> > > I do not think it is really needed. `remote.log.storage.enable`
> > > property can be removed.
> > >
> > > Thanks,
> > > Satish.
> > >
> > >
> > > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> > >  wrote:
> > > >
> > > > Hi Harsha,
> > > >
> > > > A couple more questions:
> > > > 1. Can we allow RLM Followers to serve read requests? After all
> > segments
> > > on
> > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > KIP-392 (
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > > would introduce follower fetching too, so I think it would be nice to
> > > > prepare RLM for this as well.
> > > > 2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > > >
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Ying Zheng
On Wed, Nov 6, 2019 at 4:33 PM Ying Zheng  wrote:

> 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> We do not assume the how the data is stored in the remote storage.
> Depends on the implementation, the data of one segment may not necessary
> be stored in a single file.
> There could be a maximum object / chunk / file size restriction on the
> remote storage. So, one Kafka
> segment could be saved in multiple chunks in remote storage.
>
> The remote log index also have a larger index interval. The default
> interval of the local .index file
> (log.index.interval.bytes) is 4KB. In the current HDFS RSM implementation,
> the default remote
> index interval (hdfs.remote.index.interval.bytes) is 256KB. The
> coarse-grained remote index saves
> some local disk space. The smaller size also makes it more likely to be
> cached in physical memory.
>

The remote log index file is also very different from the existing .index
file. With the current design, one .index file corresponds to one segment
file. But one remote log index file can correspond to many remote segments.

Because only inactive segments can be shipped to remote storage, to be able
to ship log data as soon
as possible, we will roll log segments very fast (e.g. every half hour).
This will lead to a large number of
small segments. If we maintain one remote index file for each remote
segment, we can easily hit some
OS limitations, like the maximum # of open files or the maximum # of
mmapped files.

So, instead of creating a new remote index file per remote segment, we
append the RemoteLogIndexEntries of multiple remote segments to one local
file. We will roll the remote index file at a configurable size or time
interval.
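
A minimal sketch of that append-and-roll behaviour; the file naming and
on-disk format here are assumptions, not the KIP's actual implementation:

// Sketch: index entries from many remote segments go into one local file,
// which is rolled by size, so the number of open/mmapped files stays small.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

class RemoteIndexFileWriter {
    private final Path dir;
    private final long maxFileSizeBytes;
    private RandomAccessFile current;
    private int fileSequence = 0;

    RemoteIndexFileWriter(Path dir, long maxFileSizeBytes) throws IOException {
        this.dir = dir;
        this.maxFileSizeBytes = maxFileSizeBytes;
        roll();
    }

    synchronized void append(byte[] serializedRemoteLogIndexEntry) throws IOException {
        if (current.length() + serializedRemoteLogIndexEntry.length > maxFileSizeBytes)
            roll(); // start a new file once the size limit is hit
        current.write(serializedRemoteLogIndexEntry);
    }

    private void roll() throws IOException {
        if (current != null)
            current.close();
        current = new RandomAccessFile(
                dir.resolve("remote-log-index." + fileSequence++).toFile(), "rw");
        current.seek(current.length());
    }
}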


Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Ying Zheng
21. I am not sure that I understood the need for RemoteLogIndexEntry and
its relationship with RemoteLogSegmentInfo. It seems
that RemoteLogIndexEntry are offset index entries pointing to record
batches inside a segment. That seems to be the same as the .index file?

We do not assume how the data is stored in the remote storage.
Depending on the implementation, the data of one segment may not necessarily
be stored in a single file.
There could be a maximum object / chunk / file size restriction on the
remote storage. So, one Kafka
segment could be saved in multiple chunks in remote storage.

The remote log index also has a larger index interval. The default
interval of the local .index file
(log.index.interval.bytes) is 4KB. In the current HDFS RSM implementation,
the default remote
index interval (hdfs.remote.index.interval.bytes) is 256KB. The
coarse-grained remote index saves
some local disk space. The smaller size also makes it more likely to be
cached in physical memory.




On Thu, Oct 31, 2019 at 1:58 PM Jun Rao  wrote:

> Hi, Harsha,
>
> I am still looking at the KIP and the PR. A couple of quick
> comments/questions.
>
> 20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> to remove it before it's merged to trunk. As Victor mentioned, we can
> provide a reference implementation based on a mocked version of remote
> storage.
>
> 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> Thanks,
>
> Jun
>
> On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana 
> wrote:
>
> > Hi Viktor,
> > >1. Can we allow RLM Followers to serve read requests? After all segments
> > on
> > the cold storage are closed ones, no modification is allowed. Besides
> > KIP-392 (
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > )
> > would introduce follower fetching too, so I think it would be nice to
> > prepare RLM for this as well.
> >
> > That is a good point. We plan to support fetching remote storage from
> > followers too. Current code in the PR work fine for this scenario
> > though there may be some edge cases to be handled. We have not yet
> > tested this scenario.
> >
> > >2. I think the remote.log.storage.enable config is redundant. By
> > specifying
> >
> remote.log.storage.manager.class.name
> one already declares that they want
> > to use remote storage. Would it make sense to remove
> > the remote.log.storage.enable config?
> >
> > I do not think it is really needed. `remote.log.storage.enable`
> > property can be removed.
> >
> > Thanks,
> > Satish.
> >
> >
> > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> >  wrote:
> > >
> > > Hi Harsha,
> > >
> > > A couple more questions:
> > > 1. Can we allow RLM Followers to serve read requests? After all
> segments
> > on
> > > the cold storage are closed ones, no modification is allowed. Besides
> > > KIP-392 (
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > )
> > > would introduce follower fetching too, so I think it would be nice to
> > > prepare RLM for this as well.
> > > 2. I think the remote.log.storage.enable config is redundant. By
> > specifying
> > >
> remote.log.storage.manager.class.name
> one already declares that they
> > want
> > > to use remote storage. Would it make sense to remove
> > > the remote.log.storage.enable config?
> > >
> > > Thanks,
> > > Viktor
> > >
> > >
> > > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > > viktorsomo...@gmail.com> wrote:
> > >
> > > > Hi Jun & Harsha,
> > > >
> > > > I think it would be beneficial to at least provide one simple
> reference
> > > > implementation (file system based?) as we do with connect too.
> > > > That would as a simple example and would help plugin developers to
> > better
> > > > understand the concept and the interfaces.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Wed, Oct 23, 2019 at 8:49 PM Jun Rao  

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-06 Thread Satish Duggana
Hi Jun,

>21. Could you elaborate a bit why the positions in remote segment is
different from the local one? I thought that they are identical copies.

They may not always be the same. Let me take an example here. If
remote storage is enabled with encryption, then the local positions
may not be the same as the positions copied to remote storage.

Thanks,
Satish.


On Tue, Nov 5, 2019 at 3:46 AM Jun Rao  wrote:
>
> Hi, Satish,
>
> Thanks for the response.
>
> 21. Could you elaborate a bit why the positions in remote segment is
> different from the local one? I thought that they are identical copies.
>
> Jun
>
>
> On Fri, Nov 1, 2019 at 4:26 AM Satish Duggana 
> wrote:
>
> > Hi Jun,
> > Thanks for looking into the updated KIP and clarifying our earlier queries.
> >
> > >20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> > to remove it before it's merged to trunk. As Victor mentioned, we can
> > provide a reference implementation based on a mocked version of remote
> > storage.
> >
> > Sure, sounds good.
> >
> > >21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > its relationship with RemoteLogSegmentInfo. It seems
> > that RemoteLogIndexEntry are offset index entries pointing to record
> > batches inside a segment. That seems to be the same as the .index file?
> >
> > That is a good point. `RemoteLogManager` does not require
> > `RemoteStorageManager` (RSM) to keep the positions in the remote
> > segment the same as in the local segment, or to maintain a correlation
> > between local segment positions and remote segment positions. The RSM
> > gives back the respective index entries for a given log segment, and
> > the caller then invokes the RSM to fetch the data by passing the
> > respective entry. This gives the RSM better control in managing the
> > given log segments.
> >
> > Thanks,
> > Satish.
> >
> > On Fri, Nov 1, 2019 at 2:28 AM Jun Rao  wrote:
> > >
> > > Hi, Harsha,
> > >
> > > I am still looking at the KIP and the PR. A couple of quick
> > > comments/questions.
> > >
> > > 20. It's fine to keep the HDFS binding temporarily in the PR. We just
> > need
> > > to remove it before it's merged to trunk. As Victor mentioned, we can
> > > provide a reference implementation based on a mocked version of remote
> > > storage.
> > >
> > > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > > its relationship with RemoteLogSegmentInfo. It seems
> > > that RemoteLogIndexEntry are offset index entries pointing to record
> > > batches inside a segment. That seems to be the same as the .index file?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana  > >
> > > wrote:
> > >
> > > > Hi Viktor,
> > > > >1. Can we allow RLM Followers to serve read requests? After all
> > segments
> > > > on
> > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > KIP-392 (
> > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > > )
> > > > would introduce follower fetching too, so I think it would be nice to
> > > > prepare RLM for this as well.
> > > >
> > > > That is a good point. We plan to support fetching remote storage from
> > > > followers too. Current code in the PR works fine for this scenario
> > > > though there may be some edge cases to be handled. We have not yet
> > > > tested this scenario.
> > > >
> > > > >2. I think the remote.log.storage.enable config is redundant. By
> > > > specifying
> > > > remote.log.storage.manager.class.name one already declares that they
> > want
> > > > to use remote storage. Would it make sense to remove
> > > > the remote.log.storage.enable config?
> > > >
> > > > I do not think it is really needed. `remote.log.storage.enable`
> > > > property can be removed.
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > >
> > > > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> > > >  wrote:
> > > > >
> > > > > Hi Harsha,
> > > > >
> > > > > A couple more questions:
> > > > > 1. Can we allow RLM Followers to serve read requests? After all
> > segments
> > > > on
> > > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > > KIP-392 (
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > > )
> > > > > would introduce follower fetching too, so I think it would be nice to
> > > > > prepare RLM for this as well.
> > > > > 2. I think the remote.log.storage.enable config is redundant. By
> > > > specifying
> > > > > remote.log.storage.manager.class.name one already declares that they
> > > > want
> > > > > to use remote storage. Would it make sense to remove
> > > > > the remote.log.storage.enable config?
> > > > >
> > > > > Thanks,
> > > > > Viktor
> > > > >
> > > > >
> > > > > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > > > > viktorsomo...@gmail.com> wrote:
> > > > >
> > > > > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-04 Thread Jun Rao
Hi, Satish,

Thanks for the response.

21. Could you elaborate a bit on why the positions in the remote segment are
different from those in the local one? I thought that they are identical copies.

Jun


On Fri, Nov 1, 2019 at 4:26 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for looking into the updated KIP and clarifying our earlier queries.
>
> >20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> to remove it before it's merged to trunk. As Victor mentioned, we can
> provide a reference implementation based on a mocked version of remote
> storage.
>
> Sure, sounds good.
>
> >21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> That is a good point. `RemoteLogManager` does not require
> `RemoteStorageManager` (RSM) to keep the positions in the remote
> segment the same as in the local segment, or to maintain a correlation
> between local segment positions and remote segment positions. The RSM
> gives back the respective index entries for a given log segment, and
> the caller then invokes the RSM to fetch the data by passing the
> respective entry. This gives the RSM better control in managing the
> given log segments.
>
> Thanks,
> Satish.
>
> On Fri, Nov 1, 2019 at 2:28 AM Jun Rao  wrote:
> >
> > Hi, Harsha,
> >
> > I am still looking at the KIP and the PR. A couple of quick
> > comments/questions.
> >
> > 20. It's fine to keep the HDFS binding temporarily in the PR. We just
> need
> > to remove it before it's merged to trunk. As Victor mentioned, we can
> > provide a reference implementation based on a mocked version of remote
> > storage.
> >
> > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > its relationship with RemoteLogSegmentInfo. It seems
> > that RemoteLogIndexEntry are offset index entries pointing to record
> > batches inside a segment. That seems to be the same as the .index file?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana  >
> > wrote:
> >
> > > Hi Viktor,
> > > >1. Can we allow RLM Followers to serve read requests? After all
> segments
> > > on
> > > the cold storage are closed ones, no modification is allowed. Besides
> > > KIP-392 (
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > would introduce follower fetching too, so I think it would be nice to
> > > prepare RLM for this as well.
> > >
> > > That is a good point. We plan to support fetching remote storage from
> > > followers too. Current code in the PR works fine for this scenario
> > > though there may be some edge cases to be handled. We have not yet
> > > tested this scenario.
> > >
> > > >2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > > remote.log.storage.manager.class.name one already declares that they
> want
> > > to use remote storage. Would it make sense to remove
> > > the remote.log.storage.enable config?
> > >
> > > I do not think it is really needed. `remote.log.storage.enable`
> > > property can be removed.
> > >
> > > Thanks,
> > > Satish.
> > >
> > >
> > > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> > >  wrote:
> > > >
> > > > Hi Harsha,
> > > >
> > > > A couple more questions:
> > > > 1. Can we allow RLM Followers to serve read requests? After all
> segments
> > > on
> > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > KIP-392 (
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > > would introduce follower fetching too, so I think it would be nice to
> > > > prepare RLM for this as well.
> > > > 2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > > > remote.log.storage.manager.class.name one already declares that they
> > > want
> > > > to use remote storage. Would it make sense to remove
> > > > the remote.log.storage.enable config?
> > > >
> > > > Thanks,
> > > > Viktor
> > > >
> > > >
> > > > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > > > viktorsomo...@gmail.com> wrote:
> > > >
> > > > > Hi Jun & Harsha,
> > > > >
> > > > > I think it would be beneficial to at least provide one simple
> reference
> > > > > implementation (file system based?) as we do with connect too.
> > > > > That would serve as a simple example and would help plugin developers to
> > > better
> > > > > understand the concept and the interfaces.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Wed, Oct 23, 2019 at 8:49 PM Jun Rao  wrote:
> > > > >
> > > > >> Hi, Harsha,
> > > > >>
> > > > >> Regarding feature branch, if the goal is faster collaboration, it
> > > seems
> > > > >> that doing the development on your own fork is better since
> > > non-committers
> > 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-04 Thread Harsha Chintalapani
Hi Jun,
  Can you please take a look at Satish's reply? Let us know if that
answers your question.
I would like to get your thoughts, and those of the rest of the
community, on the general direction we are going as we continue
to make progress.

Thanks,
Harsha

On Fri, Nov 1, 2019 at 3:06 AM Satish Duggana 
wrote:

> Hi Jun,
> Thanks for looking into the updated KIP and clarifying our earlier queries.
>
> >20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> to remove it before it's merged to trunk. As Victor mentioned, we can
> provide a reference implementation based on a mocked version of remote
> storage.
>
> Sure, sounds good.
>
> >21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> That is a good point. `RemoteLogManager` does not require
> `RemoteStorageManager` (RSM) to keep the positions in the remote
> segment the same as in the local segment, or to maintain a correlation
> between local segment positions and remote segment positions. The RSM
> gives back the respective index entries for a given log segment, and
> the caller then invokes the RSM to fetch the data by passing the
> respective entry. This gives the RSM better control in managing the
> given log segments.
>
> Thanks,
> Satish.
>
> On Fri, Nov 1, 2019 at 2:28 AM Jun Rao  wrote:
> >
> > Hi, Harsha,
> >
> > I am still looking at the KIP and the PR. A couple of quick
> > comments/questions.
> >
> > 20. It's fine to keep the HDFS binding temporarily in the PR. We just
> need
> > to remove it before it's merged to trunk. As Victor mentioned, we can
> > provide a reference implementation based on a mocked version of remote
> > storage.
> >
> > 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> > its relationship with RemoteLogSegmentInfo. It seems
> > that RemoteLogIndexEntry are offset index entries pointing to record
> > batches inside a segment. That seems to be the same as the .index file?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana  >
> > wrote:
> >
> > > Hi Viktor,
> > > >1. Can we allow RLM Followers to serve read requests? After all
> segments
> > > on
> > > the cold storage are closed ones, no modification is allowed. Besides
> > > KIP-392 (
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > would introduce follower fetching too, so I think it would be nice to
> > > prepare RLM for this as well.
> > >
> > > That is a good point. We plan to support fetching remote storage from
> > > followers too. Current code in the PR works fine for this scenario
> > > though there may be some edge cases to be handled. We have not yet
> > > tested this scenario.
> > >
> > > >2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > > remote.log.storage.manager.class.name one already declares that they
> want
> > > to use remote storage. Would it make sense to remove
> > > the remote.log.storage.enable config?
> > >
> > > I do not think it is really needed. `remote.log.storage.enable`
> > > property can be removed.
> > >
> > > Thanks,
> > > Satish.
> > >
> > >
> > > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> > >  wrote:
> > > >
> > > > Hi Harsha,
> > > >
> > > > A couple more questions:
> > > > 1. Can we allow RLM Followers to serve read requests? After all
> segments
> > > on
> > > > the cold storage are closed ones, no modification is allowed. Besides
> > > > KIP-392 (
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > )
> > > > would introduce follower fetching too, so I think it would be nice to
> > > > prepare RLM for this as well.
> > > > 2. I think the remote.log.storage.enable config is redundant. By
> > > specifying
> > > > remote.log.storage.manager.class.name one already declares that they
> > > want
> > > > to use remote storage. Would it make sense to remove
> > > > the remote.log.storage.enable config?
> > > >
> > > > Thanks,
> > > > Viktor
> > > >
> > > >
> > > > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > > > viktorsomo...@gmail.com> wrote:
> > > >
> > > > > Hi Jun & Harsha,
> > > > >
> > > > > I think it would be beneficial to at least provide one simple
> reference
> > > > > implementation (file system based?) as we do with connect too.
> > > > > That would serve as a simple example and would help plugin developers to
> > > better
> > > > > understand the concept and the interfaces.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Wed, Oct 23, 2019 at 8:49 PM Jun Rao  wrote:
> > > > >
> > > > >> Hi, Harsha,
> > > > >>
> > > > >> Regarding feature branch, if the goal is faster collaboration, it
> > > seems
> > > > >> that doing 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-11-01 Thread Satish Duggana
Hi Jun,
Thanks for looking into the updated KIP and clarifying our earlier queries.

>20. It's fine to keep the HDFS binding temporarily in the PR. We just need
to remove it before it's merged to trunk. As Victor mentioned, we can
provide a reference implementation based on a mocked version of remote
storage.

Sure, sounds good.

>21. I am not sure that I understood the need for RemoteLogIndexEntry and
its relationship with RemoteLogSegmentInfo. It seems
that RemoteLogIndexEntry are offset index entries pointing to record
batches inside a segment. That seems to be the same as the .index file?

That is a good point. `RemoteLogManager` does not require
`RemoteStorageManager` (RSM) to keep the positions in the remote
segment the same as in the local segment, or to maintain a correlation
between local segment positions and remote segment positions. The RSM
gives back the respective index entries for a given log segment, and
the caller then invokes the RSM to fetch the data by passing the
respective entry. This gives the RSM better control in managing the
given log segments.
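
A rough sketch of that read path (the method and type names below are assumptions for illustration only, not the interface defined in the KIP): the broker asks the RSM for the index entries of a remote segment, picks the entry covering the requested offset, and hands that entry back to the RSM to fetch the data.

    // Illustrative only; RemoteStorageManager, RemoteLogSegmentInfo and
    // RemoteLogIndexEntry method names here are assumed for the sake of the example.
    static Records readFromRemote(RemoteStorageManager rsm,
                                  RemoteLogSegmentInfo segment,
                                  long fetchOffset,
                                  int maxBytes) throws IOException {
        // 1. The RSM returns the index entries it maintains for this remote segment.
        List<RemoteLogIndexEntry> entries = rsm.getRemoteLogIndexEntries(segment);

        // 2. Pick the entry whose offset range covers the requested fetch offset.
        RemoteLogIndexEntry entry = entries.stream()
                .filter(e -> e.firstOffset() <= fetchOffset && fetchOffset <= e.lastOffset())
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("offset out of range: " + fetchOffset));

        // 3. Hand that entry back to the RSM, which resolves the remote position itself.
        return rsm.read(entry, maxBytes);
    }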

Thanks,
Satish.

On Fri, Nov 1, 2019 at 2:28 AM Jun Rao  wrote:
>
> Hi, Harsha,
>
> I am still looking at the KIP and the PR. A couple of quick
> comments/questions.
>
> 20. It's fine to keep the HDFS binding temporarily in the PR. We just need
> to remove it before it's merged to trunk. As Victor mentioned, we can
> provide a reference implementation based on a mocked version of remote
> storage.
>
> 21. I am not sure that I understood the need for RemoteLogIndexEntry and
> its relationship with RemoteLogSegmentInfo. It seems
> that RemoteLogIndexEntry are offset index entries pointing to record
> batches inside a segment. That seems to be the same as the .index file?
>
> Thanks,
>
> Jun
>
> On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana 
> wrote:
>
> > Hi Viktor,
> > >1. Can we allow RLM Followers to serve read requests? After all segments
> > on
> > the cold storage are closed ones, no modification is allowed. Besides
> > KIP-392 (
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > )
> > would introduce follower fetching too, so I think it would be nice to
> > prepare RLM for this as well.
> >
> > That is a good point. We plan to support fetching remote storage from
> > followers too. Current code in the PR works fine for this scenario
> > though there may be some edge cases to be handled. We have not yet
> > tested this scenario.
> >
> > >2. I think the remote.log.storage.enable config is redundant. By
> > specifying
> > remote.log.storage.manager.class.name one already declares that they want
> > to use remote storage. Would it make sense to remove
> > the remote.log.storage.enable config?
> >
> > I do not think it is really needed. `remote.log.storage.enable`
> > property can be removed.
> >
> > Thanks,
> > Satish.
> >
> >
> > On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
> >  wrote:
> > >
> > > Hi Harsha,
> > >
> > > A couple more questions:
> > > 1. Can we allow RLM Followers to serve read requests? After all segments
> > on
> > > the cold storage are closed ones, no modification is allowed. Besides
> > > KIP-392 (
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > )
> > > would introduce follower fetching too, so I think it would be nice to
> > > prepare RLM for this as well.
> > > 2. I think the remote.log.storage.enable config is redundant. By
> > specifying
> > > remote.log.storage.manager.class.name one already declares that they
> > want
> > > to use remote storage. Would it make sense to remove
> > > the remote.log.storage.enable config?
> > >
> > > Thanks,
> > > Viktor
> > >
> > >
> > > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > > viktorsomo...@gmail.com> wrote:
> > >
> > > > Hi Jun & Harsha,
> > > >
> > > > I think it would be beneficial to at least provide one simple reference
> > > > implementation (file system based?) as we do with connect too.
> > > > That would serve as a simple example and would help plugin developers to
> > better
> > > > understand the concept and the interfaces.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Wed, Oct 23, 2019 at 8:49 PM Jun Rao  wrote:
> > > >
> > > >> Hi, Harsha,
> > > >>
> > > >> Regarding feature branch, if the goal is faster collaboration, it
> > seems
> > > >> that doing the development on your own fork is better since
> > non-committers
> > > >> can push changes there.
> > > >>
> > > >> Regarding the dependencies, this is an important thing to clarify. My
> > > >> understanding for this KIP is that in Apache Kafka, we won't provide
> > any
> > > >> specific implementation for a particular block storage. There are many
> > > >> block storage systems out there (HDFS, S3, Google storage, Azure
> > storage,
> > > >> Ceph, etc). We don't want to drag in all those dependencies in Apache
> > > >> Kafka, even if they are in a separate module. 

Re: [DISCUSS] KIP-405: Kafka Tiered Storage

2019-10-31 Thread Jun Rao
Hi, Harsha,

I am still looking at the KIP and the PR. A couple of quick
comments/questions.

20. It's fine to keep the HDFS binding temporarily in the PR. We just need
to remove it before it's merged to trunk. As Victor mentioned, we can
provide a reference implementation based on a mocked version of remote
storage.

21. I am not sure that I understood the need for RemoteLogIndexEntry and
its relationship with RemoteLogSegmentInfo. It seems
that RemoteLogIndexEntry are offset index entries pointing to record
batches inside a segment. That seems to be the same as the .index file?

Thanks,

Jun

On Mon, Oct 28, 2019 at 9:11 PM Satish Duggana 
wrote:

> Hi Viktor,
> >1. Can we allow RLM Followers to serve read requests? After all segments
> on
> the cold storage are closed ones, no modification is allowed. Besides
> KIP-392 (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> )
> would introduce follower fetching too, so I think it would be nice to
> prepare RLM for this as well.
>
> That is a good point. We plan to support fetching remote storage from
> followers too. Current code in the PR works fine for this scenario
> though there may be some edge cases to be handled. We have not yet
> tested this scenario.
>
> >2. I think the remote.log.storage.enable config is redundant. By
> specifying
> remote.log.storage.manager.class.name one already declares that they want
> to use remote storage. Would it make sense to remove
> the remote.log.storage.enable config?
>
> I do not think it is really needed. `remote.log.storage.enable`
> property can be removed.
>
> Thanks,
> Satish.
>
>
> On Thu, Oct 24, 2019 at 2:46 PM Viktor Somogyi-Vass
>  wrote:
> >
> > Hi Harsha,
> >
> > A couple more questions:
> > 1. Can we allow RLM Followers to serve read requests? After all segments
> on
> > the cold storage are closed ones, no modification is allowed. Besides
> > KIP-392 (
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> )
> > would introduce follower fetching too, so I think it would be nice to
> > prepare RLM for this as well.
> > 2. I think the remote.log.storage.enable config is redundant. By
> specifying
> > remote.log.storage.manager.class.name one already declares that they
> want
> > to use remote storage. Would it make sense to remove
> > the remote.log.storage.enable config?
> >
> > Thanks,
> > Viktor
> >
> >
> > On Thu, Oct 24, 2019 at 10:37 AM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com> wrote:
> >
> > > Hi Jun & Harsha,
> > >
> > > I think it would be beneficial to at least provide one simple reference
> > > implementation (file system based?) as we do with connect too.
> > > That would serve as a simple example and would help plugin developers to
> better
> > > understand the concept and the interfaces.
> > >
> > > Best,
> > > Viktor
> > >
> > > On Wed, Oct 23, 2019 at 8:49 PM Jun Rao  wrote:
> > >
> > >> Hi, Harsha,
> > >>
> > >> Regarding feature branch, if the goal is faster collaboration, it
> seems
> > >> that doing the development on your own fork is better since
> non-committers
> > >> can push changes there.
> > >>
> > >> Regarding the dependencies, this is an important thing to clarify. My
> > >> understanding for this KIP is that in Apache Kafka, we won't provide
> any
> > >> specific implementation for a particular block storage. There are many
> > >> block storage systems out there (HDFS, S3, Google storage, Azure
> storage,
> > >> Ceph, etc). We don't want to drag in all those dependencies in Apache
> > >> Kafka, even if they are in a separate module. Doing that will make the
> > >> Kafka repo much harder to manage. We have used the same approach for
> > >> connect. The connect framework is in Apache Kafka, but all specific
> > >> connectors are hosted externally.
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >>
> > >>
> > >> On Wed, Oct 23, 2019 at 8:41 AM Eno Thereska 
> > >> wrote:
> > >>
> > >> > Thanks Satish, Harsha,
> > >> >
> > >> > It's probably worth making it clearer in the KIP what exact
> > >> > libraries will be added to libs, if any. The KIP specifies the
> remote
> > >> > storage interface but it isn't clear if particular implementations
> > >> > will be added to Kafka's repository or whether they will reside in
> > >> > other repositories. If I understand the intention correctly, you are
> > >> > proposing to have an HDFS and S3 implementation as part of the Kafka
> > >> > repository working out of the box. Is that correct?
> > >> >
> > >> > Thanks
> > >> > Eno
> > >> >
> > >> > On Wed, Oct 23, 2019 at 5:01 AM Satish Duggana <
> > >> satish.dugg...@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > >Regarding the HDFS dependency its not a direct dependency rather
> > >> > > its implementing the RemoteStorageManager interface.
> > >> > > We packaged it along with core to make it more convenient to test
> it.
> > >> We
> > >> > > can move this to 
