Re: [DISCUSS] KIP-890 Server Side Defense

2024-02-06 Thread Justine Olshan
I don't think AddPartitions is a good example since we currently don't gate
the version on TV or MV. (We only set a different flag depending on the TV)

Even if we did want to gate it on TV, I think the idea is to move away from
MV gating inter broker protocols. Ideally we can get to a state where MV is
just used for metadata changes.
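
For context, the finalized feature levels that such gating would rely on can already be listed independently with the features tool; a quick sketch (the bootstrap address is a placeholder, and a transaction version feature would appear as its own entry once it exists):

```bash
# List the finalized features on a cluster; metadata.version is just one feature among others.
bin/kafka-features.sh --bootstrap-server localhost:9092 describe
```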

I think some of this discussion might fit more with the feature version
KIP, so I can try to open that up soon. Until we settle that, some of the
work in KIP-890 is blocked.

Justine

On Mon, Feb 5, 2024 at 5:38 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> Since AddPartitions is an inter broker request, will its version be gated
> only by TV or other features like MV too? For example, if we need to change
> the protocol for AddPartitions for reasons other than txn verification in
> the future, will the new version be gated by a new MV? If so, does
> downgrading a TV imply potential downgrade of MV too?
>
> Jun
>
>
>
> On Mon, Feb 5, 2024 at 5:07 PM Justine Olshan  >
> wrote:
>
> > One TV gates the flexible feature version (no rpcs involved, only the
> > transactional records that should only be gated by TV)
> > Another TV gates the ability to turn on kip-890 part 2. This would gate
> the
> > version of Produce and EndTxn (likely only used by transactions), and
> > specifies a flag in AddPartitionsToTxn though the version is already used
> > without TV.
> >
> > I think the only concern is the Produce request and we could consider
> work
> > arounds similar to the AddPartitionsToTxn call.
> >
> > Justine
> >
> > On Mon, Feb 5, 2024 at 4:56 PM Jun Rao  wrote:
> >
> > > Hi, Justine,
> > >
> > > Which RPC/record protocols will TV guard? Going forward, will those
> > > RPC/record protocols only be guarded by TV and not by other features
> like
> > > MV?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Feb 5, 2024 at 2:41 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Sorry I think I misunderstood your question or answered incorrectly.
> > The
> > > TV
> > > > version should ideally be fully independent from MV.
> > > > At least for the changes I proposed, TV should not affect MV and MV
> > > should
> > > > not affect TV.
> > > >
> > > > I think if we downgrade TV, only that feature should downgrade.
> > Likewise
> > > > the same with MV. The finalizedFeatures should just reflect the
> feature
> > > > downgrade we made.
> > > >
> > > > I also plan to write a new KIP for managing the disk format and
> upgrade
> > > > tool as we will need new flags to support these features. That should
> > > help
> > > > clarify some things.
> > > >
> > > > Justine
> > > >
> > > > On Mon, Feb 5, 2024 at 11:03 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > So, if we downgrade TV, we could implicitly downgrade another
> feature
> > > > (say
> > > > > MV) that has dependency (e.g. RPC). What would we return for
> > > > > FinalizedFeatures for MV in ApiVersionsResponse in that case?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Feb 2, 2024 at 1:06 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > Yes, the idea is that if we downgrade TV (transaction version) we
> > > will
> > > > > stop
> > > > > > using the add partitions to txn optimization and stop writing the
> > > > > flexible
> > > > > > feature version of the log.
> > > > > > In the compatibility section I included some explanations on how
> > this
> > > > is
> > > > > > done.
> > > > > >
> > > > > > Thanks,
> > > > > > Justine
> > > > > >
> > > > > > On Fri, Feb 2, 2024 at 11:12 AM Jun Rao  >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Justine,
> > > > > > >

Re: [VOTE] 3.7.0 RC4

2024-02-14 Thread Justine Olshan
Hey Stan,
Did we ever get system tests results? I can also start a run

On Mon, Feb 12, 2024 at 1:06 PM Jakub Scholz  wrote:

> +1 (non-binding). I used the staged binaries with Scala 2.13 and the staged
> Maven artifacts to run my tests. All seems to work fine. Thanks.
>
> Jakub
>
> On Fri, Feb 9, 2024 at 4:20 PM Stanislav Kozlovski
>  wrote:
>
> > Hello Kafka users, developers and client-developers,
> >
> > This is the second candidate we are considering for release of Apache
> Kafka
> > 3.7.0.
> >
> > Major changes include:
> > - Early Access to KIP-848 - the next generation of the consumer rebalance
> > protocol
> > - Early Access to KIP-858: Adding JBOD support to KRaft
> > - KIP-714: Observability into Client metrics via a standardized interface
> >
> > Release notes for the 3.7.0 release:
> >
> >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/RELEASE_NOTES.html
> >
> > *** Please download, test and vote by Thursday, February 15th, 9AM PST
> ***
> >
> > Kafka's KEYS file containing PGP keys we use to sign the release:
> > https://kafka.apache.org/KEYS
> >
> > * Release artifacts to be voted upon (source and binary):
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/
> >
> > * Docker release artifact to be voted upon:
> > apache/kafka:3.7.0-rc4
> >
> > * Maven artifacts to be voted upon:
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >
> > * Javadoc:
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/javadoc/
> >
> > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > https://github.com/apache/kafka/releases/tag/3.7.0-rc4
> >
> > * Documentation:
> > https://kafka.apache.org/37/documentation.html
> >
> > * Protocol:
> > https://kafka.apache.org/37/protocol.html
> >
> > * Successful Jenkins builds for the 3.7 branch:
> >
> > Unit/integration tests: I am in the process of running and analyzing
> these.
> > System tests: I am in the process of running these.
> >
> > Expect a follow-up over the weekend
> >
> > * Successful Docker Image Github Actions Pipeline for 3.7 branch:
> > Docker Build Test Pipeline:
> > https://github.com/apache/kafka/actions/runs/7845614846
> >
> > /**
> >
> > Best,
> > Stanislav
> >
>


Re: [VOTE] 3.7.0 RC4

2024-02-14 Thread Justine Olshan
I just merged a PR to trunk that fixes some of the LogDirFailureTests.

I believe the issue is mostly related to the tests and not the code. Not
sure if we think it is necessary to hotfix since the change was in the
tests.
https://github.com/apache/kafka/commit/be6653c8bc25717e25a7db164527635a6579b4cc

Thanks for the results and the gist!

Justine

On Wed, Feb 14, 2024 at 11:53 AM Stanislav Kozlovski
 wrote:

> Hey Justine,
>
> Good question. I have been running the system tests and I have been
> preparing a doc to share. I got one run completed and was waiting for a
> second one to cross-check, but unfortunately I had a build take 35+ hours
> and canceled it. I'm currently 6h into the second one, expecting it to
> complete in 11h or so.
>
> There were quite a bit, so I was looking forward to get another run in to
> cross-check the flakes:
> -
>
> https://gist.githubusercontent.com/stanislavkozlovski/3a97fc7602f3fee40cb374f1343d46b6/raw/b9de6e0eb975e8234d43bc725982096e47fd0457/rc4_test_failures_system_test_run_1.md
>
> I also checked the integration tests and saw
> - kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll
> - kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll
> failed three times in a row.
>
> Here is the gist with the results:
> https://gist.github.com/stanislavkozlovski/820976fc7bfb5f4dcdf9742fd96a9982
>
>
>
> On Wed, Feb 14, 2024 at 6:31 PM Justine Olshan
> 
> wrote:
>
> > Hey Stan,
> > Did we ever get system tests results? I can also start a run
> >
> > On Mon, Feb 12, 2024 at 1:06 PM Jakub Scholz  wrote:
> >
> > > +1 (non-binding). I used the staged binaries with Scala 2.13 and the
> > staged
> > > Maven artifacts to run my tests. All seems to work fine. Thanks.
> > >
> > > Jakub
> > >
> > > On Fri, Feb 9, 2024 at 4:20 PM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Hello Kafka users, developers and client-developers,
> > > >
> > > > This is the second candidate we are considering for release of Apache
> > > Kafka
> > > > 3.7.0.
> > > >
> > > > Major changes include:
> > > > - Early Access to KIP-848 - the next generation of the consumer
> > rebalance
> > > > protocol
> > > > - Early Access to KIP-858: Adding JBOD support to KRaft
> > > > - KIP-714: Observability into Client metrics via a standardized
> > interface
> > > >
> > > > Release notes for the 3.7.0 release:
> > > >
> > > >
> > >
> >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/RELEASE_NOTES.html
> > > >
> > > > *** Please download, test and vote by Thursday, February 15th, 9AM
> PST
> > > ***
> > > >
> > > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > https://kafka.apache.org/KEYS
> > > >
> > > > * Release artifacts to be voted upon (source and binary):
> > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/
> > > >
> > > > * Docker release artifact to be voted upon:
> > > > apache/kafka:3.7.0-rc4
> > > >
> > > > * Maven artifacts to be voted upon:
> > > >
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > >
> > > > * Javadoc:
> > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/javadoc/
> > > >
> > > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > > > https://github.com/apache/kafka/releases/tag/3.7.0-rc4
> > > >
> > > > * Documentation:
> > > > https://kafka.apache.org/37/documentation.html
> > > >
> > > > * Protocol:
> > > > https://kafka.apache.org/37/protocol.html
> > > >
> > > > * Successful Jenkins builds for the 3.7 branch:
> > > >
> > > > Unit/integration tests: I am in the process of running and analyzing
> > > these.
> > > > System tests: I am in the process of running these.
> > > >
> > > > Expect a follow-up over the weekend
> > > >
> > > > * Successful Docker Image Github Actions Pipeline for 3.7 branch:
> > > > Docker Build Test Pipeline:
> > > > https://github.com/apache/kafka/actions/runs/7845614846
> > > >
> > > > /**
> > > >
> > > > Best,
> > > > Stanislav
> > > >
> > >
> >
>
>
> --
> Best,
> Stanislav
>


Re: [VOTE] 3.7.0 RC4

2024-02-20 Thread Justine Olshan
Hey folks,

I've done the following to validate the release:

-- validated the keys for all the artifacts
-- built from source and started a ZK cluster -- ran a few workloads on it.
-- ran a 2.12 KRaft cluster and ran a few workloads on it
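
A minimal sketch of the key-validation step above, assuming the staged Scala 2.13 artifact (file names are illustrative):

```bash
# Import the release KEYS file and verify the detached signature of a staged artifact.
gpg --import KEYS
gpg --verify kafka_2.13-3.7.0.tgz.asc kafka_2.13-3.7.0.tgz
```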

I see there is a lot of ongoing discussion about the upgrade notes. +1
(binding) from me given Mickael is voting +1 as well.

Justine

On Tue, Feb 20, 2024 at 6:18 AM Divij Vaidya 
wrote:

> > I am a bit unclear on the precise process regarding what parts of this
> get merged at what time, and whether the release first needs to be done or
> not.
>
> The order is as follows:
>
> 1. Release approved as part of this vote. After this we follow the
> steps from here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Release+Process#ReleaseProcess-Afterthevotepasses
>
> 2. Upload artifacts to maven etc. These artifacts do not have RC suffix in
> them. You need a PMC member to mark these artifacts as "production" in
> apache svn.
> 3. Update website changes (docs, blog etc.). This is where your PRs
> on kafka-site repo get merged.
> 4. Send a release announcement by email.
>
> --
> Divij Vaidya
>
>
>
> On Tue, Feb 20, 2024 at 3:02 PM Stanislav Kozlovski
>  wrote:
>
> > Thanks for testing the release! And thanks for the review on the
> > documentation. Good catch on the license too.
> >
> > I have addressed the comments in the blog PR, and opened a few other PRs
> to
> > the website in relation to the release.
> >
> > - 37: Add download section for the latest 3.7 release
> > 
> > - 37: Update default docs to point to the 3.7.0 release docs
> > 
> > - 3.7: Add blog post for Kafka 3.7
> > 
> > - MINOR: Update stale upgrade_3_6_0 header links in documentation
> > 
> > - 37: Add upgrade notes for the 3.7.0 release
> > 
> >
> > I am a bit unclear on the precise process regarding what parts of this
> get
> > merged at what time, and whether the release first needs to be done or
> not.
> >
> > Best,
> > Stanislav
> >
> > On Mon, Feb 19, 2024 at 8:34 PM Divij Vaidya 
> > wrote:
> >
> > > Great. In that case we can fix the license issue retrospectively. I
> have
> > > created a JIRA for it
> https://issues.apache.org/jira/browse/KAFKA-16278
> > > and
> > > also updated the release process (which redirects to
> > > https://issues.apache.org/jira/browse/KAFKA-12622) to check for the
> > > correct
> > > license in both the kafka binaries.
> > >
> > > I am +1 (binding) assuming Mickael's concerns about update notes to 3.7
> > are
> > > addressed before release.
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Mon, Feb 19, 2024 at 6:08 PM Mickael Maison <
> mickael.mai...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I agree with Josep, I don't think it's worth making a new RC just for
> > > this.
> > > >
> > > > Thanks Stanislav for sharing the test results. The last thing holding
> > > > me from casting my vote is the missing upgrade notes for 3.7.0.
> > > >
> > > > Thanks,
> > > > Mickael
> > > >
> > > >
> > > >
> > > > On Mon, Feb 19, 2024 at 4:28 PM Josep Prat
>  > >
> > > > wrote:
> > > > >
> > > > > I think I remember finding a similar problem (NOTICE_binary) and it
> > > > didn't
> > > > > qualify for an extra RC
> > > > >
> > > > > Best,
> > > > >
> > > > > On Mon, Feb 19, 2024 at 3:44 PM Divij Vaidya <
> > divijvaidy...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I have performed the following checks. The only thing I would
> like
> > to
> > > > call
> > > > > > out is the missing licenses before providing a vote. How do we
> want
> > > > > > to proceed on this? What have we done in the past? (Creating a
> new
> > RC
> > > > is
> > > > > > overkill IMO for this license issue).
> > > > > >
> > > > > > ## License check
> > > > > >
> > > > > > Test: Validate license of dependencies for both 2.12 & 2.13
> binary.
> > > > > > Result: Missing license for some scala* libraries specifically
> for
> > > > 2.12.
> > > > > > Seems like we have been missing these licenses for quite some
> > version
> > > > now.
> > > > > >
> > > > > > ```
> > > > > > for f in $(ls libs | grep -v "^kafka\|connect\|trogdor"); do if !
> > > grep
> > > > -q
> > > > > > ${f%.*} LICENSE; then echo "${f%.*} is missing in license file";
> > fi;
> > > > done
> > > > > > scala-collection-compat_2.12-2.10.0 is missing in license file
> > > > > > scala-java8-compat_2.12-1.0.2 is missing in license file
> > > > > > scala-library-2.12.18 is missing in license file
> > > > > > scala-logging_2.12-3.9.4 is missing in license file
> > > > > > scala-reflect-2.12.18 is missing in license file
> > > > > > ```
> > > > > >
> > > > > > ## Long running tests for memory leak (on ARM machine with zstd)
> > > > > >
> > > > > > Test: Run produce/consume for a few ho

Re: [DISCUSS]

2024-02-26 Thread Justine Olshan
Hey there,

Thanks for bringing this discussion to the mailing list. I think this is an
interesting one.

I am a little concerned about the ability to add any topic ID to the create
topic request. It means that we can no longer rely on topic IDs being
globally unique. I worry folks may misunderstand and think they should be
creating their own IDs for topics and getting into sticky situations.

I wonder if there is a way to encapsulate what you want to do into an API
(ie some sort of formalized migration of a topic). This would make it
clearer that this is meant to be the "same" topic.
I also wonder if this is something that would be on the tiered storage
side. Is the use case combining two topics or recreating the topic and
using the same tiered storage? I'm curious about this use case and if the
topic is intended to be the "same" topic on the recreated cluster.

Thanks,
Justine

On Mon, Feb 26, 2024 at 5:39 AM Anton Agestam
 wrote:

> Hello everyone,
>
> I'm hoping to start a discussion to get some feedback on a topic, I assume
> this is something that would require a KIP, but wanted to start out with a
> discussion to gather some early feedback.
>
> We have a use case internally for creating topics with a given topic ID.
> The need for this arises because we're working on the capability to delete
> a cluster and later recreate its tiered storage topics in an entirely new
> cluster. For that to work, since the organization of the tiered data is
> dependent on the topic ID, the topic created in the fresh cluster must have
> the same ID as the original one in the deleted cluster.
>
> We have a small patch only used for internal testing so far, showing that
> adding TopicID as a tagged field on CreateTopicsRequest and implementing
> the capability is a small and uncomplicated change:
>
> https://github.com/aiven/kafka/pull/48/commits/fa5ebbca642391af630f1c7f7cc013fe56856e13
> .
>
> We also foresee that the same need will arise from tools that replicate
> topics between clusters, such as mirrormaker2.
>
> Are there any thoughts about this proposal, or any pointers for moving
> forward formalizing this in a KIP?
>
> Best regards,
> Anton
>
> --
> [image: Aiven] 
> *Anton Agestam* (he/him or they/them)
> Software Engineer, *Aiven*
> anton.ages...@aiven.io   |   +46 704 486 289
> aiven.io    |
> 
>  >
>


Re: [DISCUSS] Apache Kafka 3.8.0 release

2024-02-26 Thread Justine Olshan
Thanks Josep. +1 from me.

On Mon, Feb 26, 2024 at 3:37 AM Josep Prat 
wrote:

> Hi all,
>
> I'd like to volunteer as release manager for the Apache Kafka 3.8.0
> release.
> If there are no objections, I'll start building a release plan (or adapting
> the one Colin made some weeks ago) in the wiki in the next days.
>
> Thank you.
>
> --
> [image: Aiven] 
>
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io    |    >
>      <
> https://twitter.com/aiven_io>
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B
>


[DISCUSS] KIP-1022 Formatting and Updating Features

2024-02-26 Thread Justine Olshan
Hello folks,

I'm proposing a KIP that allows for setting and upgrading new features
(other than metadata version) via the Kafka storage format and feature
tools. This KIP extends the feature versioning changes introduced by
KIP-584 by allowing those features to be set and upgraded.

Please take a look:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features

Thanks,

Justine


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-02-28 Thread Justine Olshan
Hey Andrew,

Thanks for taking a look.

I previously didn't include 1. We do plan to use these features immediately
for KIP-890 and KIP-848. If we think it is easier to put the discussion in
those KIP discussions we can, but I fear that it will easily get lost given
the size of the KIPs.

I named the features similar to how we named metadata version. Transaction
version would control transaction features like enabling a new transaction
record format and APIs to enable KIP-890 part 2. Likewise, the group
coordinator version would also enable the new record formats there and the
new group coordinator. I am open to new names or further discussion.

For 2 and 3, I can provide example scripts that show the usage. I am open
to adding --latest-stable as well.
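
For example, sketches along these lines (flag names, feature names, and version values are proposals under discussion, not final syntax):

```bash
# (2) Format a new broker with an individual feature version pinned explicitly:
bin/kafka-storage.sh format --config server.properties \
  --cluster-id $(bin/kafka-storage.sh random-uuid) \
  --features transaction.version=1

# (3) Move a running cluster to the latest stable feature versions for a release:
bin/kafka-features.sh --bootstrap-server localhost:9092 upgrade --release-version 3.8
```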

Justine

On Tue, Feb 27, 2024 at 4:59 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Justine,
> Thanks for the KIP. This area of Kafka is complicated and making it easier
> is good.
>
> When I use the `kafka-features.sh` tool to describe the features on my
> cluster, I find that there’s a
> single feature called “metadata.version”. I think this KIP does a handful
> of things:
>
> 1) It introduces the idea of two new features, TV and GCV, without giving
> them concrete names or
> describing their behaviour.
> 2) It introduces a new flag on the storage tool to enable advanced users
> to control individual features
> when they format storage for a new broker.
> 3) It introduces a new flag on the features tool to enable a set of latest
> stable features for a given
> version to be enabled all together.
>
> I think that (1) probably shouldn’t be in this KIP unless there are
> concrete details. Maybe this KIP is enabling
> the operator experience when we introduce TV and GCV in other KIPs. I
> don’t believe the plan is to enable
> the new group coordinator with a feature, and it seems unnecessary to me.
> I think it’s more compelling for TV
> given the changes in transactions.
>
> For (2) and (3), it would be helpful to explicit about the syntax for the
> enhancements to the tool. I think
> that for the features tool, `--release-version` is an optional parameter
> which requires a RELEASE_VERSION
> argument. I wonder whether it would be helpful to have `--latest-stable`
> as an option too.
>
> Thanks,
> Andrew
>
> > On 26 Feb 2024, at 21:26, Justine Olshan 
> wrote:
> >
> > Hello folks,
> >
> > I'm proposing a KIP that allows for setting and upgrading new features
> > (other than metadata version) via the kafka storage format and feature
> > tools. This KIP extends on the feature versioning changes introduced by
> > KIP-584 by allowing for the features to be set and upgraded.
> >
> > Please take a look:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features
> >
> > Thanks,
> >
> > Justine
>
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-02-29 Thread Justine Olshan
Hey folks,

Thanks for the discussion. Let me try to cover everyone's comments.

Artem --
I can add the examples you mentioned. As for naming, right now the feature
is called "transaction version" or "TV". Are you suggesting it should be
called "transaction protocol version" or "TPV"? I don't mind that, but just
wanted to clarify if we want to include protocol or if simply "transaction
version" is enough.

Jun --

10. *With more features, would each of those be controlled by a separate feature or multiple features? For example, is the new transaction record format controlled only by MV with TV having a dependency on MV, or is it controlled by both MV and TV?*

I think this will need to be decided on a case by case basis. There should
be a mechanism to set dependencies among features.
For transaction version specifically, I have no metadata version
dependencies besides requiring 3.3 to write the feature records and use the
feature tools. I would suspect all new features would have this requirement.


11. *Basically, if --release-version is not used, the command will just use the latest production version of every feature. Should we apply that logic to both tools?*

How would this work with the upgrade tool? I think we want a way to set a
new feature version for one feature and not touch any of the others.


12. *Should we remove --metadata METADATA from kafka-features? It does the same thing as --release-version.*

When I previously discussed this tool offline with Colin McCabe, he was strongly against deprecating or changing flags. I personally think it could be good to unify and not support a ton of flags, but I would want to make sure he is aligned.


13. *KIP-853 also extends the tools to support a new feature, kraft.version. It would be useful to have alignment between that KIP and this one.*

Sounds good. Looks like Jose is in on the discussion so we can continue
here. :)



Jose --


1. *KIP-853 uses --feature for kafka-storage instead of --features. This is consistent with the use of --feature in the "kafka-feature.sh upgrade" command.*

I wanted to include multiple features in one command, so it seems like
features is a better name. I discuss more below about why I think we should
allow setting multiple features at once.


2. *I find it a bit inconsistent that --feature and --release-version are mutually exclusive in the kafka-feature CLI but not in the kafka-storage CLI. What is the reason for this decision?*

For the storage tool, we are setting all the features for the cluster. By default, all are set. For the upgrade tool, the default is to set one feature. In the storage tool, it is natural for --release-version to set the remaining features that --features didn't cover, since otherwise we would need to set them all if we use the flag. In the feature upgrade case, it is less necessary for all the features to be set at once, and the tool can be run multiple times. I'm not opposed to allowing both flags, but it is less necessary in my opinion.


3. *KIP-853 deprecates --metadata in kafka-features and makes it an alias for --release-version. In KIP-1022, what happens if the user specified both --metadata and --feature?*

See my note above (Jun's comment 12) about deprecating old commands. I
would think as the KIP stands now, we would not accept both commands.


4. *I would suggest keeping this consistent with kafka-features. It would avoid having to implement one more parser in Kafka.*

I sort of already implemented it as such, so I don't think it is too
tricky. I'm not sure of an alternative. Kafka features currently only
supports one feature at a time.
I would like to support more than one for the storage tool. Do you have
another suggestion for multiple features in the storage tool?


5. *As currently described, trial and error seem to be the only mechanism. Should the Kafka documentation describe these dependencies? Is that good enough?*

The plan so far is documentation. The idea is that this is an advanced feature, so I think it is reasonable to ask folks to use the documentation.


6. *Did you mean that 3.8-IV4 would map to TV2? If not, 3.8-IV3 would map to two different TV values.*

It was a typo. Each MV maps to a single other feature version.


7. *For --release-version in "kafka-features upgrade", does --release-version mean that all of the feature versions will either upgrade or stay the same? Meaning that downgrades will be rejected by the Kafka controller. How about when --release-version is used with "kafka-features downgrade"?*

The idea I had was that the cluster will check if all the versions are
supported. If any version is not supported the tool will throw an error. We
can also issue a warning if the given command (if supported by the cluster)
will downgrade a feature. One option is also to require a yes/no prompt or
flag to allow downgrades. The downgrade tool would allow the same.
Alternativel

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-02-29 Thread Justine Olshan
Thanks José --

Let me answer a few quick things

If you are not going to deprecate or alias --metadata what happens if
the user uses these flags in one command: "kafka-features upgrade
--metadata 3.8 --feature kraft.version=1"

> I  would think as the KIP stands now, we would not accept both commands

The --metadata flag was added as part of KIP-778. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-778%3A+KRaft+to+KRaft+Upgrades
and
https://github.com/apache/kafka/commit/28d5a059438634db6fdecdbb816e2584715884d6


I think the usage of --feature multiple times is a little bit clunky, but
we can use it for consistency with the existing tool. I was hoping it could
be simple to set one feature differently and take the default for
everything else. But alas.
For the storage tool, if we have to set all the features, do you expect to
throw an error when one is missing? This seems acceptable, but just wanted
to clarify.

For documentation, we can have an autogenerated doc. I haven't thought too much about how this would look yet.

I am ok with throwing an error if a feature isn't moving in the correct
direction (upgrade or downgrade) for the command that is given.

I will continue to think about everyone's comments and update the KIP over
the next few days.

Thanks,
Justine

On Thu, Feb 29, 2024 at 4:31 PM José Armando García Sancio
 wrote:

> Thanks for the reply Justine. See my comments below:
>
> On Thu, Feb 29, 2024 at 3:39 PM Justine Olshan
>  wrote:
> > I wanted to include multiple features in one command, so it seems like
> > features is a better name. I discuss more below about why I think we
> should
> > allow setting multiple features at once.
>
> --feature in kafka-features allows for multiple features to be
> specified. Please see the implementation of the tool and
> UpdateFeatures RPC.
>
> For example, you can execute this command kafka-features.sh upgrade
> --feature metadata.version=15 --feature kraft.version=1. You should be
> able to support the same UI in kafka-storage.sh. It is what KIP-853
> suggests in the interest of making it consistent across CLI.
>
> > For the storage tool, we are setting all the features for the cluster. By
> > default, all are set. For the upgrade tool, the default is to set one
> > feature. In the storage tool, it is natural for the --release-version to
> > set the remaining features that --features didn't cover since otherwise
> we
> > would need to set them all
> >
> > If we use the flag. In the feature upgrade case, it is less necessary for
> > all the features to be set at once and the tool can be run multiple
> times.
> > I'm not opposed to allowing both flags, but it is less necessary in my
> > opinion.
>
> I was thinking of making them mutually exclusive in both cases. Much
> easier to explain and document. If the user wants to use --feature in
> kafka-storage, they need to specify all of the supported features.
>
> For the "kafka-feature upgrade" case, they can enumerate all of the
> --feature versions that they want to upgrade in one command.
>
> > See my note above (Jun's comment 12) about deprecating old commands. I
> > would think as the KIP stands now, we would not accept both commands.
>
> If you are not going to deprecate or alias --metadata what happens if
> the user uses these flags in one command: "kafka-features upgrade
> --metadata 3.8 --feature kraft.version=1"
>
> One of the problems I am having is that I can't seem to find the KIP
> that introduces --metadata so I can only speculate as to the intent of
> this flag from the CLI help and implementation. Do you know which KIP
> added that flag?
>
> > I sort of already implemented it as such, so I don't think it is too
> > tricky. I'm not sure of an alternative. Kafka features currently only
> > supports one feature at a time.
> > I would like to support more than one for the storage tool. Do you have
> > another suggestion for multiple features in the storage tool?
>
> "kafka-features.sh upgrade" supports multiple --feature flags. Please
> see the tools implementation and the UpdateFeatures RPC.
>
> You can specify multiple features in the storage tool with:
> "kafka-storage format --feature kraft.version=1 --feature
> metadata.version=15". The command line parser used by both tools
> support flags that append values to a list. This is how the
> kafka-features CLI works already.
>
> > The plan so far is documentation. The idea is that this is an advanced
> > feature, so I think it is reasonable to ask folks use documentation
>
> Are you going to generate the documentation of these dependencies
> automatically from the released code

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-07 Thread Justine Olshan
Hi Colin,

Thanks for bringing up these points. I believe Jose suggested failing the command if some features would upgrade on the downgrade call, and vice versa. This could address some of the concerns you bring up here. I agree we should not have an RPC to do both.

As for setting metadata only, we do have the option of using the --feature
flag to set only metadata version, but that requires writing it by short
rather than by metadata string. Not sure if that's what folks had in mind.
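
To illustrate with the flags as they stand today (the numeric feature level is illustrative only, not an exact mapping to the release string):

```bash
# Set the metadata version via the release string:
bin/kafka-features.sh --bootstrap-server localhost:9092 upgrade --metadata 3.7
# Set metadata.version via its short feature level instead:
bin/kafka-features.sh --bootstrap-server localhost:9092 upgrade --feature metadata.version=19
```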

Including a command to list the features for a given version is pretty useful. I can add that to the KIP unless folks have concerns.

Justine

On Thu, Mar 7, 2024 at 1:28 PM Colin McCabe  wrote:

> Hi Jun and Justine,
>
> My understanding is that in the KIP, --metadata does not do the same thing
> as --release-version. The --metadata flag sets the metadata version only,
> and does not change any other feature. So neither flag replaces the other.
>
> In general, I'm not sure that this functionality belongs in the features
> command. It will have a lot of confusing interactions with the other flags.
> If we do want it in the features command it should be its own subcommand,
> not upgrade or downgrade. Because setting all the features could involve
> upgrading some, but downgrading others. So really what we're doing is
> neither an upgrade nor a downgrade, but a sync to a release version.
>
> This also raises some questions about the RPC. If we want to use the
> existing features RPC, we cannot perform a sync that does both upgrades and
> downgrades in a single RPC. So we'd either need to be non-atomic, or just
> refuse to do it in that case. Or add a new server-side RPC for this.
>
> Perhaps it would be more useful to just have a command that tells people
> what the levels of each feature would be for a given release version? Then
> the user could choose which ones to upgrade / downgrade (if any) and do
> them carefully one at a time. The shotgun approach, where you upgrade
> everything at once, actually isn't what I think most users want to do in
> production.
>
> best,
> Colin
>
>
> On Fri, Mar 1, 2024, at 13:25, Jun Rao wrote:
> > Hi, Justine,
> >
> > Thanks for the reply.
> >
> > 11. Yes, we can still have the option to set individual features. I was
> > suggesting that if one uses the tool without either --release-version
> > or --features, it will just use the latest version of every feature that
> it
> > knows. This is what's proposed in the original KIP-584.
> >
> > 12. If we can't deprecate --metadata METADATA, it would be useful to
> > describe the reason.
> >
> > Jun
> >
> > On Thu, Feb 29, 2024 at 3:39 PM Justine Olshan
> 
> > wrote:
> >
> >> Hey folks,
> >>
> >> Thanks for the discussion. Let me try to cover everyone's comments.
> >>
> >> Artem --
> >> I can add the examples you mentioned. As for naming, right now the
> feature
> >> is called "transaction version" or "TV". Are you suggesting it should be
> >> called "transaction protocol version" or "TPV"? I don't mind that, but
> just
> >> wanted to clarify if we want to include protocol or if simply
> "transaction
> >> version" is enough.
> >>
> >> Jun --
> >>
> >> 10.  *With **more features, would each of those be controlled by a
> separate
> >> feature or*
> >>
> >> *multiple features. For example, is the new transaction record format*
> >>
> >> *controlled only by MV with TV having a dependency on MV or is it
> >> controlled*
> >>
> >> *by both MV and TV.*
> >>
> >>
> >> I think this will need to be decided on a case by case basis. There
> should
> >> be a mechanism to set dependencies among features.
> >> For transaction version specifically, I have no metadata version
> >> dependencies besides requiring 3.3 to write the feature records and use
> the
> >> feature tools. I would suspect all new features would have this
> >> requirement.
> >>
> >>
> >> 11. *Basically, if **--release-version is not used, the command will
> just
> >> use the latest*
> >>
> >> *production version of every feature. Should we apply that logic to
> both*
> >>
> >> *tools?*
> >>
> >>
> >> How would this work with the upgrade tool? I think we want a way to set
> a
> >> new feature version for one feature and not touch any of the others.
> >>
> >>
> >> *12. Should we remove --me

Re: [VOTE] KIP-939: Support Participation in 2PC

2024-03-12 Thread Justine Olshan
Hey Artem,
I took another look at the KIP and the discussion on the mailing list.

+1 (binding) from me :)

Justine

On Wed, Mar 6, 2024 at 10:23 AM Martijn Visser 
wrote:

> Hi all,
>
> It is so exiting to see this KIP and know that it will greatly benefit
> Flink and other technologies.
>
>  +1
>
> Best regards,
>
> Martijn
>
> Op vr 1 dec 2023 om 11:07 schreef Artem Livshits
> 
>
> > Hello,
> >
> > This is a voting thread for
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
> > .
> >
> > The KIP proposes extending Kafka transaction support (that already uses
> 2PC
> > under the hood) to enable atomicity of dual writes to Kafka and an
> external
> > database, and helps to fix a long standing Flink issue.
> >
> > An example of code that uses the dual write recipe with JDBC and should
> > work for most SQL databases is here
> > https://github.com/apache/kafka/pull/14231.
> >
> > The FLIP for the sister fix in Flink is here
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710
> >
> > -Artem
> >
>


Re: [DISCUSS] Apache Kafka 3.6.2 release

2024-03-13 Thread Justine Olshan
Thanks Manikumar!
+1 from me

Justine

On Wed, Mar 13, 2024 at 8:52 AM Manikumar  wrote:

> Hi,
>
> I'd like to volunteer to be the release manager for a bug fix release of
> the 3.6 line.
> If there are no objections, I'll send out the release plan soon.
>
> Thanks,
> Manikumar
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-13 Thread Justine Olshan
Hey folks -- let me summarize my understanding

Artem requested we change the name of transaction version (TV) to transaction protocol version (TPV). If I do that, I will also probably change group coordinator version to group coordinator protocol version (GCPV).


Folks were concerned about upgrades/downgrades with different feature versions when using the feature tool and --release-version. I think we can address some concerns here:

   - don't allow a command that moves a version in the wrong direction
   (when upgrade actually downgrades a feature for example)
   - allow a command to describe all the feature versions for a given
   metadata version

Note that Colin expressed some concern with having the --release-version flag at all. I wonder if we can reach a compromise by having the upgrade command support only a "latest" option.


Jun proposed that this "latest" functionality be the default when no version or feature is specified.

Jose requested that we make --release-version and --feature mutually
exclusive for both the storage tool and the features tool. We should also
specify each feature in the storage tool as a separate flag.

Finally, Jun and Jose requested deprecation of the --metadata flag, but Colin felt we should keep it. I mentioned that the --feature flag does the same thing, but there has been no reply yet.

*So here are the updates I propose:*
1. Update the feature names as Artem described -- transaction protocol
version and group coordinator protocol version

2. Rename --features in the storage tool to --feature. When using this flag, we must repeat it once per feature in order to set all of them (see the sketch after this list).

3. The no-argument behavior on the storage and upgrade tools is to set all the features to the latest (stable) version.

4. Include an API to list the features for a given metadata version

5. I'm going back and forth on whether we should still support the --release-version flag. If we do, we need to include validation so that the upgrade command only upgrades features (and the downgrade command only downgrades them).
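
A rough sketch of how points 2 and 3 could look (feature names and exact syntax are still under discussion):

```bash
# Point 2: repeat --feature once per feature when formatting storage.
bin/kafka-storage.sh format --config server.properties \
  --cluster-id $(bin/kafka-storage.sh random-uuid) \
  --feature transaction.protocol.version=1 \
  --feature group.coordinator.protocol.version=1

# Point 3: with no feature arguments at all, both tools would default every
# feature to its latest stable version.
```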

Let me know if folks have any issues with the updates I proposed or think
there is something I missed.

Justine

On Fri, Mar 8, 2024 at 11:59 AM Artem Livshits
 wrote:

> Hi Justine,
>
> >  Are you suggesting it should be called "transaction protocol version" or
> "TPV"? I don't mind that, but just wanted to clarify if we want to include
> protocol or if simply "transaction version" is enough.
>
> My understanding is that "metadata version" is the version of metadata
> records, which is fairly straightforward.  "Transaction version" may be
> ambiguous.
>
> -Artem
>
> On Thu, Feb 29, 2024 at 3:39 PM Justine Olshan
> 
> wrote:
>
> > Hey folks,
> >
> > Thanks for the discussion. Let me try to cover everyone's comments.
> >
> > Artem --
> > I can add the examples you mentioned. As for naming, right now the
> feature
> > is called "transaction version" or "TV". Are you suggesting it should be
> > called "transaction protocol version" or "TPV"? I don't mind that, but
> just
> > wanted to clarify if we want to include protocol or if simply
> "transaction
> > version" is enough.
> >
> > Jun --
> >
> > 10.  *With **more features, would each of those be controlled by a
> separate
> > feature or*
> >
> > *multiple features. For example, is the new transaction record format*
> >
> > *controlled only by MV with TV having a dependency on MV or is it
> > controlled*
> >
> > *by both MV and TV.*
> >
> >
> > I think this will need to be decided on a case by case basis. There
> should
> > be a mechanism to set dependencies among features.
> > For transaction version specifically, I have no metadata version
> > dependencies besides requiring 3.3 to write the feature records and use
> the
> > feature tools. I would suspect all new features would have this
> > requirement.
> >
> >
> > 11. *Basically, if **--release-version is not used, the command will just
> > use the latest*
> >
> > *production version of every feature. Should we apply that logic to both*
> >
> > *tools?*
> >
> >
> > How would this work with the upgrade tool? I think we want a way to set a
> > new feature version for one feature and not touch any of the others.
> >
> >
> > *12. Should we remove --metadata METADATA from kafka-features? It does
> the*
> >
> > *same thing as --release-version.*
> >
> >
> > When I previously discussed with Colin McCabe offline about this tool, he
> > was strongly against deprecation or changing flags. I personally think it
> > could good
> >
> > to unify and not s

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-18 Thread Justine Olshan
Hey folks,

I didn't have a strong preference for requesting the versions via the tool
only or getting it from the server. Colin seemed to suggest it was "for the
best" to request from the server to make the tool lightweight.
I guess the argument against that is having to build and support another
API.

It also seems like there may be some confusion -- partially on my part. For the tools, I had a question about the feature upgrade tool. It seems we already support multiple features via the `--feature` flag, and we simply rely on the server side to throw errors now?

Justine

On Fri, Mar 15, 2024 at 3:38 PM José Armando García Sancio
 wrote:

> Hi Justine,
>
> Thanks for the update. Some comments below.
>
> On Wed, Mar 13, 2024 at 10:53 AM Justine Olshan
>  wrote:
> > 4. Include an API to list the features for a given metadata version
>
> I am not opposed to designing and implementing this. I am just
> wondering if this is strictly required?
>
> Would having auto-generated documentation address the issue of not
> knowing which feature versions are associated with a given release
> version?
>
> Does it need to be a Kafka API (RPC)? Or can this be strictly
> implemented in the tool? The latest tool, which is needed to parse
> release version to feature version, can print this mapping. Is this
> what you mean by API?
>
> > 5. I'm going back and forth on whether we should support the
> > --release-version flag still. If we do, we need to include validation so
> we
> > only upgrade on upgrade.
>
> I am not sure how this is different from supporting multiple --feature
> flags. The user can run an upgrade command where some of the features
> specified are greater than what the cluster has finalized and some of
> the features specified are less than what the cluster has finalized.
>
> In other words, the KRaft controller and kafka-storage tool are going
> to have to implement this validation even if you don't implement
> --release-version in the tools.
> Thanks,
> --
> -José
>


Re: [DISCUSS] KIP-932: Queues for Kafka

2024-03-18 Thread Justine Olshan
Hey Andrew,

I noticed you started the voting thread, but there seem to be a few questions that were not answered. One was Jun's about memory usage:
> How much additional heap memory will the server use? Do we need to cache
records in heap? If so, is the cache bounded?

Your response was
> This area needs more work. Using a share group surely gets the broker to
do
more manipulation of the data that it fetches than a regular consumer. I
want to minimise
this and need to research before providing a comprehensive answer. I
suspect zero-copy
is lost and that we do not cache records in heap. I will confirm later on.

I am also concerned about memory usage from my producer state management
work, so I want to make sure we have thought about it here -- not just in
the case Jun mentioned but generally.

Likewise, we have seen issues with large consumer groups and too many producer IDs. Are there any concerns with an analogous situation with too many share group members or share groups? Are there any ways we try to handle this or mitigate risks with respect to memory usage and client connections (with respect to rebalances, for example)?

Thanks,

Justine

On Fri, Mar 8, 2024 at 12:51 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Manikumar,
> Thanks for your queries.
>
> 1) Delivery count is added to the ConsumerRecord class so that a consumer
> can
> tell how often a record has been processed. I imagine that some
> applications might
> want to take different actions based on whether a record has previously
> failed. This
> enables richer error handling for bad records. In the future, I plan
> another KIP to
> enhance error handling.
>
> 2) It is only possible to delete a share group which is empty. As a
> result, all
> well-behaved consumers will have closed their share sessions. After a
> short while,
> the share-partition leaders will discard the share-partition information
> from memory.
>
> In the presence of badly behaved consumers, a consumer would have to
> pretend to
> be a member of a share group. There are several cases:
>
> a) If the share-partition leader still has in-memory state for the deleted
> share-group, it will
> continue to fetch records but it will be fenced by the share coordinator
> when it attempts to
> write its persistent state.
>
> b) If the share-partition leader does not have in-memory state, it will
> attempt to read it
> from the share coordinator and this will fail.
>
> 3) I will add metrics for the share coordinator today. This was an
> omission. Thanks for catching it.
>
> Thanks,
> Andrew
>
> > On 6 Mar 2024, at 17:53, Manikumar  wrote:
> >
> > Hi Andrew,
> >
> > Thanks for the updated KIP. Few queries below:
> >
> > 1. What is the use-case of deliveryCount in ShareFetchResponse?
> > 2. During delete share groups, Do we need to clean any in-memory state
> from
> > share-partition leaders?
> > 3. Any metrics for the share-coordinator?
> >
> > Thanks
> > Manikumar
> >
> > On Wed, Feb 21, 2024 at 12:11 AM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Manikumar,
> >> Thanks for your comments.
> >>
> >> 1. I believe that in general, there are not situations in which a
> dynamic
> >> config
> >> change is prevented because of the existence of a resource. So, if we
> >> prevented
> >> setting config `group.type=consumer` on resource G of GROUP type
> >> if there was a share group G in existence, it would be a bit weird.
> >>
> >> I wonder whether changing the config name to `new.group.type` would
> help.
> >> It’s
> >> ensuring the type of a new group created.
> >>
> >> 2. The behaviour for a DEAD share group is intended to be the same as a
> >> DEAD
> >> consumer group. The group cannot be “reused” again as such, but the
> group
> >> ID
> >> can be used by a new group.
> >>
> >> 3. Yes. AlterShareGroupOffsets will cause a new SHARE_CHECKPOINT.
> >>
> >> 4. In common with Admin.deleteConsumerGroups, the underlying Kafka RPC
> >> for Admin.deleteShareGroups is DeleteGroups. This is handled by the
> group
> >> coordinator and it does this by writing control records (a tombstone in
> >> this case).
> >> The KIP doesn’t say anything about this because it’s the same as
> consumer
> >> groups.
> >> Perhaps it would be sensible to add a GroupType to DeleteGroupsRequest
> so
> >> we can
> >> make sure we are deleting the correct type of group. The fact that there
> >> is not a specific
> >> RPC for DeleteShareGroups seems correct to me.
> >>
> >> 5. I prefer using “o.a.k.clients.consumer” because it’s already a public
> >> package and
> >> many of the classes and interfaces such as ConsumerRecord are in that
> >> package.
> >>
> >> I definitely need to add more information about how the Admin operations
> >> work.
> >> I will add a section to the KIP in the next version, later today. This
> >> will fill in details for
> >> your questions (3) and (4).
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 14 Feb 2024, at 18:04, Manikumar  wrote:
> >>>
> >>> Hi And

Re: [DISCUSS] KIP-890 Server Side Defense

2024-03-18 Thread Justine Olshan
Hey there -- small update to the KIP,

The KIP mentioned introducing ABORTABLE_ERROR and bumping TxnOffsetCommit
and Produce requests. I've changed the name in the KIP to
ABORTABLE_TRANSACTION and the corresponding exception
AbortableTransactionException to match the pattern we had for other errors.
I also mentioned bumping all 6 transactional APIs so we can future-proof/support the error on the client going forward. If a future change
wants to have an error scenario that requires us to abort the transaction,
we can rely on the 3.8+ clients to support it. We ran into issues finding
good/generic error codes that older clients could support while working on
this KIP, so this should help in the future.

The features discussion is still ongoing in KIP-1022. Will update again
here when that concludes.

Justine

On Tue, Feb 6, 2024 at 8:39 AM Justine Olshan  wrote:

> I don't think AddPartitions is a good example since we currently don't gate
> the version on TV or MV. (We only set a different flag depending on the TV)
>
> Even if we did want to gate it on TV, I think the idea is to move away
> from MV gating inter broker protocols. Ideally we can get to a state where
> MV is just used for metadata changes.
>
> I think some of this discussion might fit more with the feature version
> KIP, so I can try to open that up soon. Until we settle that, some of the
> work in KIP-890 is blocked.
>
> Justine
>
> On Mon, Feb 5, 2024 at 5:38 PM Jun Rao  wrote:
>
>> Hi, Justine,
>>
>> Thanks for the reply.
>>
>> Since AddPartitions is an inter broker request, will its version be gated
>> only by TV or other features like MV too? For example, if we need to
>> change
>> the protocol for AddPartitions for reasons other than txn verification in
>> the future, will the new version be gated by a new MV? If so, does
>> downgrading a TV imply potential downgrade of MV too?
>>
>> Jun
>>
>>
>>
>> On Mon, Feb 5, 2024 at 5:07 PM Justine Olshan
>> 
>> wrote:
>>
>> > One TV gates the flexible feature version (no rpcs involved, only the
>> > transactional records that should only be gated by TV)
>> > Another TV gates the ability to turn on kip-890 part 2. This would gate
>> the
>> > version of Produce and EndTxn (likely only used by transactions), and
>> > specifies a flag in AddPartitionsToTxn though the version is already
>> used
>> > without TV.
>> >
>> > I think the only concern is the Produce request and we could consider
>> work
>> > arounds similar to the AddPartitionsToTxn call.
>> >
>> > Justine
>> >
>> > On Mon, Feb 5, 2024 at 4:56 PM Jun Rao 
>> wrote:
>> >
>> > > Hi, Justine,
>> > >
>> > > Which RPC/record protocols will TV guard? Going forward, will those
>> > > RPC/record protocols only be guarded by TV and not by other features
>> like
>> > > MV?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Mon, Feb 5, 2024 at 2:41 PM Justine Olshan
>> > > > > >
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Sorry I think I misunderstood your question or answered incorrectly.
>> > The
>> > > TV
>> > > > version should ideally be fully independent from MV.
>> > > > At least for the changes I proposed, TV should not affect MV and MV
>> > > should
>> > > > not affect TV.
>> > > >
>> > > > I think if we downgrade TV, only that feature should downgrade.
>> > Likewise
>> > > > the same with MV. The finalizedFeatures should just reflect the
>> feature
>> > > > downgrade we made.
>> > > >
>> > > > I also plan to write a new KIP for managing the disk format and
>> upgrade
>> > > > tool as we will need new flags to support these features. That
>> should
>> > > help
>> > > > clarify some things.
>> > > >
>> > > > Justine
>> > > >
>> > > > On Mon, Feb 5, 2024 at 11:03 AM Jun Rao 
>> > > wrote:
>> > > >
>> > > > > Hi, Justine,
>> > > > >
>> > > > > Thanks for the reply.
>> > > > >
>> > > > > So, if we downgrade TV, we could implicitly downgrade another
>> feature
>> > > > (say
>> > > > > MV) that has dependency (e.g. RPC). What would we return for

Re: [DISCUSS] KIP-932: Queues for Kafka

2024-03-21 Thread Justine Olshan
Thanks Andrew,

That answers some of the questions I have.

With respect to the limits -- how will this be implemented? One issue we
saw with producers is "short-lived" producers that send one message and
disconnect.
Due to how expiration works for producer state, if we have a simple limit
for producer IDs, all new producers are blocked until the old ones expire.
Will we block new group members as well if we reach our limit?

In the consumer case, we have a heartbeat which can be used for expiration
behavior and avoid the headache we see on the producer side, but I can
imagine a case where misuse of the groups themselves could occur -- i.e.,
creating a short-lived share group that I believe will take some time to
expire. Do we have considerations for this case?

I also plan to re-read the read-committed section and may have further
questions there.

You also mentioned in the KIP that there are a few inter-broker hops to the
share coordinator, etc., for a given read operation of a partition. Are we
concerned about performance here? My work in transactions and trying to
optimize performance made me realize how expensive these inter-broker hops
can be.

Justine

On Thu, Mar 21, 2024 at 7:37 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Justine,
> Thanks for your comment. Sorry for the delay responding.
>
> It was not my intent to leave a query unanswered. I have modified the KIP
> as a result
> of the discussion and I think perhaps I didn’t neatly close off the email
> thread.
>
> In summary:
>
> * The share-partition leader does not maintain an explicit cache of
> records that it has
> fetched. When it fetches records, it does “shallow” iteration to look at
> the batch
> headers only so that it understands at least the base/last offset of the
> records within.
> It is left to the consumers to do the “deep” iteration of the records
> batches they fetch.
>
> * It may sometimes be necessary to re-fetch records for redelivery. This
> is essentially
> analogous to two consumer groups independently fetching the same records
> today.
> We will be relying on the efficiency of the page cache.
>
> * The zero-copy optimisation is not possible for records fetched for
> consumers in
> share groups. The KIP does not affect the use of the zero-copy
> optimisation for any
> scenarios in which it currently applies (this was not true in one earlier
> version of the KIP).
>
> * I share concern about memory usage, partly because of the producer state
> management
> area. To keep a lid on memory use, the number of share groups, the number
> of members
> of each share group, and the number of in-flight messages per partition in
> a share group
> are all limited. The aim is to get the in-memory state to be nice and
> compact, probably
> at the expense of throughput. Over time, I’m sure we’ll optimise and get a
> bit more
> headroom. Limits like these cannot easily be applied retrospectively, so
> the limits are
> there right at the start.
>
> * I have reworked the section on read-committed isolation level, and the
> complexity
> and memory usage of the approach is significantly improved.
>
> I hope this answers your question.
>
> Thanks,
> Andrew
>
>
> > On 18 Mar 2024, at 20:47, Justine Olshan 
> wrote:
> >
> > Hey Andrew,
> >
> > I noticed you started the voting thread, but there seems to be a few
> > questions that were not answered. One was Jun's about memory usage
> >> How much additional heap memory will the server use? Do we need to cache
> > records in heap? If so, is the cache bounded?
> >
> > Your response was
> >> This area needs more work. Using a share group surely gets the broker to
> > do
> > more manipulation of the data that it fetches than a regular consumer. I
> > want to minimise
> > this and need to research before providing a comprehensive answer. I
> > suspect zero-copy
> > is lost and that we do not cache records in heap. I will confirm later
> on.
> >
> > I am also concerned about memory usage from my producer state management
> > work, so I want to make sure we have thought about it here -- not just in
> > the case Jun mentioned but generally.
> >
> > Likewise, we have seen issues with large consumer groups and too many
> > producer IDs. Are there any concerns with an analogous situation with too
> > many share group members or share groups? Are there any ways we try to
> > handle this or mitigate risks with respect to memory usage and client
> > connections (wrt rebalances, for example)?
> >
> > Thanks,
> >
> > Justine
> >
> > On Fri, Mar 8, 2024 at 12:51 AM Andrew Schofield <
> > andr

Re: [DISCUSS] KIP-1021: Allow to get last stable offset (LSO) in kafka-get-offsets.sh

2024-03-21 Thread Justine Olshan
Hey Ahmed,

Echoing Andrew's comments to clean up the public interfaces, it was unclear
to me if this new flag applies to all the options or just the "latest"
option.
Clarifying that and showing a few examples could help.
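
For instance, it would help to spell out whether invocations along these
lines (illustrative, reusing the "…" placeholder and --isolation syntax
already shown in this thread) are both valid, or whether the second is
rejected:

bin/kafka-get-offsets.sh … --time latest --isolation committed
bin/kafka-get-offsets.sh … --time -2 --isolation committed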

Thanks,
Justine

On Thu, Mar 21, 2024 at 9:43 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi,
> Glad to hear you’re better.
>
> Re-reading the KIP, I think the changes to the public interfaces are
> probably only the addition
> of the new `--isolation` flag on the `kafka-get-offsets.sh` tool. The
> `KafkaAdminClient.listOffsets`
> parameters already have what you need I think (OffsetSpec.LATEST and
> ListOffsetOptions.isolationLevel).
>
> If you tidy up the `Public Interfaces` section with the revised syntax for
> the tool, I think the
> committers will be able to see more easily what the changes are and it
> would then be ready to vote
> in my opinion.
>
> Thanks,
> Andrew
>
> > On 20 Mar 2024, at 12:56, Ahmed Sobeh 
> wrote:
> >
> > Hi Andrew,
> >
> > Apologies for disappearing, I had to undergo knee surgery, all good now!
> >
> > I adjusted the KIP as you suggested, do you think it's ready for the
> voting
> > stage?
> >
> > On Wed, Mar 6, 2024 at 4:02 PM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Ahmed,
> >> Looks good to me with one remaining comment.
> >>
> >> You’ve used:
> >>
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation -committed
> >> bin/kafka-get-offsets.sh  … --time latest --isolation -committed
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation -uncommitted
> >> bin/kafka-get-offsets.sh  … --time latest --isolation -uncommitted
> >>
> >> I find the flags starting with a single - to be a bit unusual and
> doesn’t
> >> match the --time options such as “latest”. I suggest:
> >>
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation committed
> >> bin/kafka-get-offsets.sh  … --time latest --isolation committed
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation uncommitted
> >> bin/kafka-get-offsets.sh  … --time latest --isolation uncommitted
> >>
> >> Or even:
> >>
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation read-committed
> >> bin/kafka-get-offsets.sh  … --time latest --isolation read-committed
> >> bin/kafka-get-offsets.sh  … --time -1 --isolation read-uncommitted
> >> bin/kafka-get-offsets.sh  … --time latest --isolation read-uncommitted
> >>
> >> Apart from that nit, looks good to me.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>
> >>> On 5 Mar 2024, at 16:35, Ahmed Sobeh 
> >> wrote:
> >>>
> >>> I adjusted the KIP according to what we agreed on, let me know if you
> >> have
> >>> any comments!
> >>>
> >>> Best,
> >>> Ahmed
> >>>
> >>> On Thu, Feb 29, 2024 at 1:44 AM Luke Chen  wrote:
> >>>
>  Hi Ahmed,
> 
>  Thanks for the KIP!
> 
>  Comments:
>  1. If we all agree with the suggestion from Andrew, you could update
> the
>  KIP.
> 
>  Otherwise, LGTM!
> 
> 
>  Thanks.
>  Luke
> 
>  On Thu, Feb 29, 2024 at 1:32 AM Andrew Schofield <
>  andrew_schofield_j...@outlook.com> wrote:
> 
> > Hi Ahmed,
> > Could do. Personally, I find the existing “--time -1” totally horrid
> > anyway, which was why
> > I suggested an alternative. I think your suggestion of a flag for
> > isolation level is much
> > better than -6.
> >
> > Maybe I should put in a KIP which adds:
> > --latest (as a synonym for --time -1)
> > --earliest (as a synonym for --time -2)
> > --max-timestamp (as a synonym for --time -3)
> >
> > That’s really what I would prefer. If the user has a timestamp, use
> > `--time`. If they want a
> > specific special offset, use a separate flag.
> >
> > Thanks,
> > Andrew
> >
> >> On 28 Feb 2024, at 09:22, Ahmed Sobeh  >
> > wrote:
> >>
> >> Hi Andrew,
> >>
> >> Thanks for the hint! That sounds reasonable, do you think adding a
> >> conditional argument, something like "--time -1 --isolation
> >> -committed"
> > and
> >> "--time -1 --isolation -uncommitted" would make sense to keep the
> >> consistency of getting the offset by time? or do you think having a
> > special
> >> argument for this case is better?
> >>
> >> On Tue, Feb 27, 2024 at 2:19 PM Andrew Schofield <
> >> andrew_schofield_j...@outlook.com> wrote:
> >>
> >>> Hi Ahmed,
> >>> Thanks for the KIP.  It looks like a useful addition.
> >>>
> >>> The use of negative timestamps, and in particular letting the user
> >> use
> >>> `--time -1` or the equivalent `--time latest`
> >>> is a bit peculiar in this tool already. The negative timestamps
> come
> > from
> >>> org.apache.kafka.common.requests.ListOffsetsRequest,
> >>> but you’re not actually adding another value to that. As a result,
> I
> >>> really wouldn’t recommend using -6 for the new
> >>> flag. LSO is really a variant of -1 with read_committ

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-25 Thread Justine Olshan
Hey all,

I've updated the KIP with the changes I mentioned earlier. I have not yet
removed the --feature-version flag from the upgrade tool.

Please take a look at the API to get the versions for a given
metadata version. Right now I'm using the ApiVersions request and specifying
a metadata version. The supported versions are then supplied in the
ApiVersions response.
There were tradeoffs with this approach. It is a pretty minimal change, but
reusing the API means that it could be confusing (i.e., the ApiKeys will be
supplied in the response but are not needed). I considered making a whole new
API, but that didn't seem necessary for this use.
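
To sketch the shape of that change (purely an illustration -- the field name,
versions, and tag below are placeholders I made up, not the schema in the
KIP), the idea is an optional tagged field on ApiVersionsRequest along these
lines:

  // Illustrative only; not the actual ApiVersionsRequest schema change.
  { "name": "MetadataVersion", "type": "int16", "versions": "4+",
    "taggedVersions": "4+", "tag": 1, "default": "-1",
    "about": "Report the feature versions this metadata version maps to." }

The response would then carry the mapped versions in its existing
feature-version fields, which is where the "ApiKeys are supplied but not
needed" awkwardness comes from.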

Please try to let me know in the next few days; I hope to start a
vote soon.

Thanks,
Justine

On Mon, Mar 18, 2024 at 9:17 AM Justine Olshan  wrote:

> Hey folks,
>
> I didn't have a strong preference for requesting the versions via the tool
> only or getting it from the server. Colin seemed to suggest it was "for the
> best" to request from the server to make the tool lightweight.
> I guess the argument against that is having to build and support another
> API.
>
> It also seems like there may be some confusion -- partially on my part.
> For the tools, I had a question about the feature upgrade tool. So it seems
> like we already support multiple features via the `--feature` flag, we
> simply rely on the server side to throw errors now?
>
> Justine
>
> On Fri, Mar 15, 2024 at 3:38 PM José Armando García Sancio
>  wrote:
>
>> Hi Justine,
>>
>> Thanks for the update. Some comments below.
>>
>> On Wed, Mar 13, 2024 at 10:53 AM Justine Olshan
>>  wrote:
>> > 4. Include an API to list the features for a given metadata version
>>
>> I am not opposed to designing and implementing this. I am just
>> wondering if this is strictly required?
>>
>> Would having auto-generated documentation address the issue of not
>> knowing which feature versions are associated with a given release
>> version?
>>
>> Does it need to be a Kafka API (RPC)? Or can this be strictly
>> implemented in the tool? The latest tool, which is needed to parse
>> release version to feature version, can print this mapping. Is this
>> what you mean by API?
>>
>> > 5. I'm going back and forth on whether we should support the
>> > --release-version flag still. If we do, we need to include validation
>> so we
>> > only upgrade on upgrade.
>>
>> I am not sure how this is different from supporting multiple --feature
>> flags. The user can run an upgrade command where some of the features
>> specified are greater than what the cluster has finalized and some of
>> the features specified are less than what the cluster has finalized.
>>
>> In other words, the KRaft controller and kafka-storage tool are going
>> to have to implement this validation even if you don't implement
>> --release-version in the tools.
>> Thanks,
>> --
>> -José
>>
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-25 Thread Justine Olshan
Hi Jose,

Sorry for the typos. I think you figured out what I meant.

I can make a new API. There is a risk of creating a ton of very similar
APIs though. Even the ApiVersions API is confusing with its supported and
finalized features fields. I wonder if there is a middle ground here.
I could have the storage tool and features tool rely on the feature mapping
built into the tool and not query the server, but Colin seemed to be against
that:
> Anyway your idea of putting the info on the server side is probably for
> the best.

--release-version can work with the downgrade tool too. I just didn't think
I needed to directly spell that out. I can though.

I wish we weren't splitting this conversation between the two threads. :( I
tried to get this out so it could cover all KIPs. Having this on separate
threads makes reaching consensus even harder than it already is.
From what I can tell, your KIP's text matches this. Is the expectation that
the added flags will be done as part of this KIP or your KIP? I don't
really have a strong opinion about --release-version, so maybe it should
have been part of your KIP all along.

> To me version 0 doesn't necessarily mean that the feature is not
> enabled. For example, for kraft.version, version 0 means the protocol
> prior to KIP-853. Version 0 is the currently implemented version of
> the KRaft protocol.

This is what Colin told me in our previous discussions. I don't really feel
too strongly about the semantics here.

So it seems like the only real undecided item here is whether we should
have this new API query the server or rely on the information built into
the tool.
I will update the KIP to include the CLI command to get the information.

Justine

On Mon, Mar 25, 2024 at 4:19 PM José Armando García Sancio
 wrote:

> Hi Justine,
>
> Thanks for the update. See my comments below.
>
> On Mon, Mar 25, 2024 at 2:51 PM Justine Olshan
>  wrote:
> > I've updated the KIP with the changes I mentioned earlier. I have not yet
> > removed the --feature-version flag from the upgrade tool.
>
> What's the "--feature-version" flag? This is the first time I see it
> mentioned and I don't see it in the KIP. Did you mean
> "--release-version"?
>
> > Please take a look at the API to get the versions for a given
> > metadata version. Right now I'm using ApiVersions request and specifying
> a
> > metadata version. The supported versions are then supplied in the
> > ApiVersions response.
> > There were tradeoffs with this approach. It is a pretty minimal change,
> but
> > reusing the API means that it could be confusing (ie, the ApiKeys will be
> > supplied in the response but not needed.) I considered making a whole new
> > API, but that didn't seem necessary for this use.
>
> I agree that this is extremely confusing and we shouldn't overload the
> ApiVersions RPC to return this information. The KIP doesn't mention
> how it is going to use this API. Do you need to update the Admin
> client to include this information?
>
> Having said this, as you mentioned in the KIP the kafka-storage tool
> needs this information and that tool cannot assume that there is a
> running server it can query (send an RPC). Can the kafka-features use
> the same mechanism used by kafka-storage without calling into a
> broker?
>
> re: "It will work just like the storage tool and upgrade all the
> features to a version"
>
> Does this mean that --release-version cannot be used with
> "kafka-features downgrade"?
>
> re: Consistency with KIP-853
>
> Jun and I have been having a similar conversation in the discussion
> thread for KIP-853. From what I can tell both changes are compatible.
> Do you mind taking a look at these two sections and confirming that
> they don't contradict your KIP?
> 1.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-kafka-storage
> 2.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-kafka-features
>
> re: nit "For MVs that existed before these features, we map the new
> features to version 0 (no feature enabled)."
>
> To me version 0 doesn't necessarily mean that the feature is not
> enabled. For example, for kraft.version, version 0 means the protocol
> prior to KIP-853. Version 0 is the currently implemented version of
> the KRaft protocol.
>
> Thanks,
> --
> -José
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-25 Thread Justine Olshan
I've updated the KIP to include this CLI. For now, I've moved the new
server-side API to the rejected alternatives.

It seems like we can keep the mapping in the tool. Given that we also need
to do the same validation for using multiple --feature flags (the request
will look the same to the server), we can have the --release-version flag.
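
To make that concrete, the intent (using the KIP's proposed syntax, which may
still change) is that these two invocations end up sending the same request
to the server, where X and Y stand for whatever levels the tool maps 3.8-IV0
to:

bin/kafka-features.sh upgrade --release-version 3.8-IV0
bin/kafka-features.sh upgrade --feature metadata.version=X --feature
transaction.protocol.version=Y ...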

I think that closes the main discussions. Please let me know if there is
any further discussion I missed.

Justine

On Mon, Mar 25, 2024 at 4:44 PM Justine Olshan  wrote:

> Hi Jose,
>
> Sorry for the typos. I think you figured out what I meant.
>
> I can make a new API. There is a risk of creating a ton of very similar
> APIs though. Even the ApiVersions api is confusing with its supported and
> finalized features fields. I wonder if there is a middle ground here.
> I can have the storage tool and features tool rely on the feature and not
> query the server. Colin seemed to be against that.
> > Anyway your idea of putting the info on the server side is probably for
> the best.
>
> --release-version can work with the downgrade tool too. I just didn't
> think I needed to directly spell that out. I can though.
>
> I wish we weren't splitting this conversation among the two threads. :( I
> tried to get this out so it could cover all KIPs. Having this on separate
> threads makes getting this consensus even harder than it already is.
> From what I can tell your KIP's text matches this. Is the expectation that
> the added flags will be done as part of this KIP or your KIP? I don't
> really have a strong opinion about --release-version so maybe it should
> have been part of your KIP all along.
>
> > To me version 0 doesn't necessarily mean that the feature is not
> enabled. For example, for kraft.version, version 0 means the protocol
> prior to KIP-853. Version 0 is the currently implemented version of
> the KRaft protocol.
>
> This is what Colin told me in our previous discussions. I don't really
> feel too strongly about the semantics here.
>
> So it seems like the only real undecided item here is whether we should
> have this new api query the server or rely on the information being built
> in the tool.
> I will update the KIP to include the CLI command to get the information.
>
> Justine
>
> On Mon, Mar 25, 2024 at 4:19 PM José Armando García Sancio
>  wrote:
>
>> Hi Justine,
>>
>> Thanks for the update. See my comments below.
>>
>> On Mon, Mar 25, 2024 at 2:51 PM Justine Olshan
>>  wrote:
>> > I've updated the KIP with the changes I mentioned earlier. I have not
>> yet
>> > removed the --feature-version flag from the upgrade tool.
>>
>> What's the "--feature-version" flag? This is the first time I see it
>> mentioned and I don't see it in the KIP. Did you mean
>> "--release-version"?
>>
>> > Please take a look at the API to get the versions for a given
>> > metadata version. Right now I'm using ApiVersions request and
>> specifying a
>> > metadata version. The supported versions are then supplied in the
>> > ApiVersions response.
>> > There were tradeoffs with this approach. It is a pretty minimal change,
>> but
>> > reusing the API means that it could be confusing (ie, the ApiKeys will
>> be
>> > supplied in the response but not needed.) I considered making a whole
>> new
>> > API, but that didn't seem necessary for this use.
>>
>> I agree that this is extremely confusing and we shouldn't overload the
>> ApiVersions RPC to return this information. The KIP doesn't mention
>> how it is going to use this API. Do you need to update the Admin
>> client to include this information?
>>
>> Having said this, as you mentioned in the KIP the kafka-storage tool
>> needs this information and that tool cannot assume that there is a
>> running server it can query (send an RPC). Can the kafka-features use
>> the same mechanism used by kafka-storage without calling into a
>> broker?
>>
>> re: "It will work just like the storage tool and upgrade all the
>> features to a version"
>>
>> Does this mean that --release-version cannot be used with
>> "kafka-features downgrade"?
>>
>> re: Consistency with KIP-853
>>
>> Jun and I have been having a similar conversation in the discussion
>> thread for KIP-853. From what I can tell both changes are compatible.
>> Do you mind taking a look at these two sections and confirming that
>> they don't contradict your KIP?
>> 1.
>> https://cwiki.apache.org/confluence/display/KA

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-25 Thread Justine Olshan
Hi Jun,

I apologize for the typos. I thought I got rid of all the non-protocol
versions. It is transaction.protocol.version, as per my previous email. I
will fix the KIP.

The group coordinator version is used to upgrade to the new group
coordinator protocol. (KIP-848)
I don't have all the context there.

I would prefer not to add the metadata flag to the storage tool, as it is
not necessary there. The reason it is not removed from the features tool is
purely for backwards compatibility. Colin had strong feelings about not
removing any flags.

Justine

On Mon, Mar 25, 2024 at 5:01 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP. A few more comments.
>
> 10. Could we describe what RPCs group.coordinator.version controls?
>
> 12. --metadata METADATA is not removed from kafka-features. Do we have a
> justification for this? If so, should we add that to kafka-storage to be
> consistent?
>
> 14. The KIP has both transaction.protocol.version vs transaction.version.
> What's the correct feature name?
>
> Jun
>
> On Mon, Mar 25, 2024 at 4:54 PM Justine Olshan
> 
> wrote:
>
> > I've updated the KIP to include this CLI. For now, I've moved the new api
> > to the server as a rejected alternative.
> >
> > It seems like we can keep the mapping in the tool. Given that we also
> need
> > to do the same validation for using multiple --feature flags (the request
> > will look the same to the server), we can have the --release-version
> flag.
> >
> > I think that closes the main discussions. Please let me know if there is
> > any further discussion I missed.
> >
> > Justine
> >
> > On Mon, Mar 25, 2024 at 4:44 PM Justine Olshan 
> > wrote:
> >
> > > Hi Jose,
> > >
> > > Sorry for the typos. I think you figured out what I meant.
> > >
> > > I can make a new API. There is a risk of creating a ton of very similar
> > > APIs though. Even the ApiVersions api is confusing with its supported
> and
> > > finalized features fields. I wonder if there is a middle ground here.
> > > I can have the storage tool and features tool rely on the feature and
> not
> > > query the server. Colin seemed to be against that.
> > > > Anyway your idea of putting the info on the server side is probably
> for
> > > the best.
> > >
> > > --release-version can work with the downgrade tool too. I just didn't
> > > think I needed to directly spell that out. I can though.
> > >
> > > I wish we weren't splitting this conversation among the two threads.
> :( I
> > > tried to get this out so it could cover all KIPs. Having this on
> separate
> > > threads makes getting this consensus even harder than it already is.
> > > From what I can tell your KIP's text matches this. Is the expectation
> > that
> > > the added flags will be done as part of this KIP or your KIP? I don't
> > > really have a strong opinion about --release-version so maybe it should
> > > have been part of your KIP all along.
> > >
> > > > To me version 0 doesn't necessarily mean that the feature is not
> > > enabled. For example, for kraft.version, version 0 means the protocol
> > > prior to KIP-853. Version 0 is the currently implemented version of
> > > the KRaft protocol.
> > >
> > > This is what Colin told me in our previous discussions. I don't really
> > > feel too strongly about the semantics here.
> > >
> > > So it seems like the only real undecided item here is whether we should
> > > have this new api query the server or rely on the information being
> built
> > > in the tool.
> > > I will update the KIP to include the CLI command to get the
> information.
> > >
> > > Justine
> > >
> > > On Mon, Mar 25, 2024 at 4:19 PM José Armando García Sancio
> > >  wrote:
> > >
> > >> Hi Justine,
> > >>
> > >> Thanks for the update. See my comments below.
> > >>
> > >> On Mon, Mar 25, 2024 at 2:51 PM Justine Olshan
> > >>  wrote:
> > >> > I've updated the KIP with the changes I mentioned earlier. I have
> not
> > >> yet
> > >> > removed the --feature-version flag from the upgrade tool.
> > >>
> > >> What's the "--feature-version" flag? This is the first time I see it
> > >> mentioned and I don't see it in the KIP. Did you mean
> > >> "--release-version"?
> > >>
> > >> > Please take a look at the A

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-26 Thread Justine Olshan
Hi all,

I've added the notes about requiring 3.3-IV0 and that the tool/RPC will
fail if the metadata version is too low.

I will wait for Colin's response on the deprecation. I am not opposed to
deprecating it but want everyone to agree.

Justine

On Tue, Mar 26, 2024 at 12:26 PM José Armando García Sancio
 wrote:

> Hi Justine,
>
> On Mon, Mar 25, 2024 at 5:09 PM Justine Olshan
>  wrote:
> > The reason it is not removed is purely for backwards
> > compatibility. Colin had strong feelings about not removing any flags.
>
> We are not saying that we should remove that flag. That would break
> backward compatibility of 3.8 with 3.7. We are suggesting to deprecate
> the flag in the next release.
>
> Thanks,
> -José
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-27 Thread Justine Olshan
Hi Jun,

We could expose them in a tool. I'm wondering, are you thinking something
like a command that lists the dependencies for a given feature + version?

Something like:
kafka-feature dependencies --feature transaction.protocol.version=2
> transaction.protocol.version=2 requires metadata.version=4 (listing any
> other version dependencies)

Justine

On Wed, Mar 27, 2024 at 10:28 AM Jun Rao  wrote:

> Hi, Colin,
>
> Thanks for the comments. It's fine if we want to keep the --metadata flag
> if it's useful. Then we should add the same flag for kafka-storage for
> consistency, right?
>
> Hi, Justine,
>
> Thanks for the reply.
>
> How will a user know about the dependencies among features? Should we
> expose them in a tool?
>
> Thanks,
>
> Jun
>
> On Tue, Mar 26, 2024 at 4:33 PM Colin McCabe  wrote:
>
> > Hi all,
> >
> > Thanks for the discussion. Let me respond to a few questions I saw in the
> > thread (hope I didn't miss any!)
> >
> > 
> >
> > Feature level 0 is "special" because it effectively means that the
> feature
> > hasn't been set up. So I could create a foo feature, a bar feature, and a
> > baz feature tomorrow and correctly say that your cluster is on feature
> > level 0 for foo, bar, and baz. Because feature level 0 is isomorphic with
> > "no feature level set."
> >
> > Obviously you can have whatever semantics you want for feature level 0.
> In
> > the case of Jose's Raft changes, feature level 0 may end up being the
> > current state of the world. That's fine.
> >
> > 0 being isomorphic with "not set" simplifies the code a lot because we
> > don't need tons of special cases for "feature not set" versus "feature
> set
> > to 0". Effectively we can use short integers everywhere, and not
> > Optional. Does that make sense?
> >
> > 
> >
> > The --metadata flag doesn't quite do the same thing as the --feature
> flag.
> > The --metadata flag takes a string like 3.7-IV0, whereas the --feature
> flag
> > takes an integer like "17".
> >
> > It's true that in the current kafka-features.sh, you can shun the
> > --metadata flag, and only use --feature. The --metadata flag is a
> > convenience. But... conveniences are good. Better than inconveniences.
> So I
> > don't think it makes sense to deprecate --metadata, since its
> functionality
> > is not provided by anything else.
> >
> > 
> >
> > As I said earlier, the proposed semantics for --release-version aren't
> > actually possible given the current RPCs on the server side. The problem
> is
> > that UpdateFeaturesRequest needs upgradeType to be set to one of:
> >
> > 1. FeatureUpdate.UpgradeType.UPGRADE
> > 2. FeatureUpdate.UpgradeType.SAFE_DOWNGRADE
> > 3. FeatureUpdate.UpgradeType.UNSAFE_DOWNGRADE
> >
> > If it's set to #1, levels can only go up; if it's set to 2 or 3, levels
> > can only go down. (I forget what happens if the levels are the same...
> you
> > can check)
> >
> > So how does the command invoking --release-version know whether it's
> > upgrading or downgrading? I can't see any way for it to know, and plus it
> > may end up doing more than one of these if some features need to go down
> > and others up. "Making everything the same as it was in 3.7-IV0" may end
> up
> > down-levelling some features, and up-levelling others. There isn't any
> way
> > to do this atomically in a single RPC today.
> >
> > 
> >
> > I don't find the proposed semantics for --release-version to be very
> > useful.
> >
> > In the clusters I help to administer, I don't like changing a bunch of
> > things all at once. I'd much rather make one change at a time and see
> what
> > happens, then move on to the next change.
> >
> > Earlier I proposed just having a subcommand in kafka-features.sh that
> > compared the currently set feature flags against the "default" one for
> the
> > provided Kafka release / MV. I think this would be more useful than the
> > "shotgun approach" of making a bunch of changes together. Just DISPLAY
> what
> > would need to be changed to "make everything the same as it was in
> 3.7-IV0"
> > but then let the admin decide what they want to do (or not do). You could
> > even display the commands that would need to be run, if you like. But let
> > them decide whethe

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-27 Thread Justine Olshan
Hey Jun,

So just including the dependencies for the currently set features? Along
with the supported min, max, and finalized versions?

Feature: transaction.protocol.version SupportedMinVersion: 0
SupportedMaxVersion: 2 FinalizedVersionLevel: 1 Epoch: 3 Dependencies:
metadata.version=4

On Wed, Mar 27, 2024 at 11:14 AM Jun Rao  wrote:

> Hi, Justine,
>
> Yes, something like that. We could also extend "kafka-feature describe" by
> adding the dependency to every feature in the output.
>
> Thanks,
>
> Jun
>
> On Wed, Mar 27, 2024 at 10:39 AM Justine Olshan
>  wrote:
>
> > Hi Jun,
> >
> > We could expose them in a tool. I'm wondering, are you thinking something
> > like a command that lists the dependencies for a given feature + version?
> >
> > Something like:
> > kafka-feature dependencies --feature transaction.protocol.version=2
> > > transaction.protocol.verison=2 requires metadata.version=4 (listing any
> > other version dependencies)
> >
> > Justine
> >
> > On Wed, Mar 27, 2024 at 10:28 AM Jun Rao 
> wrote:
> >
> > > Hi, Colin,
> > >
> > > Thanks for the comments. It's fine if we want to keep the --metadata
> flag
> > > if it's useful. Then we should add the same flag for kafka-storage for
> > > consistency, right?
> > >
> > > Hi, Justine,
> > >
> > > Thanks for the reply.
> > >
> > > How will a user know about the dependencies among features? Should we
> > > expose them in a tool?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Mar 26, 2024 at 4:33 PM Colin McCabe 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for the discussion. Let me respond to a few questions I saw in
> > the
> > > > thread (hope I didn't miss any!)
> > > >
> > > > 
> > > >
> > > > Feature level 0 is "special" because it effectively means that the
> > > feature
> > > > hasn't been set up. So I could create a foo feature, a bar feature,
> > and a
> > > > baz feature tomorrow and correctly say that your cluster is on
> feature
> > > > level 0 for foo, bar, and baz. Because feature level 0 is isomorphic
> > with
> > > > "no feature level set."
> > > >
> > > > Obviously you can have whatever semantics you want for feature level
> 0.
> > > In
> > > > the case of Jose's Raft changes, feature level 0 may end up being the
> > > > current state of the world. That's fine.
> > > >
> > > > 0 being isomorphic with "not set" simplifies the code a lot because we
> > > > don't need tons of special cases for "feature not set" versus
> "feature
> > > set
> > > > to 0". Effectively we can use short integers everywhere, and not
> > > > Optional. Does that make sense?
> > > >
> > > > 
> > > >
> > > > The --metadata flag doesn't quite do the same thing as the --feature
> > > flag.
> > > > The --metadata flag takes a string like 3.7-IV0, whereas the
> --feature
> > > flag
> > > > takes an integer like "17".
> > > >
> > > > It's true that in the current kafka-features.sh, you can shun the
> > > > --metadata flag, and only use --feature. The --metadata flag is a
> > > > convenience. But... conveniences are good. Better than
> inconveniences.
> > > So I
> > > > don't think it makes sense to deprecate --metadata, since its
> > > functionality
> > > > is not provided by anything else.
> > > >
> > > > 
> > > >
> > > > As I said earlier, the proposed semantics for --release-version
> aren't
> > > > actually possible given the current RPCs on the server side. The
> > problem
> > > is
> > > > that UpdateFeaturesRequest needs upgradeType to be set to one of:
> > > >
> > > > 1. FeatureUpdate.UpgradeType.UPGRADE
> > > > 2. FeatureUpdate.UpgradeType.SAFE_DOWNGRADE
> > > > 3. FeatureUpdate.UpgradeType.UNSAFE_DOWNGRADE
> > > >
> > > > If it's set to #1, levels can only go up; if it's set to 2 or 3,
> levels
> > > > can only go down. (I forget what happens if the levels are the
> same...
> > > you

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-27 Thread Justine Olshan
Hey Jun,

Right, sorry -- this would be one row in an output listing all the various
features (transaction.protocol.version, group.coordinator.version, etc.)
currently set on the cluster.

If we want to know the dependencies for each version of a given feature
(e.g., transaction.protocol.version 0, 1, 2, 3) we can use the other command
for versions that are not yet set.

If this makes sense I will update the KIP.

I think one remaining open area was with respect to `--release-version`.
I didn't follow Colin's point:
> So how does the command invoking --release-version know whether it's
> upgrading or downgrading?
The feature command includes the upgrade or downgrade subcommand along with
the --release-version flag. If some features are not moving in the
direction specified (upgrade or downgrade), the command will fail -- perhaps
with an error listing which features were going in the wrong direction.
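
In other words, with the KIP's proposed syntax (illustrative; the mapped
levels come from the tool):

bin/kafka-features.sh upgrade --release-version 3.8-IV0 // fails, listing
the offending features, if any finalized feature would have to move down
bin/kafka-features.sh downgrade --release-version 3.7-IV0 // fails, listing
the offending features, if any finalized feature would have to move up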

Thanks,
Justine

On Wed, Mar 27, 2024 at 11:52 AM Jun Rao  wrote:

> Hi, Justine,
>
> It seems that we need to specify the dependencies for each feature version?
>
> Thanks,
>
> Jun
>
> On Wed, Mar 27, 2024 at 11:28 AM Justine Olshan
>  wrote:
>
> > Hey Jun,
> >
> > So just including the dependencies for the currently set features? Along
> > with the supported min, max, and finalized versions?
> >
> > Feature: transaction.protocol.version SupportedMinVersion: 0
> > SupportedMaxVersion: 2 FinalizedVersionLevel: 1 Epoch: 3 Dependencies:
> > metadata.version=4
> >
> > On Wed, Mar 27, 2024 at 11:14 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Yes, something like that. We could also extend "kafka-feature describe"
> > by
> > > adding the dependency to every feature in the output.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Mar 27, 2024 at 10:39 AM Justine Olshan
> > >  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > We could expose them in a tool. I'm wondering, are you thinking
> > something
> > > > like a command that lists the dependencies for a given feature +
> > version?
> > > >
> > > > Something like:
> > > > kafka-feature dependencies --feature transaction.protocol.version=2
> > > > > transaction.protocol.verison=2 requires metadata.version=4 (listing
> > any
> > > > other version dependencies)
> > > >
> > > > Justine
> > > >
> > > > On Wed, Mar 27, 2024 at 10:28 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Colin,
> > > > >
> > > > > Thanks for the comments. It's fine if we want to keep the
> --metadata
> > > flag
> > > > > if it's useful. Then we should add the same flag for kafka-storage
> > for
> > > > > consistency, right?
> > > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > How will a user know about the dependencies among features? Should
> we
> > > > > expose them in a tool?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Mar 26, 2024 at 4:33 PM Colin McCabe 
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Thanks for the discussion. Let me respond to a few questions I
> saw
> > in
> > > > the
> > > > > > thread (hope I didn't miss any!)
> > > > > >
> > > > > > 
> > > > > >
> > > > > > Feature level 0 is "special" because it effectively means that
> the
> > > > > feature
> > > > > > hasn't been set up. So I could create a foo feature, a bar
> feature,
> > > > and a
> > > > > > baz feature tomorrow and correctly say that your cluster is on
> > > feature
> > > > > > level 0 for foo, bar, and baz. Because feature level 0 is
> > isomorphic
> > > > with
> > > > > > "no feature level set."
> > > > > >
> > > > > > Obviously you can have whatever semantics you want for feature
> > level
> > > 0.
> > > > > In
> > > > > > the case of Jose's Raft changes, feature level 0 may end up being
> > the
> > > > > > current state of the 

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-27 Thread Justine Olshan
I had already included this example
bin/kafka-features.sh downgrade --feature metadata.version=16
--transaction.protocol.version=2 // Throws error if metadata version is <
16, and this would be an upgrade
But I have updated the KIP to explicitly say the text you mentioned.

Justine

On Wed, Mar 27, 2024 at 1:41 PM José Armando García Sancio
 wrote:

> Hi Justine,
>
> See my comment below.
>
> On Wed, Mar 27, 2024 at 1:31 PM Justine Olshan
>  wrote:
> > The feature command includes the upgrade or downgrade command along with
> > the --release-version flag. If some features are not moving in the
> > direction mentioned (upgrade or downgrade) the command will fail --
> perhaps
> > with an error of which features were going in the wrong direction.
>
> How about updating the KIP to show and document this behavior?
>
> Thanks,
> --
> -José
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-27 Thread Justine Olshan
Ok. I can remove the info from the describe output.

The dependencies command is needed for the storage tool because we want to
make sure the desired versions we are setting will be valid. Version mapping
should be in both tools since we have --release-version for both tools.
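
For example (illustrative invocations; the release version is arbitrary):

bin/kafka-features.sh version-mapping --release-version 3.8-IV0
bin/kafka-storage.sh version-mapping --release-version 3.8-IV0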

I was considering changing the IV strings, but I wasn't sure if there would
be some disagreement with the decision. Not sure if that breaks
compatibility etc. Happy to hear everyone's thoughts.

Justine

On Wed, Mar 27, 2024 at 3:36 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> Having "kafka-feature dependencies" seems enough to me. We don't need to
> include the dependencies in the output of "kafka-feature describe".
>
> We only support "dependencies" in kafka-feature, not kafka-storage. We
> probably should do the same for "version-mapping".
>
> bin/kafka-features.sh downgrade --feature metadata.version=16
> --transaction.protocol.version=2
> We need to add the --feature flag for the second feature, right?
>
> In "kafka-features.sh describe", we only show the IV string for
> metadata.version. Should we also show the level number?
>
> Thanks,
>
> Jun
>
> On Wed, Mar 27, 2024 at 1:52 PM Justine Olshan
> 
> wrote:
>
> > I had already included this example
> > bin/kafka-features.sh downgrade --feature metadata.version=16
> > --transaction.protocol.version=2 // Throws error if metadata version is <
> > 16, and this would be an upgrade
> > But I have updated the KIP to explicitly say the text you mentioned.
> >
> > Justine
> >
> > On Wed, Mar 27, 2024 at 1:41 PM José Armando García Sancio
> >  wrote:
> >
> > > Hi Justine,
> > >
> > > See my comment below.
> > >
> > > On Wed, Mar 27, 2024 at 1:31 PM Justine Olshan
> > >  wrote:
> > > > The feature command includes the upgrade or downgrade command along
> > with
> > > > the --release-version flag. If some features are not moving in the
> > > > direction mentioned (upgrade or downgrade) the command will fail --
> > > perhaps
> > > > with an error of which features were going in the wrong direction.
> > >
> > > How about updating the KIP to show and document this behavior?
> > >
> > > Thanks,
> > > --
> > > -José
> > >
> >
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-28 Thread Justine Olshan
Hi Jun,

For both transaction state and group coordinator state, there are only
version 0 records.
KIP-915 introduced flexible versions, but they were never put to use. MV has
never gated these records. This KIP will do that. I can include this context
in the KIP.

I'm happy to modify David's versions 1 and 2 to 0 and 1.

Justine

On Thu, Mar 28, 2024 at 10:57 AM Jun Rao  wrote:

> Hi, David,
>
> Thanks for the reply.
>
> Historically, the format of all records were controlled by MV. Now, records
> in _offset_commit will be controlled by `group.coordinator.version`, is
> that right? It would be useful to document that.
>
> Also, we should align on the version numbering. "kafka-feature disable"
> says "Disable one or more feature flags. This is the same as downgrading
> the version to zero". So, in the `group.coordinator.version' case, we
> probably should use version 0 for the old consumer protocol.
>
> Jun
>
> On Thu, Mar 28, 2024 at 2:13 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
> > Hi David,
> > I agree that we should use the same mechanism to gate KIP-932 once that
> > feature reaches production readiness. The precise details of the values
> > will
> > depend upon the current state of all these flags when that release comes.
> >
> > Thanks,
> > Andrew
> >
> > > On 28 Mar 2024, at 07:11, David Jacot 
> > wrote:
> > >
> > > Hi, Jun, Justine,
> > >
> > > Regarding `group.coordinator.version`, the idea is to use it to gate
> > > records and APIs of the group coordinator. The first use case will be
> > > KIP-848. We will use version 2 of the flag to gate all the new records
> > and
> > > the new ConsumerGroupHeartbeat/Describe APIs present in AK 3.8. So
> > version
> > > 1 will be the only the old protocol and version 2 will be the currently
> > > implemented new protocol. I don't think that we have any dependency on
> > the
> > > metadata version at the moment. The changes are orthogonal. I think
> that
> > we
> > > could mention KIP-848 as the first usage of this flag in the KIP. I
> will
> > > also update KIP-848 to include it when this KIP is accepted. Another
> use
> > > case is the Queues KIP. I think that we should also use this new flag
> to
> > > gate it.
> > >
> > > Best,
> > > David
> > >
> > > On Thu, Mar 28, 2024 at 1:14 AM Jun Rao 
> > wrote:
> > >
> > >> Hi, Justine,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> So, "dependencies" and "version-mapping" will be added to both
> > >> kafka-feature and kafka-storage? Could we document that in the tool
> > format
> > >> section?
> > >>
> > >> Jun
> > >>
> > >> On Wed, Mar 27, 2024 at 4:01 PM Justine Olshan
> > >> 
> > >> wrote:
> > >>
> > >>> Ok. I can remove the info from the describe output.
> > >>>
> > >>> Dependencies is needed for the storage tool because we want to make
> > sure
> > >>> the desired versions we are setting will be valid. Version mapping
> > should
> > >>> be for both tools since we have --release-version for both tools.
> > >>>
> > >>> I was considering changing the IV strings, but I wasn't sure if there
> > >> would
> > >>> be some disagreement with the decision. Not sure if that breaks
> > >>> compatibility etc. Happy to hear everyone's thoughts.
> > >>>
> > >>> Justine
> > >>>
> > >>> On Wed, Mar 27, 2024 at 3:36 PM Jun Rao 
> > >> wrote:
> > >>>
> > >>>> Hi, Justine,
> > >>>>
> > >>>> Thanks for the reply.
> > >>>>
> > >>>> Having "kafka-feature dependencies" seems enough to me. We don't
> need
> > >> to
> > >>>> include the dependencies in the output of "kafka-feature describe".
> > >>>>
> > >>>> We only support "dependencies" in kafka-feature, not kafka-storage.
> We
> > >>>> probably should do the same for "version-mapping".
> > >>>>
> > >>>> bin/kafka-features.sh downgrade --feature metadata.version=16
> > >>>> --transaction.protocol.version=2
> > >>>> We need to add the --feature flag for th

Re: [DISCUSS] KIP-890 Server Side Defense

2024-03-28 Thread Justine Olshan
Hi there -- another update!

When looking into the implementation for the safe epoch bumps I realized
that we are already populating previousProducerID in memory as part of
KIP-360.
If we are to start using flexible fields, it is better to always use this
information and have an explicit (tagged) field to indicate whether the
client supports KIP-890 part 2.

I've included the extra field and how it is set in the KIP. I've also
updated the KIP to explain that we will be setting the tagged fields when
they are available for all transitions.
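
As a rough sketch of what such a tagged field on the transaction state
record looks like (the versions, tag number, and default here are
placeholders -- the actual TransactionLogValue change is spelled out in the
KIP), take the previous producer ID:

  // Illustrative only; the real field definition is in the KIP.
  { "name": "PrevProducerId", "type": "int64", "versions": "1+",
    "taggedVersions": "1+", "tag": 0, "default": "-1",
    "about": "The previous producer ID, retained to support KIP-890 part 2." }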

Finally, I added clearer text about the transaction protocol versions
included with this KIP: version 1 for flexible transaction state records and
version 2 for KIP-890 part 2 enablement.

Justine

On Mon, Mar 18, 2024 at 6:39 PM Justine Olshan  wrote:

> Hey there -- small update to the KIP,
>
> The KIP mentioned introducing ABORTABLE_ERROR and bumping TxnOffsetCommit
> and Produce requests. I've changed the name in the KIP to
> ABORTABLE_TRANSACTION and the corresponding exception
> AbortableTransactionException to match the pattern we had for other errors.
> I also mentioned bumping all 6 transactional APIs so we can future
> proof/support the error on the client going forward. If a future change
> wants to have an error scenario that requires us to abort the transaction,
> we can rely on the 3.8+ clients to support it. We ran into issues finding
> good/generic error codes that older clients could support while working on
> this KIP, so this should help in the future.
>
> The features discussion is still ongoing in KIP-1022. Will update again
> here when that concludes.
>
> Justine
>
> On Tue, Feb 6, 2024 at 8:39 AM Justine Olshan 
> wrote:
>
>> I don't think AddPartitions is a good example since we currenly don't
>> gate the version on TV or MV. (We only set a different flag depending on
>> the TV)
>>
>> Even if we did want to gate it on TV, I think the idea is to move away
>> from MV gating inter broker protocols. Ideally we can get to a state where
>> MV is just used for metadata changes.
>>
>> I think some of this discussion might fit more with the feature version
>> KIP, so I can try to open that up soon. Until we settle that, some of the
>> work in KIP-890 is blocked.
>>
>> Justine
>>
>> On Mon, Feb 5, 2024 at 5:38 PM Jun Rao  wrote:
>>
>>> Hi, Justine,
>>>
>>> Thanks for the reply.
>>>
>>> Since AddPartitions is an inter broker request, will its version be gated
>>> only by TV or other features like MV too? For example, if we need to
>>> change
>>> the protocol for AddPartitions for reasons other than txn verification in
>>> the future, will the new version be gated by a new MV? If so, does
>>> downgrading a TV imply potential downgrade of MV too?
>>>
>>> Jun
>>>
>>>
>>>
>>> On Mon, Feb 5, 2024 at 5:07 PM Justine Olshan
>>> 
>>> wrote:
>>>
>>> > One TV gates the flexible feature version (no rpcs involved, only the
>>> > transactional records that should only be gated by TV)
>>> > Another TV gates the ability to turn on kip-890 part 2. This would
>>> gate the
>>> > version of Produce and EndTxn (likely only used by transactions), and
>>> > specifies a flag in AddPartitionsToTxn though the version is already
>>> used
>>> > without TV.
>>> >
>>> > I think the only concern is the Produce request and we could consider
>>> work
>>> > arounds similar to the AddPartitionsToTxn call.
>>> >
>>> > Justine
>>> >
>>> > On Mon, Feb 5, 2024 at 4:56 PM Jun Rao 
>>> wrote:
>>> >
>>> > > Hi, Justine,
>>> > >
>>> > > Which PRC/record protocols will TV guard? Going forward, will those
>>> > > PRC/record protocols only be guarded by TV and not by other features
>>> like
>>> > > MV?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > > On Mon, Feb 5, 2024 at 2:41 PM Justine Olshan
>>> > >> > > >
>>> > > wrote:
>>> > >
>>> > > > Hi Jun,
>>> > > >
>>> > > > Sorry I think I misunderstood your question or answered
>>> incorrectly.
>>> > The
>>> > > TV
>>> > > > version should ideally be fully independent from MV.
>>> > > > At least for the changes I proposed, TV should not affect MV and MV

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-01 Thread Justine Olshan
I have also updated the KIP to mention the feature tool's --metadata flag
will be deprecated.
It will still work for users as they learn the new flag, but a warning
indicating the alternatives will be shown.

Justine

On Thu, Mar 28, 2024 at 11:03 AM Justine Olshan 
wrote:

> Hi Jun,
>
> For both transaction state and group coordinator state, there are only
> version 0 records.
> KIP-915 introduced flexible versions, but it was never put to use. MV has
> never gated these. This KIP will do that. I can include this context in the
> KIP.
>
> I'm happy to modify his 1 and 2 to 0 and 1.
>
> Justine
>
> On Thu, Mar 28, 2024 at 10:57 AM Jun Rao  wrote:
>
>> Hi, David,
>>
>> Thanks for the reply.
>>
>> Historically, the format of all records were controlled by MV. Now,
>> records
>> in _offset_commit will be controlled by `group.coordinator.version`, is
>> that right? It would be useful to document that.
>>
>> Also, we should align on the version numbering. "kafka-feature disable"
>> says "Disable one or more feature flags. This is the same as downgrading
>> the version to zero". So, in the `group.coordinator.version' case, we
>> probably should use version 0 for the old consumer protocol.
>>
>> Jun
>>
>> On Thu, Mar 28, 2024 at 2:13 AM Andrew Schofield <
>> andrew_schofield_j...@outlook.com> wrote:
>>
>> > Hi David,
>> > I agree that we should use the same mechanism to gate KIP-932 once that
>> > feature reaches production readiness. The precise details of the values
>> > will
>> > depend upon the current state of all these flags when that release
>> comes.
>> >
>> > Thanks,
>> > Andrew
>> >
>> > > On 28 Mar 2024, at 07:11, David Jacot 
>> > wrote:
>> > >
>> > > Hi, Jun, Justine,
>> > >
>> > > Regarding `group.coordinator.version`, the idea is to use it to gate
>> > > records and APIs of the group coordinator. The first use case will be
>> > > KIP-848. We will use version 2 of the flag to gate all the new records
>> > and
>> > > the new ConsumerGroupHeartbeat/Describe APIs present in AK 3.8. So
>> > version
>> > > 1 will be the only the old protocol and version 2 will be the
>> currently
>> > > implemented new protocol. I don't think that we have any dependency on
>> > the
>> > > metadata version at the moment. The changes are orthogonal. I think
>> that
>> > we
>> > > could mention KIP-848 as the first usage of this flag in the KIP. I
>> will
>> > > also update KIP-848 to include it when this KIP is accepted. Another
>> use
>> > > case is the Queues KIP. I think that we should also use this new flag
>> to
>> > > gate it.
>> > >
>> > > Best,
>> > > David
>> > >
>> > > On Thu, Mar 28, 2024 at 1:14 AM Jun Rao 
>> > wrote:
>> > >
>> > >> Hi, Justine,
>> > >>
>> > >> Thanks for the reply.
>> > >>
>> > >> So, "dependencies" and "version-mapping" will be added to both
>> > >> kafka-feature and kafka-storage? Could we document that in the tool
>> > format
>> > >> section?
>> > >>
>> > >> Jun
>> > >>
>> > >> On Wed, Mar 27, 2024 at 4:01 PM Justine Olshan
>> > >> 
>> > >> wrote:
>> > >>
>> > >>> Ok. I can remove the info from the describe output.
>> > >>>
>> > >>> Dependencies is needed for the storage tool because we want to make
>> > sure
>> > >>> the desired versions we are setting will be valid. Version mapping
>> > should
>> > >>> be for both tools since we have --release-version for both tools.
>> > >>>
>> > >>> I was considering changing the IV strings, but I wasn't sure if
>> there
>> > >> would
>> > >>> be some disagreement with the decision. Not sure if that breaks
>> > >>> compatibility etc. Happy to hear everyone's thoughts.
>> > >>>
>> > >>> Justine
>> > >>>
>> > >>> On Wed, Mar 27, 2024 at 3:36 PM Jun Rao 
>> > >> wrote:
>> > >>>
>> > >>>> Hi, Justine,
>> > >>>>
>> > >>>&

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-01 Thread Justine Olshan
Hi Jun,

20. I can update the KIP.

21. This is used to complete some of the work from KIP-360. (We use the
previous producer ID there, but we never persisted it, even though that was
in the KIP:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820)
That KIP also mentions including the previous epoch, but we explained in
this KIP how we can figure this out.

Justine



On Mon, Apr 1, 2024 at 3:56 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP. A couple of more comments.
>
> 20. Could we show the output of version-mapping?
>
> 21. "Transaction version 1 will include the flexible fields in the
> transaction state log, and transaction version 2 will include the changes
> to the transactional protocol as described by KIP-890 (epoch bumps and
> implicit add partitions.)"
>   So TV 1 enables the writing of new tagged fields like PrevProducerId? But
> those fields are only usable after the epoch bump, right? What
> functionality does TV 1 achieve?
>
> Jun
>
>
> On Mon, Apr 1, 2024 at 2:06 PM Justine Olshan  >
> wrote:
>
> > I have also updated the KIP to mention the feature tool's --metadata flag
> > will be deprecated.
> > It will still work for users as they learn the new flag, but a warning
> > indicating the alternatives will be shown.
> >
> > Justine
> >
> > On Thu, Mar 28, 2024 at 11:03 AM Justine Olshan 
> > wrote:
> >
> > > Hi Jun,
> > >
> > > For both transaction state and group coordinator state, there are only
> > > version 0 records.
> > > KIP-915 introduced flexible versions, but it was never put to use. MV
> has
> > > never gated these. This KIP will do that. I can include this context in
> > the
> > > KIP.
> > >
> > > I'm happy to modify his 1 and 2 to 0 and 1.
> > >
> > > Justine
> > >
> > > On Thu, Mar 28, 2024 at 10:57 AM Jun Rao 
> > wrote:
> > >
> > >> Hi, David,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> Historically, the format of all records were controlled by MV. Now,
> > >> records
> > >> in _offset_commit will be controlled by `group.coordinator.version`,
> is
> > >> that right? It would be useful to document that.
> > >>
> > >> Also, we should align on the version numbering. "kafka-feature
> disable"
> > >> says "Disable one or more feature flags. This is the same as
> downgrading
> > >> the version to zero". So, in the `group.coordinator.version' case, we
> > >> probably should use version 0 for the old consumer protocol.
> > >>
> > >> Jun
> > >>
> > >> On Thu, Mar 28, 2024 at 2:13 AM Andrew Schofield <
> > >> andrew_schofield_j...@outlook.com> wrote:
> > >>
> > >> > Hi David,
> > >> > I agree that we should use the same mechanism to gate KIP-932 once
> > that
> > >> > feature reaches production readiness. The precise details of the
> > values
> > >> > will
> > >> > depend upon the current state of all these flags when that release
> > >> comes.
> > >> >
> > >> > Thanks,
> > >> > Andrew
> > >> >
> > >> > > On 28 Mar 2024, at 07:11, David Jacot  >
> > >> > wrote:
> > >> > >
> > >> > > Hi, Jun, Justine,
> > >> > >
> > >> > > Regarding `group.coordinator.version`, the idea is to use it to
> gate
> > >> > > records and APIs of the group coordinator. The first use case will
> > be
> > >> > > KIP-848. We will use version 2 of the flag to gate all the new
> > records
> > >> > and
> > >> > > the new ConsumerGroupHeartbeat/Describe APIs present in AK 3.8. So
> > >> > version
> > >> > > 1 will be the only the old protocol and version 2 will be the
> > >> currently
> > >> > > implemented new protocol. I don't think that we have any
> dependency
> > on
> > >> > the
> > >> > > metadata version at the moment. The changes are orthogonal. I
> think
> > >> that
> > >> > we
> > >> > > could mention KIP-848 as the first usage of this flag in the KIP.
> I
> > >> will
> > >> > > also update KIP-848 to include it when this KIP is accepted.
> Another
> > >> use

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-02 Thread Justine Olshan
Jun,

21. The next producer ID field doesn't need to be populated for TV 1. We
don't have the same need to retain it there, since it is written directly to
the transaction log in InitProducerId. It is only needed for KIP-890 part 2 /
TV 2.

22. We can do that.

Justine

On Tue, Apr 2, 2024 at 10:41 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> 21. What about the new NextProducerId field? Will that be populated with TV
> 1?
>
> 22. In the dependencies output, should we show both IV and level for
> metadata.version too?
>
> Jun
>
> On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan  >
> wrote:
>
> > Hi Jun,
> >
> > 20. I can update the KIP.
> >
> > 21. This is used to complete some of the work with KIP-360. (We use
> > previous producer ID there, but never persisted it which was in the KIP
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820)
> > The KIP also mentions including previous epoch but we explained in this
> KIP
> > how we can figure this out.
> >
> > Justine
> >
> >
> >
> > On Mon, Apr 1, 2024 at 3:56 PM Jun Rao  wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the updated KIP. A couple of more comments.
> > >
> > > 20. Could we show the output of version-mapping?
> > >
> > > 21. "Transaction version 1 will include the flexible fields in the
> > > transaction state log, and transaction version 2 will include the
> changes
> > > to the transactional protocol as described by KIP-890 (epoch bumps and
> > > implicit add partitions.)"
> > >   So TV 1 enables the writing of new tagged fields like PrevProducerId?
> > But
> > > those fields are only usable after the epoch bump, right? What
> > > functionality does TV 1 achieve?
> > >
> > > Jun
> > >
> > >
> > > On Mon, Apr 1, 2024 at 2:06 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > I have also updated the KIP to mention the feature tool's --metadata
> > flag
> > > > will be deprecated.
> > > > It will still work for users as they learn the new flag, but a
> warning
> > > > indicating the alternatives will be shown.
> > > >
> > > > Justine
> > > >
> > > > On Thu, Mar 28, 2024 at 11:03 AM Justine Olshan <
> jols...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > For both transaction state and group coordinator state, there are
> > only
> > > > > version 0 records.
> > > > > KIP-915 introduced flexible versions, but it was never put to use.
> MV
> > > has
> > > > > never gated these. This KIP will do that. I can include this
> context
> > in
> > > > the
> > > > > KIP.
> > > > >
> > > > > I'm happy to modify his 1 and 2 to 0 and 1.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Thu, Mar 28, 2024 at 10:57 AM Jun Rao  >
> > > > wrote:
> > > > >
> > > > >> Hi, David,
> > > > >>
> > > > >> Thanks for the reply.
> > > > >>
> > > > >> Historically, the format of all records were controlled by MV.
> Now,
> > > > >> records
> > > > >> in _offset_commit will be controlled by
> `group.coordinator.version`,
> > > is
> > > > >> that right? It would be useful to document that.
> > > > >>
> > > > >> Also, we should align on the version numbering. "kafka-feature
> > > disable"
> > > > >> says "Disable one or more feature flags. This is the same as
> > > downgrading
> > > > >> the version to zero". So, in the `group.coordinator.version' case,
> > we
> > > > >> probably should use version 0 for the old consumer protocol.
> > > > >>
> > > > >> Jun
> > > > >>
> > > > >> On Thu, Mar 28, 2024 at 2:13 AM Andrew Schofield <
> > > > >> andrew_schofield_j...@outlook.com> wrote:
> > > > >>
> > > > >> > Hi David,
> > > > >> > I agree that we should use the same mechanism to gate KIP-932
> once
> > > > that
> > > > >> > feature reaches production readines

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-02 Thread Justine Olshan
Hi José

I originally had each feature on a new line and then switched to a single line
since the multi-line form took up a lot of space. I can switch it back.

I don't think it makes a big difference if we implement it for both tools.
Both tools will have use for it.

I can change the name to feature-dependencies if that makes it clearer.

I can say that we won't allow cycles, but I'm not sure of the best way to
enforce this.
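
Purely as an illustration, a minimal sketch of one way to enforce this, assuming
the definitions are available as a feature-to-dependencies map (the class and
method names are made up for this example):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: reject a dependency definition that contains a cycle
// (e.g. X -> Y -> Z -> X) using a depth-first search with an "in progress" set.
public class FeatureDependencyCycleCheck {

    public static boolean hasCycle(Map<String, Set<String>> dependencies) {
        Set<String> finished = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String feature : dependencies.keySet()) {
            if (visit(feature, dependencies, finished, inProgress)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String feature,
                                 Map<String, Set<String>> dependencies,
                                 Set<String> finished,
                                 Set<String> inProgress) {
        if (finished.contains(feature)) {
            return false;                 // already fully explored, no cycle through here
        }
        if (!inProgress.add(feature)) {
            return true;                  // feature is already on the current path: cycle
        }
        for (String dep : dependencies.getOrDefault(feature, Set.of())) {
            if (visit(dep, dependencies, finished, inProgress)) {
                return true;
            }
        }
        inProgress.remove(feature);
        finished.add(feature);
        return false;
    }
}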

I just put ">" to show output. I can change the color if that makes it
clearer.

Justine

On Tue, Apr 2, 2024 at 11:41 AM José Armando García Sancio
 wrote:

> Hi Justine,
>
> See my comments below.
>
> On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
>  wrote:
> > 20. I can update the KIP.
>
> I took a look at the latest KIP.
>
> * Should the output of this command "bin/kafka-features.sh
> version-mapping --release-version 3.6-IV1" be something like this:
> metadata.version=13
> transaction.protocol.version=0
> group.coordinator.version=0
> kraft.version=0
>
> I added the kraft.version to show the changes that are coming from
> KIP-853. I find this formatting much easier to parse by scripts/tools.
> They can even use Java's Properties parser if they want.
>
> * Maybe I missed the discussion but can you talk about why both
> "kafka-storage" and "kafka-features" are going to implement the
> "version-mapping" and dependencies"? I assume that it is sufficient
> for only one (kafka-features) to implement these new commands.
>
> * For the command "dependencies" in the "kafka-features" command, it
> is probably obvious that the dependencies listed are feature version
> dependencies. This is not obvious when the user uses "kafka-storage
> dependencies".  This reads as the dependencies of kafka-storage.
>
> * Should we state that Kafka will not allow cycles in the feature
> version dependency definition? Meaning the user is not allowed to
> define a feature version X which depends on Y which depends on Z which
> depends on X.
>
> * Some of the example output start with the characters "> ". Will the
> CLI print those characters or is that just part of the KIP's
> formating?
>
> Thanks,
> --
> -José
>
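
As a side note on the Properties suggestion above: a script or tool consuming the
proposed key=value output could parse it with java.util.Properties. A small,
purely illustrative sketch (the output lines are copied from the example above):

import java.io.StringReader;
import java.util.Properties;

// Illustrative only: parse the proposed key=value output of
// "kafka-features.sh version-mapping" with Java's Properties parser.
public class VersionMappingParseExample {
    public static void main(String[] args) throws Exception {
        String output = String.join("\n",
            "metadata.version=13",
            "transaction.protocol.version=0",
            "group.coordinator.version=0",
            "kraft.version=0");

        Properties versions = new Properties();
        versions.load(new StringReader(output));

        short metadataLevel = Short.parseShort(versions.getProperty("metadata.version"));
        System.out.println("metadata.version level = " + metadataLevel);
    }
}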


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-02 Thread Justine Olshan
Updated!

Justine

On Tue, Apr 2, 2024 at 2:40 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> 21. Sounds good. It would be useful to document that.
>
> 22. Should we add the IV in "metadata.version=17 has no dependencies" too?
>
> Jun
>
>
> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
> 
> wrote:
>
> > Jun,
> >
> > 21. Next producer ID field doesn't need to be populated for TV 1. We
> don't
> > have the same need to retain this since it is written directly to the
> > transaction log in InitProducerId. It is only needed for KIP-890 part 2 /
> > TV 2.
> >
> > 22. We can do that.
> >
> > Justine
> >
> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the reply.
> > >
> > > 21. What about the new NextProducerId field? Will that be populated
> with
> > TV
> > > 1?
> > >
> > > 22. In the dependencies output, should we show both IV and level for
> > > metadata.version too?
> > >
> > > Jun
> > >
> > > On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 20. I can update the KIP.
> > > >
> > > > 21. This is used to complete some of the work with KIP-360. (We use
> > > > previous producer ID there, but never persisted it which was in the
> KIP
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820)
> > > > The KIP also mentions including previous epoch but we explained in
> this
> > > KIP
> > > > how we can figure this out.
> > > >
> > > > Justine
> > > >
> > > >
> > > >
> > > > On Mon, Apr 1, 2024 at 3:56 PM Jun Rao 
> > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the updated KIP. A couple of more comments.
> > > > >
> > > > > 20. Could we show the output of version-mapping?
> > > > >
> > > > > 21. "Transaction version 1 will include the flexible fields in the
> > > > > transaction state log, and transaction version 2 will include the
> > > changes
> > > > > to the transactional protocol as described by KIP-890 (epoch bumps
> > and
> > > > > implicit add partitions.)"
> > > > >   So TV 1 enables the writing of new tagged fields like
> > PrevProducerId?
> > > > But
> > > > > those fields are only usable after the epoch bump, right? What
> > > > > functionality does TV 1 achieve?
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Mon, Apr 1, 2024 at 2:06 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > I have also updated the KIP to mention the feature tool's
> > --metadata
> > > > flag
> > > > > > will be deprecated.
> > > > > > It will still work for users as they learn the new flag, but a
> > > warning
> > > > > > indicating the alternatives will be shown.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Thu, Mar 28, 2024 at 11:03 AM Justine Olshan <
> > > jols...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > For both transaction state and group coordinator state, there
> are
> > > > only
> > > > > > > version 0 records.
> > > > > > > KIP-915 introduced flexible versions, but it was never put to
> > use.
> > > MV
> > > > > has
> > > > > > > never gated these. This KIP will do that. I can include this
> > > context
> > > > in
> > > > > > the
> > > > > > > KIP.
> > > > > > >
> > > > > > > I'm happy to modify his 1 and 2 to 0 and 1.
> > > > > > >
> > > > > > > Justine
> > > > > > >
> > > > > > > On Thu, Mar 28, 2024 at 10:57 AM Jun Rao

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-03 Thread Justine Olshan
I had missed David's message yesterday about the naming of transaction
version vs. transaction protocol version.

After some offline discussion with Jun, Artem, and David, we agreed that
transaction version is simpler and conveys more than just protocol changes
(flexible records, for example).

I will update the KIP as well as KIP-890.

Thanks,
Justine

On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan  wrote:

> Updated!
>
> Justine
>
> On Tue, Apr 2, 2024 at 2:40 PM Jun Rao  wrote:
>
>> Hi, Justine,
>>
>> Thanks for the reply.
>>
>> 21. Sounds good. It would be useful to document that.
>>
>> 22. Should we add the IV in "metadata.version=17 has no dependencies" too?
>>
>> Jun
>>
>>
>> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
>> 
>> wrote:
>>
>> > Jun,
>> >
>> > 21. Next producer ID field doesn't need to be populated for TV 1. We
>> don't
>> > have the same need to retain this since it is written directly to the
>> > transaction log in InitProducerId. It is only needed for KIP-890 part 2
>> /
>> > TV 2.
>> >
>> > 22. We can do that.
>> >
>> > Justine
>> >
>> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao 
>> wrote:
>> >
>> > > Hi, Justine,
>> > >
>> > > Thanks for the reply.
>> > >
>> > > 21. What about the new NextProducerId field? Will that be populated
>> with
>> > TV
>> > > 1?
>> > >
>> > > 22. In the dependencies output, should we show both IV and level for
>> > > metadata.version too?
>> > >
>> > > Jun
>> > >
>> > > On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
>> > > > > >
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > 20. I can update the KIP.
>> > > >
>> > > > 21. This is used to complete some of the work with KIP-360. (We use
>> > > > previous producer ID there, but never persisted it which was in the
>> KIP
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820
>> )
>> > > > The KIP also mentions including previous epoch but we explained in
>> this
>> > > KIP
>> > > > how we can figure this out.
>> > > >
>> > > > Justine
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Apr 1, 2024 at 3:56 PM Jun Rao 
>> > wrote:
>> > > >
>> > > > > Hi, Justine,
>> > > > >
>> > > > > Thanks for the updated KIP. A couple of more comments.
>> > > > >
>> > > > > 20. Could we show the output of version-mapping?
>> > > > >
>> > > > > 21. "Transaction version 1 will include the flexible fields in the
>> > > > > transaction state log, and transaction version 2 will include the
>> > > changes
>> > > > > to the transactional protocol as described by KIP-890 (epoch bumps
>> > and
>> > > > > implicit add partitions.)"
>> > > > >   So TV 1 enables the writing of new tagged fields like
>> > PrevProducerId?
>> > > > But
>> > > > > those fields are only usable after the epoch bump, right? What
>> > > > > functionality does TV 1 achieve?
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > >
>> > > > > On Mon, Apr 1, 2024 at 2:06 PM Justine Olshan
>> > > > > > > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > I have also updated the KIP to mention the feature tool's
>> > --metadata
>> > > > flag
>> > > > > > will be deprecated.
>> > > > > > It will still work for users as they learn the new flag, but a
>> > > warning
>> > > > > > indicating the alternatives will be shown.
>> > > > > >
>> > > > > > Justine
>> > > > > >
>> > > > > > On Thu, Mar 28, 2024 at 11:03 AM Justine Olshan <
>> > > jols...@confluent.io>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi J

Re: [VOTE] 3.6.2 RC2

2024-04-03 Thread Justine Olshan
Hi Manikumar,

I've verified the keys, scanned the artifacts, and checked the other docs.
I built from source and ran with a ZK cluster (since I saw that we updated
the ZK version in this release).
I ran a few tests on this cluster.

I also ran the 2.12 binary.

I noticed the docs link (https://kafka.apache.org/36/documentation.html)
mentions 3.6.1 as the latest. Is that intended?
I will give my final vote when we figure this out.

Thanks,
Justine

On Wed, Apr 3, 2024 at 7:25 AM Lianet M.  wrote:

> Hi Manikumar, I did the following checks:
>
> - downloaded and built from src
> - ran all unit test and integration test for clients
> - ran quickstart with Kraft mode
> - ran simple workloads with the console consumer/producer
> - checked all links
>
> All looks good to me with this.
>
> +1 (non-binding)
>
> Thanks!
>
>
>
> On Wed, Apr 3, 2024, 2:19 a.m. Manikumar  wrote:
>
> > Gentle reminder. Please download, test and vote for the release.
> >
> > Thanks,
> >
> > On Fri, Mar 29, 2024 at 4:57 PM Manikumar  wrote:
> >
> > > Hi All,
> > >
> > > System test runs are green. There were 13 test failures in the first
> run.
> > > All the failed tests passed in the second run.
> > >
> > > System test results:
> > > https://gist.github.com/omkreddy/17d23d3eb36ef840011f2494d65bbd4f
> > >
> > > Thanks,
> > >
> > > On Thu, Mar 28, 2024 at 3:21 PM Manikumar 
> wrote:
> > >
> > >> Hello Kafka users, developers and client-developers,
> > >>
> > >> This is the second candidate we are considering for the release of
> > Apache
> > >> Kafka 3.6.2.
> > >>
> > >> This is a bugfix release with several fixes, including dependency
> > >> version bumps for CVEs.
> > >>
> > >> Release notes for the 3.6.2 release:
> > >> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/RELEASE_NOTES.html
> > >>
> > >> *** Please download, test and vote by by Wednesday, April 3rd
> > >>
> > >> Kafka's KEYS file containing PGP keys we use to sign the release:
> > >> https://kafka.apache.org/KEYS
> > >>
> > >> * Release artifacts to be voted upon (source and binary):
> > >> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/
> > >>
> > >>
> > >> * Maven artifacts to be voted upon:
> > >>
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > >>
> > >> * Javadoc:
> > >> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/javadoc/
> > >>
> > >> * Tag to be voted upon (off 3.6 branch) is the 3.6.2 tag:
> > >> https://github.com/apache/kafka/releases/tag/3.6.2-rc2
> > >>
> > >> * Documentation:
> > >> https://kafka.apache.org/36/documentation.html
> > >>
> > >> * Protocol:
> > >> https://kafka.apache.org/36/protocol.html
> > >>
> > >> * Successful Jenkins builds for the 3.6 branch:
> > >> Unit/integration tests:
> > >> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.6/167/ (test
> > >> failures passed in local runs)
> > >> System tests: I will update system test results soon.
> > >>
> > >>
> > >> Thanks,
> > >> Manikumar
> > >>
> > >
> >
>


Re: [VOTE] 3.6.2 RC2

2024-04-03 Thread Justine Olshan
Thanks for clarifying!
I took a look at the documentation.html file in there, and it said 3.4. Is
that expected?

There are some files that reference fullDotVersion, and those seemed closer to
what was expected: "fullDotVersion": "3.6.2-SNAPSHOT".
The upgrade.html file also looked ok.

Thanks for running the release and answering my questions!
Justine

On Wed, Apr 3, 2024 at 10:21 AM Manikumar  wrote:

> Hi Justine,
>
> Yes, it is intended. For bug fix releases website docs will be updated
> during the final release process.
> We can verify the site-docs artifacts here:
>
> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/kafka_2.12-3.6.2-site-docs.tgz
> These site-docs artifacts will be used to update website docs.
>
>
> Thanks,
>
> On Wed, Apr 3, 2024 at 10:30 PM Justine Olshan
> 
> wrote:
>
> > Hi Manikumar,
> >
> > I've verified the keys, scanned the artifacts, and other docs.
> > I built from source and ran with a ZK cluster (since I saw that we
> updated
> > ZK version in this release)
> > I ran a few tests on this cluster.
> >
> > I also ran the 2.12 binary.
> >
> > I noticed the docs link (https://kafka.apache.org/36/documentation.html)
> > mentions 3.6.1 as the latest. Is that intended?
> > I will give my final vote when we figure this out.
> >
> > Thanks,
> > Justine
> >
> > On Wed, Apr 3, 2024 at 7:25 AM Lianet M.  wrote:
> >
> > > Hi Manikumar, I did the following checks:
> > >
> > > - downloaded and built from src
> > > - ran all unit test and integration test for clients
> > > - ran quickstart with Kraft mode
> > > - ran simple workloads with the console consumer/producer
> > > - checked all links
> > >
> > > All looks good to me with this.
> > >
> > > +1 (non-binding)
> > >
> > > Thanks!
> > >
> > >
> > >
> > > On Wed, Apr 3, 2024, 2:19 a.m. Manikumar  wrote:
> > >
> > > > Gentle reminder. Please download, test and vote for the release.
> > > >
> > > > Thanks,
> > > >
> > > > On Fri, Mar 29, 2024 at 4:57 PM Manikumar 
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > System test runs are green. There were 13 test failures in the
> first
> > > run.
> > > > > All the failed tests passed in the second run.
> > > > >
> > > > > System test results:
> > > > > https://gist.github.com/omkreddy/17d23d3eb36ef840011f2494d65bbd4f
> > > > >
> > > > > Thanks,
> > > > >
> > > > > On Thu, Mar 28, 2024 at 3:21 PM Manikumar 
> > > wrote:
> > > > >
> > > > >> Hello Kafka users, developers and client-developers,
> > > > >>
> > > > >> This is the second candidate we are considering for the release of
> > > > Apache
> > > > >> Kafka 3.6.2.
> > > > >>
> > > > >> This is a bugfix release with several fixes, including dependency
> > > > >> version bumps for CVEs.
> > > > >>
> > > > >> Release notes for the 3.6.2 release:
> > > > >>
> > https://home.apache.org/~manikumar/kafka-3.6.2-rc2/RELEASE_NOTES.html
> > > > >>
> > > > >> *** Please download, test and vote by by Wednesday, April 3rd
> > > > >>
> > > > >> Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > >> https://kafka.apache.org/KEYS
> > > > >>
> > > > >> * Release artifacts to be voted upon (source and binary):
> > > > >> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/
> > > > >>
> > > > >>
> > > > >> * Maven artifacts to be voted upon:
> > > > >>
> > > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > > >>
> > > > >> * Javadoc:
> > > > >> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/javadoc/
> > > > >>
> > > > >> * Tag to be voted upon (off 3.6 branch) is the 3.6.2 tag:
> > > > >> https://github.com/apache/kafka/releases/tag/3.6.2-rc2
> > > > >>
> > > > >> * Documentation:
> > > > >> https://kafka.apache.org/36/documentation.html
> > > > >>
> > > > >> * Protocol:
> > > > >> https://kafka.apache.org/36/protocol.html
> > > > >>
> > > > >> * Successful Jenkins builds for the 3.6 branch:
> > > > >> Unit/integration tests:
> > > > >> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.6/167/
> (test
> > > > >> failures passed in local runs)
> > > > >> System tests: I will update system test results soon.
> > > > >>
> > > > >>
> > > > >> Thanks,
> > > > >> Manikumar
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-03 Thread Justine Olshan
Find and replace has failed me :(

Group version seems a little vague, but we can update it. Hopefully find
and replace won't fail me again; otherwise, I will get another email about this.

Justine

On Wed, Apr 3, 2024 at 12:15 PM David Jacot 
wrote:

> Thanks, Justine.
>
> * Should we also use `group.version` (GV) as I suggested in my previous
> message in order to be consistent?
> * Should we add both names to the `Public Interfaces` section?
> * There is still at least one usage of `transaction.protocol.verison` in
> the KIP too.
>
> Best,
> David
>
> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan  >
> wrote:
>
> > I had missed the David's message yesterday about the naming for
> transaction
> > version vs transaction protocol version.
> >
> > After some offline discussion with Jun, Artem, and David, we agreed that
> > transaction version is simpler and conveys more than just protocol
> changes
> > (flexible records for example)
> >
> > I will update the KIP as well as KIP-890
> >
> > Thanks,
> > Justine
> >
> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan 
> > wrote:
> >
> > > Updated!
> > >
> > > Justine
> > >
> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao 
> wrote:
> > >
> > >> Hi, Justine,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> 21. Sounds good. It would be useful to document that.
> > >>
> > >> 22. Should we add the IV in "metadata.version=17 has no dependencies"
> > too?
> > >>
> > >> Jun
> > >>
> > >>
> > >> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
> > >> 
> > >> wrote:
> > >>
> > >> > Jun,
> > >> >
> > >> > 21. Next producer ID field doesn't need to be populated for TV 1. We
> > >> don't
> > >> > have the same need to retain this since it is written directly to
> the
> > >> > transaction log in InitProducerId. It is only needed for KIP-890
> part
> > 2
> > >> /
> > >> > TV 2.
> > >> >
> > >> > 22. We can do that.
> > >> >
> > >> > Justine
> > >> >
> > >> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao 
> > >> wrote:
> > >> >
> > >> > > Hi, Justine,
> > >> > >
> > >> > > Thanks for the reply.
> > >> > >
> > >> > > 21. What about the new NextProducerId field? Will that be
> populated
> > >> with
> > >> > TV
> > >> > > 1?
> > >> > >
> > >> > > 22. In the dependencies output, should we show both IV and level
> for
> > >> > > metadata.version too?
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > > On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
> > >> >  > >> > > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Jun,
> > >> > > >
> > >> > > > 20. I can update the KIP.
> > >> > > >
> > >> > > > 21. This is used to complete some of the work with KIP-360. (We
> > use
> > >> > > > previous producer ID there, but never persisted it which was in
> > the
> > >> KIP
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820
> > >> )
> > >> > > > The KIP also mentions including previous epoch but we explained
> in
> > >> this
> > >> > > KIP
> > >> > > > how we can figure this out.
> > >> > > >
> > >> > > > Justine
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Apr 1, 2024 at 3:56 PM Jun Rao  >
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi, Justine,
> > >> > > > >
> > >> > > > > Thanks for the updated KIP. A couple of more comments.
> > >> > > > >
> > >> > > > > 20. Could we show the output of version-mapping?
> > >> > > > >

Re: [VOTE] 3.6.2 RC2

2024-04-04 Thread Justine Olshan
Thanks for clarifying Manikumar!

+1 (binding) from me

Justine

On Thu, Apr 4, 2024 at 3:19 AM Divij Vaidya  wrote:

> Hi Manikumar
>
> I verified the following:
> - all non-minor commits in the branch are captured in release notes
> - signature on the artifact match the public signature of Manikumar
> - basic topic creation & produce / consumer works with JDK8 + ARM + Kafka
> 3.6.2 + Scala 2.12 + ZK + compression (zstd)
>
> Things look good to me. We don't need another RC for fixing docs.
>
> +1 (binding) from me.
>
> --
> Divij Vaidya
>
>
>
> On Thu, Apr 4, 2024 at 10:04 AM Manikumar 
> wrote:
>
> > Hi Justine,
> >
> > Thanks for catching this. looks like we have missed updating
> > `docs/documentation.html` in kafka repo during 3.5 and 3.6 release.
> >
> > I will make sure to use the correct version when updating docs for 3.6.2
> > release.
> > I will also update 3.5 and 3.6 branches with the correct heading and also
> > update the release wiki.
> >
> > >what was expected: "fullDotVersion": "3.6.2-SNAPSHOT"
> > we will remove the "-SNAPSHOT" suffix while updating the website docs. we
> > may need to automate this in the release script.
> >
> >
> > [1] https://github.com/apache/kafka/blob/3.6/docs/documentation.html#L36
> > [2] https://github.com/apache/kafka/blob/3.5/docs/documentation.html#L36
> >
> >
> > Thanks,
> >
> > On Thu, Apr 4, 2024 at 3:50 AM Justine Olshan
>  > >
> > wrote:
> >
> > > Thanks for clarifying!
> > > I took a look at the documentation.html file in there, and it said 3.4.
> > Is
> > > that expected?
> > >
> > > There are some files that request fullDot version and that seemed
> closer
> > to
> > > what was expected: "fullDotVersion": "3.6.2-SNAPSHOT"
> > > The upgrade.html file also looked ok.
> > >
> > > Thanks for running the release and answering my questions!
> > > Justine
> > >
> > > On Wed, Apr 3, 2024 at 10:21 AM Manikumar 
> > > wrote:
> > >
> > > > Hi Justine,
> > > >
> > > > Yes, it is intended. For bug fix releases website docs will be
> updated
> > > > during the final release process.
> > > > We can verify the site-docs artifacts here:
> > > >
> > > >
> > >
> >
> https://home.apache.org/~manikumar/kafka-3.6.2-rc2/kafka_2.12-3.6.2-site-docs.tgz
> > > > These site-docs artifacts will be used to update website docs.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > On Wed, Apr 3, 2024 at 10:30 PM Justine Olshan
> > > > 
> > > > wrote:
> > > >
> > > > > Hi Manikumar,
> > > > >
> > > > > I've verified the keys, scanned the artifacts, and other docs.
> > > > > I built from source and ran with a ZK cluster (since I saw that we
> > > > updated
> > > > > ZK version in this release)
> > > > > I ran a few tests on this cluster.
> > > > >
> > > > > I also ran the 2.12 binary.
> > > > >
> > > > > I noticed the docs link (
> > > https://kafka.apache.org/36/documentation.html)
> > > > > mentions 3.6.1 as the latest. Is that intended?
> > > > > I will give my final vote when we figure this out.
> > > > >
> > > > > Thanks,
> > > > > Justine
> > > > >
> > > > > On Wed, Apr 3, 2024 at 7:25 AM Lianet M. 
> wrote:
> > > > >
> > > > > > Hi Manikumar, I did the following checks:
> > > > > >
> > > > > > - downloaded and built from src
> > > > > > - ran all unit test and integration test for clients
> > > > > > - ran quickstart with Kraft mode
> > > > > > - ran simple workloads with the console consumer/producer
> > > > > > - checked all links
> > > > > >
> > > > > > All looks good to me with this.
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 3, 2024, 2:19 a.m. Manikumar 
> > > wrote:
> > > > > >
> > > > > > > Gentle reminder. Please downloa

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-04 Thread Justine Olshan
I'm hoping this covers the majority of comments. I will go ahead and open
the vote in the next day or so.

Thanks,
Justine

On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan  wrote:

> Find and replace has failed me :(
>
> Group version seems a little vague, but we can update it. Hopefully find
> and replace won't fail me again, otherwise I will get another email on this.
>
> Justine
>
> On Wed, Apr 3, 2024 at 12:15 PM David Jacot 
> wrote:
>
>> Thanks, Justine.
>>
>> * Should we also use `group.version` (GV) as I suggested in my previous
>> message in order to be consistent?
>> * Should we add both names to the `Public Interfaces` section?
>> * There is still at least one usage of `transaction.protocol.verison` in
>> the KIP too.
>>
>> Best,
>> David
>>
>> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
>> 
>> wrote:
>>
>> > I had missed the David's message yesterday about the naming for
>> transaction
>> > version vs transaction protocol version.
>> >
>> > After some offline discussion with Jun, Artem, and David, we agreed that
>> > transaction version is simpler and conveys more than just protocol
>> changes
>> > (flexible records for example)
>> >
>> > I will update the KIP as well as KIP-890
>> >
>> > Thanks,
>> > Justine
>> >
>> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan 
>> > wrote:
>> >
>> > > Updated!
>> > >
>> > > Justine
>> > >
>> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao 
>> wrote:
>> > >
>> > >> Hi, Justine,
>> > >>
>> > >> Thanks for the reply.
>> > >>
>> > >> 21. Sounds good. It would be useful to document that.
>> > >>
>> > >> 22. Should we add the IV in "metadata.version=17 has no dependencies"
>> > too?
>> > >>
>> > >> Jun
>> > >>
>> > >>
>> > >> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
>> > >> 
>> > >> wrote:
>> > >>
>> > >> > Jun,
>> > >> >
>> > >> > 21. Next producer ID field doesn't need to be populated for TV 1.
>> We
>> > >> don't
>> > >> > have the same need to retain this since it is written directly to
>> the
>> > >> > transaction log in InitProducerId. It is only needed for KIP-890
>> part
>> > 2
>> > >> /
>> > >> > TV 2.
>> > >> >
>> > >> > 22. We can do that.
>> > >> >
>> > >> > Justine
>> > >> >
>> > >> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao 
>> > >> wrote:
>> > >> >
>> > >> > > Hi, Justine,
>> > >> > >
>> > >> > > Thanks for the reply.
>> > >> > >
>> > >> > > 21. What about the new NextProducerId field? Will that be
>> populated
>> > >> with
>> > >> > TV
>> > >> > > 1?
>> > >> > >
>> > >> > > 22. In the dependencies output, should we show both IV and level
>> for
>> > >> > > metadata.version too?
>> > >> > >
>> > >> > > Jun
>> > >> > >
>> > >> > > On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
>> > >> > > > >> > > >
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi Jun,
>> > >> > > >
>> > >> > > > 20. I can update the KIP.
>> > >> > > >
>> > >> > > > 21. This is used to complete some of the work with KIP-360. (We
>> > use
>> > >> > > > previous producer ID there, but never persisted it which was in
>> > the
>> > >> KIP
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820
>> > >> )
>> > >> > > > The KIP also mentions including previous epoch but we
>> explained in
>> > >> this
>> > >> > > KIP
>> > >> > > > how we ca

[VOTE] KIP-1022 Formatting and Updating Features

2024-04-08 Thread Justine Olshan
Hello all,
I would like to start a vote for KIP-1022 Formatting and Updating Features


Please take a look and cast your vote.

Thanks,
Justine


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-08 Thread Justine Olshan
Hey Jun,

That's a good question. I think maybe for simplicity, we can have a single
config?
If that makes sense, I will update the KIP.

Justine

On Mon, Apr 8, 2024 at 3:20 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP.
>
> One more question related to KIP-1014. It introduced a new
> config unstable.metadata.versions.enable. Does each new feature need to
> have a corresponding config to enable the testing of unstable features or
> should we have a generic config enabling the testing of all unstable
> features?
>
> Jun
>
> On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan  >
> wrote:
>
> > I'm hoping this covers the majority of comments. I will go ahead and open
> > the vote in the next day or so.
> >
> > Thanks,
> > Justine
> >
> > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan 
> > wrote:
> >
> > > Find and replace has failed me :(
> > >
> > > Group version seems a little vague, but we can update it. Hopefully
> find
> > > and replace won't fail me again, otherwise I will get another email on
> > this.
> > >
> > > Justine
> > >
> > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
>  > >
> > > wrote:
> > >
> > >> Thanks, Justine.
> > >>
> > >> * Should we also use `group.version` (GV) as I suggested in my
> previous
> > >> message in order to be consistent?
> > >> * Should we add both names to the `Public Interfaces` section?
> > >> * There is still at least one usage of `transaction.protocol.verison`
> in
> > >> the KIP too.
> > >>
> > >> Best,
> > >> David
> > >>
> > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > >> 
> > >> wrote:
> > >>
> > >> > I had missed the David's message yesterday about the naming for
> > >> transaction
> > >> > version vs transaction protocol version.
> > >> >
> > >> > After some offline discussion with Jun, Artem, and David, we agreed
> > that
> > >> > transaction version is simpler and conveys more than just protocol
> > >> changes
> > >> > (flexible records for example)
> > >> >
> > >> > I will update the KIP as well as KIP-890
> > >> >
> > >> > Thanks,
> > >> > Justine
> > >> >
> > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan  >
> > >> > wrote:
> > >> >
> > >> > > Updated!
> > >> > >
> > >> > > Justine
> > >> > >
> > >> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao 
> > >> wrote:
> > >> > >
> > >> > >> Hi, Justine,
> > >> > >>
> > >> > >> Thanks for the reply.
> > >> > >>
> > >> > >> 21. Sounds good. It would be useful to document that.
> > >> > >>
> > >> > >> 22. Should we add the IV in "metadata.version=17 has no
> > dependencies"
> > >> > too?
> > >> > >>
> > >> > >> Jun
> > >> > >>
> > >> > >>
> > >> > >> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
> > >> > >> 
> > >> > >> wrote:
> > >> > >>
> > >> > >> > Jun,
> > >> > >> >
> > >> > >> > 21. Next producer ID field doesn't need to be populated for TV
> 1.
> > >> We
> > >> > >> don't
> > >> > >> > have the same need to retain this since it is written directly
> to
> > >> the
> > >> > >> > transaction log in InitProducerId. It is only needed for
> KIP-890
> > >> part
> > >> > 2
> > >> > >> /
> > >> > >> > TV 2.
> > >> > >> >
> > >> > >> > 22. We can do that.
> > >> > >> >
> > >> > >> > Justine
> > >> > >> >
> > >> > >> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao
>  > >
> > >> > >> wrote:
> > >> > >> >
> > >> > >> > > Hi, Justine,
> > >> > >> > >
> > >> > >> >

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-09 Thread Justine Olshan
Hi Jun,

Makes sense to me. It seems like KIP-1014 has been inactive recently. I can
update my KIP and mention this change on that discussion thread.

Justine

On Tue, Apr 9, 2024 at 9:01 AM Jun Rao  wrote:

> Hi, Justine,
>
> A single config makes sense to me too. We just need to reach consensus with
> KIP-1014.
>
> Thanks,
>
> Jun
>
> On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan  >
> wrote:
>
> > Hey Jun,
> >
> > That's a good question. I think maybe for simplicity, we can have a
> single
> > config?
> > If that makes sense, I will update the KIP.
> >
> > Justine
> >
> > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao  wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the updated KIP.
> > >
> > > One more question related to KIP-1014. It introduced a new
> > > config unstable.metadata.versions.enable. Does each new feature need to
> > > have a corresponding config to enable the testing of unstable features
> or
> > > should we have a generic config enabling the testing of all unstable
> > > features?
> > >
> > > Jun
> > >
> > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > I'm hoping this covers the majority of comments. I will go ahead and
> > open
> > > > the vote in the next day or so.
> > > >
> > > > Thanks,
> > > > Justine
> > > >
> > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan 
> > > > wrote:
> > > >
> > > > > Find and replace has failed me :(
> > > > >
> > > > > Group version seems a little vague, but we can update it. Hopefully
> > > find
> > > > > and replace won't fail me again, otherwise I will get another email
> > on
> > > > this.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > >> Thanks, Justine.
> > > > >>
> > > > >> * Should we also use `group.version` (GV) as I suggested in my
> > > previous
> > > > >> message in order to be consistent?
> > > > >> * Should we add both names to the `Public Interfaces` section?
> > > > >> * There is still at least one usage of
> > `transaction.protocol.verison`
> > > in
> > > > >> the KIP too.
> > > > >>
> > > > >> Best,
> > > > >> David
> > > > >>
> > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > >> 
> > > > >> wrote:
> > > > >>
> > > > >> > I had missed the David's message yesterday about the naming for
> > > > >> transaction
> > > > >> > version vs transaction protocol version.
> > > > >> >
> > > > >> > After some offline discussion with Jun, Artem, and David, we
> > agreed
> > > > that
> > > > >> > transaction version is simpler and conveys more than just
> protocol
> > > > >> changes
> > > > >> > (flexible records for example)
> > > > >> >
> > > > >> > I will update the KIP as well as KIP-890
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Justine
> > > > >> >
> > > > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan <
> > jols...@confluent.io
> > > >
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Updated!
> > > > >> > >
> > > > >> > > Justine
> > > > >> > >
> > > > >> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao
>  > >
> > > > >> wrote:
> > > > >> > >
> > > > >> > >> Hi, Justine,
> > > > >> > >>
> > > > >> > >> Thanks for the reply.
> > > > >> > >>
> > > > >> > >> 21. Sounds good. It would be useful to document that.
> > > > >> > >>
> > > > >> > >> 22. Should we add 

Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-04-09 Thread Justine Olshan
> > >> -Artem
> > >>
> > >>
> > >> On Mon, Jan 15, 2024 at 10:03 PM Colin McCabe 
> > wrote:
> > >>
> > >> > On Fri, Jan 12, 2024, at 11:32, Artem Livshits wrote:
> > >> > > I think using feature flags (whether we support a framework and
> > tooling
> > >> > for
> > >> > > feature flags or just an ad-hoc XyzEnabled flag) can be an
> > alternative
> > >> to
> > >> > > this KIP.  I think the value of this KIP is that it's trying to
> > >> propose a
> > >> > > systemic approach for gating functionality that may take multiple
> > >> > releases
> > >> > > to develop.  A problem with ad-hoc feature flags is that it's
> useful
> > >> > during
> > >> > > feature development, so that people who are working on this
> feature
> > (or
> > >> > are
> > >> > > interested in beta-testing the feature) can get early access
> > (without
> > >> any
> > >> > > guarantees on compatibility or even correctness); but then the
> > feature
> > >> > > flags often stick forever after the feature development is done
> > (and as
> > >> > > time moves one, the new code is written in such a way that it's
> not
> > >> > > possible to turn the feature off any more cleanly).
> > >> > >
> > >> >
> > >> > Hi Artem,
> > >> >
> > >> > I think feature flags are somewhat orthogonal to the stable /
> unstable
> > >> > discussion. Even if every new feature was a feature flag, you
> probably
> > >> > still wouldn't want to stabilize the features immediately, to avoid
> > >> > maintaining a lot of alpha stuff forever.
> > >> >
> > >> > (I also think that feature flags should be used sparingly, if at
> all,
> > >> > because of the way that they exponentially increase the test matrix.
> > But
> > >> > that's a tangent, I think, given the first point...)
> > >> >
> > >> > >
> > >> > > I'd also clarify how I think about "stable".  Ismael made a
> comment
> > "
> > >> > > something is stable in the "this is battle-tested" sense.".  I
> don't
> > >> > think
> > >> > > it has to be "battle-tested", it just has to meet the bar of being
> > in
> > >> the
> > >> > > trunk.  Again, thinking of a small single-commit feature -- to
> > commit
> > >> to
> > >> > > trunk, the feature doesn't have to be "battle-tested", but it
> > should be
> > >> > > complete (and not just a bunch of TODOs), with tests written and
> > some
> > >> > level
> > >> > > of dev-testing done, so that once the release is cut, we can find
> > and
> > >> fix
> > >> > > bugs and make it release-quality (as opposed to reverting the
> whole
> > >> > > thing).  We can apply the same "stability" bar for features to be
> in
> > >> the
> > >> > > stable MV -- fully complete, tests written and some level of dev
> > >> testing
> > >> > > done.
> > >> > >
> > >> >
> > >> > I'm struggling a bit with your phrasing. Are you suggesting that
> > unstable
> > >> > MVs would not be able to be in trunk? I think we do want them to be
> > able
> > >> to
> > >> > go into trunk. If they have to go into a branch, then there is
> > actually
> > >> no
> > >> > need for any of this.
> > >> >
> > >> > Doing big features in long-lived branches is one genuine alternative
> > to
> > >> > unstable MVs, I think. But it's not an alternative that works well
> > with
> > >> our
> > >> > current tooling for continuous integration, deployment, building,
> > etc. I
> > >> > think it would also impact developer productivity somewhat
> negatively.
> > >> >
> > >> > best,
> > >> > Colin
> > >> >
> > >> >
> > >> > >
> > >> > > -Artem
> > >> > >
> > >> > > On Fri, Jan 12, 2024 a

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-09 Thread Justine Olshan
José,

INVALID_UPDATE_VERSION was added as part of KIP-497. The KIP seems to be
lacking some details on the error.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-497%3A+Add+inter-broker+API+to+alter+ISR
https://github.com/apache/kafka/commit/57de67db22eb373f92ec5dd449d317ed2bc8b8d1

The error seems to be used in the feature update path as well, though that
was also not included in KIP-584. I wonder if we were missing necessary
details for many KIPs in 2020...

I'm not sure I fully understand the proposal. Is the question about the exact
error to use in UpdatableFeatureResult.ErrorCode and what to write in
UpdatableFeatureResult.ErrorMessage? If so, using those errors and adding a
message (for example, the dependency that was violated) makes sense.
I agree that any errors in updates should be a top-level error and should not
result in a partial update.

I thought these were part of KIP-584, but I will take a look and update
this KIP if they are not.

Justine

On Tue, Apr 9, 2024 at 1:10 PM José Armando García Sancio
 wrote:

> Hi Justine,
>
> Thanks for the KIP. I see that the KIP doesn't make any updates to the
> UpdateFeatures RPC. I was trying to understand how errors will be
> communicated to the client.
>
> Are you planning to use the INVALID_UPDATE_VERSION error and overwrite
> the ErrorMessage field for all of the validations you mentioned in the
> KIP? I see that INVALID_UPDATE_VERSION is in the code for Apache Kafka
> but I couldn't find the KIP that adds this error. It is not in KIP-584
> or KIP-778. If you agree, do you think we should document this error
> in this KIP?
>
> It is also not clear to me when the UpdateFeaturesResponse will return
> an error per feature versus an error for the entire RPC. KIP-584
> defines this relationship but it doesn't specify when exactly a top
> level error will be returned versus when a feature level error will be
> returned. I think that most users wouldn't want partial failures. They
> instead would like to be guaranteed that all of the feature updates
> succeeded or none did. Do you agree? Should we update the KIP to make
> this clear?
>
> Thanks!
> --
> -José
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-09 Thread Justine Olshan
I took a quick look at the code -- looks like the previous behavior was not
to set a top level error if one particular feature had an issue. We can do
that.

I think it could make some sense to specify errors on features that were
not valid and use the top level error to indicate that the request didn't
update any features. The handling now is to complete the futures with the
top level error anyway.

As for the validation criteria: it seems like one bit of code that
validates whether a version is allowed is the method
`reasonNotSupported`, which checks the supported version range for the
given feature.
For metadata.version we have a method to do "additional checks" and we
could have those for the various other features as well. I have an
(internal) FeatureVersion interface in mind that would work well for this.
For any of these validations, we return the same error
`INVALID_UPDATE_VERSION`. I would think continuing to use this error
follows naturally, but if we think it is necessary to specify the error
code, I can do so in my KIP.

Justine
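
For illustration, a rough sketch of the kind of per-feature validation hook
described above. Every name here is an assumption made for the example, not the
actual internal interface:

import java.util.Map;

// Hypothetical sketch only. Each feature supplies its range check plus any
// feature-specific "additional checks" (dependency constraints, for example);
// a non-null reason would surface as an INVALID_UPDATE_VERSION error with that
// reason as the error message, and no partial update would be applied.
public interface FeatureVersionCheck {

    String featureName();

    // Returns null if the requested level is valid given the other finalized
    // feature levels; otherwise returns a human-readable reason for the failure.
    String reasonNotSupported(short requestedLevel, Map<String, Short> finalizedFeatures);
}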

On Tue, Apr 9, 2024 at 1:46 PM Justine Olshan  wrote:

> José,
>
> INVALID_UPDATE_VERSION was added as part of KIP-497. The KIP seems to be
> lacking some details on the error.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-497%3A+Add+inter-broker+API+to+alter+ISR
>
> https://github.com/apache/kafka/commit/57de67db22eb373f92ec5dd449d317ed2bc8b8d1
>
> The error seems to be used in the feature update path as well, though that
> was also not included in KIP-584. I wonder if we were missing necessary
> details for many KIPs in 2020...
>
> I'm not sure I fully understand the proposal. Is the question for the
> exact error to use in UpdatableFeatureResult.ErrorCode and what to write
> in  UpdatableFeatureResult.ErrorMessage? If so, those errors and adding a
> message (the dependency that was violated for example) makes sense.
> I agree that it makes sense that any errors in updates should be a top
> level error and not have a partial update.
>
> I thought these were part of KIP-584, but I will take a look and update
> this KIP if they are not.
>
> Justine
>
> On Tue, Apr 9, 2024 at 1:10 PM José Armando García Sancio
>  wrote:
>
>> Hi Justine,
>>
>> Thanks for the KIP. I see that the KIP doesn't make any updates to the
>> UpdateFeatures RPC. I was trying to understand how errors will be
>> communicated to the client.
>>
>> Are you planning to use the INVALID_UPDATE_VERSION error and overwrite
>> the ErrorMessage field for all of the validations you mentioned in the
>> KIP? I see that INVALID_UPDATE_VERSION is in the code for Apache Kafka
>> but I couldn't find the KIP that adds this error. It is not in KIP-584
>> or KIP-778. If you agree, do you think we should document this error
>> in this KIP?
>>
>> It is also not clear to me when the UpdateFeaturesResponse will return
>> an error per feature versus an error for the entire RPC. KIP-584
>> defines this relationship but it doesn't specify when exactly a top
>> level error will be returned versus when a feature level error will be
>> returned. I think that most users wouldn't want partial failures. They
>> instead would like to be guaranteed that all of the feature updates
>> succeeded or none did. Do you agree? Should we update the KIP to make
>> this clear?
>>
>> Thanks!
>> --
>> -José
>>
>


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Justine Olshan
The original config was never actually approved in any KIP. But we can say
it is deprecated.
I can change the config name.

Justine

On Thu, Apr 11, 2024 at 8:52 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP.
>
> Would unstable.feature.version.enable be a clearer name? Also, should we
> remove/deprecate unstable.metadata.versions.enable in this KIP?
>
> Jun
>
> On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan  >
> wrote:
>
> > Hi Jun,
> >
> > Makes sense to me. It seems like KIP-1014 has been inactive recently. I
> can
> > update my KIP and mention this change on that discussion thread.
> >
> > Justine
> >
> > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao  wrote:
> >
> > > Hi, Justine,
> > >
> > > A single config makes sense to me too. We just need to reach consensus
> > with
> > > KIP-1014.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > That's a good question. I think maybe for simplicity, we can have a
> > > single
> > > > config?
> > > > If that makes sense, I will update the KIP.
> > > >
> > > > Justine
> > > >
> > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao 
> > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the updated KIP.
> > > > >
> > > > > One more question related to KIP-1014. It introduced a new
> > > > > config unstable.metadata.versions.enable. Does each new feature
> need
> > to
> > > > > have a corresponding config to enable the testing of unstable
> > features
> > > or
> > > > > should we have a generic config enabling the testing of all
> unstable
> > > > > features?
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > I'm hoping this covers the majority of comments. I will go ahead
> > and
> > > > open
> > > > > > the vote in the next day or so.
> > > > > >
> > > > > > Thanks,
> > > > > > Justine
> > > > > >
> > > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> > jols...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Find and replace has failed me :(
> > > > > > >
> > > > > > > Group version seems a little vague, but we can update it.
> > Hopefully
> > > > > find
> > > > > > > and replace won't fail me again, otherwise I will get another
> > email
> > > > on
> > > > > > this.
> > > > > > >
> > > > > > > Justine
> > > > > > >
> > > > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > > > >  > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks, Justine.
> > > > > > >>
> > > > > > >> * Should we also use `group.version` (GV) as I suggested in my
> > > > > previous
> > > > > > >> message in order to be consistent?
> > > > > > >> * Should we add both names to the `Public Interfaces` section?
> > > > > > >> * There is still at least one usage of
> > > > `transaction.protocol.verison`
> > > > > in
> > > > > > >> the KIP too.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> David
> > > > > > >>
> > > > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > > > >> 
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > I had missed the David's message yesterday about the naming
> > for
> > > > > > >> transaction
> > > > > > >> > version vs transaction protocol version.
> > > > > > >> >

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Justine Olshan
Updated. :)
Thanks for the reviews

Justine

On Thu, Apr 11, 2024 at 11:01 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP.
>
> Perhaps it's better to name the new config unstable.feature.versions.enable
> since there could be multiple unstable versions.
>
> Other than that, the KIP looks good to me.
>
> Jun
>
> On Thu, Apr 11, 2024 at 9:06 AM Justine Olshan
> 
> wrote:
>
> > The original config was never actually approved in any KIP. But we can
> say
> > it is deprecated.
> > I can change the config name.
> >
> > Justine
> >
> > On Thu, Apr 11, 2024 at 8:52 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the updated KIP.
> > >
> > > Would unstable.feature.version.enable be a clearer name? Also, should
> we
> > > remove/deprecate unstable.metadata.versions.enable in this KIP?
> > >
> > > Jun
> > >
> > > On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Makes sense to me. It seems like KIP-1014 has been inactive
> recently. I
> > > can
> > > > update my KIP and mention this change on that discussion thread.
> > > >
> > > > Justine
> > > >
> > > > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao 
> > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > A single config makes sense to me too. We just need to reach
> > consensus
> > > > with
> > > > > KIP-1014.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > That's a good question. I think maybe for simplicity, we can
> have a
> > > > > single
> > > > > > config?
> > > > > > If that makes sense, I will update the KIP.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao  >
> > > > wrote:
> > > > > >
> > > > > > > Hi, Justine,
> > > > > > >
> > > > > > > Thanks for the updated KIP.
> > > > > > >
> > > > > > > One more question related to KIP-1014. It introduced a new
> > > > > > > config unstable.metadata.versions.enable. Does each new feature
> > > need
> > > > to
> > > > > > > have a corresponding config to enable the testing of unstable
> > > > features
> > > > > or
> > > > > > > should we have a generic config enabling the testing of all
> > > unstable
> > > > > > > features?
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > > > > >  > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I'm hoping this covers the majority of comments. I will go
> > ahead
> > > > and
> > > > > > open
> > > > > > > > the vote in the next day or so.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Justine
> > > > > > > >
> > > > > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> > > > jols...@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Find and replace has failed me :(
> > > > > > > > >
> > > > > > > > > Group version seems a little vague, but we can update it.
> > > > Hopefully
> > > > > > > find
> > > > > > > > > and replace won't fail me again, otherwise I will get
> another
> > > > email
> > > > > > on
> > > > > > > > this.
> > > > > > > > >

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-12 Thread Justine Olshan
Hi Jun,

Ok sounds good.

Justine

On Fri, Apr 12, 2024 at 10:17 AM Jun Rao  wrote:

> Hi, Justine,
>
> unstable.metadata.versions.enable is an internal configuration. So, we
> could probably just remove it instead of depreciation. Also, it would be
> useful to make it clear that unstable.feature.versions.enable is an
> internal configuration.
>
> Thanks,
>
> Jun
>
> On Thu, Apr 11, 2024 at 11:16 AM Justine Olshan
>  wrote:
>
> > Updated. :)
> > Thanks for the reviews
> >
> > Justine
> >
> > On Thu, Apr 11, 2024 at 11:01 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the updated KIP.
> > >
> > > Perhaps it's better to name the new config
> > unstable.feature.versions.enable
> > > since there could be multiple unstable versions.
> > >
> > > Other than that, the KIP looks good to me.
> > >
> > > Jun
> > >
> > > On Thu, Apr 11, 2024 at 9:06 AM Justine Olshan
> > > 
> > > wrote:
> > >
> > > > The original config was never actually approved in any KIP. But we
> can
> > > say
> > > > it is deprecated.
> > > > I can change the config name.
> > > >
> > > > Justine
> > > >
> > > > On Thu, Apr 11, 2024 at 8:52 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the updated KIP.
> > > > >
> > > > > Would unstable.feature.version.enable be a clearer name? Also,
> should
> > > we
> > > > > remove/deprecate unstable.metadata.versions.enable in this KIP?
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Makes sense to me. It seems like KIP-1014 has been inactive
> > > recently. I
> > > > > can
> > > > > > update my KIP and mention this change on that discussion thread.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao  >
> > > > wrote:
> > > > > >
> > > > > > > Hi, Justine,
> > > > > > >
> > > > > > > A single config makes sense to me too. We just need to reach
> > > > consensus
> > > > > > with
> > > > > > > KIP-1014.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> > > > > >  > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Jun,
> > > > > > > >
> > > > > > > > That's a good question. I think maybe for simplicity, we can
> > > have a
> > > > > > > single
> > > > > > > > config?
> > > > > > > > If that makes sense, I will update the KIP.
> > > > > > > >
> > > > > > > > Justine
> > > > > > > >
> > > > > > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao
> >  > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Justine,
> > > > > > > > >
> > > > > > > > > Thanks for the updated KIP.
> > > > > > > > >
> > > > > > > > > One more question related to KIP-1014. It introduced a new
> > > > > > > > > config unstable.metadata.versions.enable. Does each new
> > feature
> > > > > need
> > > > > > to
> > > > > > > > > have a corresponding config to enable the testing of
> unstable
> > > > > > features
> > > > > > > or
> > > > > > > > > should we have a generic config enabling the testing of all
> > > > > unstable
> > > > > > > > > features?
> > > > > > > >

Re: [VOTE] KIP-1022 Formatting and Updating Features

2024-04-15 Thread Justine Olshan
Hey folks,

Thanks everyone! I will go ahead and call it.
The KIP passes with the following +1 votes:

- Andrew Schofield (non-binding)
- David Jacot (binding)
- José Armando García Sancio (binding)
- Jun Rao (binding)

Thanks again,
Justine

On Fri, Apr 12, 2024 at 11:16 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the KIP. +1
>
> Jun
>
> On Wed, Apr 10, 2024 at 9:13 AM José Armando García Sancio
>  wrote:
>
> > Hi Justine,
> >
> > +1 (binding)
> >
> > Thanks for the improvement.
> > --
> > -José
> >
>


Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

2024-04-17 Thread Justine Olshan
Hey Alieh,

I echo what Omnia says; I'm not sure I understand the implications of the
change, and I think more detail is needed.

This comment also confused me a bit:
* {@code ClientExceptionHandler} that continues the transaction even if a
record is too large.
* Otherwise, it makes the transaction to fail.

Relatedly, I've been working with some folks on a KIP for transaction
errors and how they are handled. Specifically for RecordTooLargeException
(and a few other errors), we want to introduce a new error category that
allows the application to choose how the error is handled. Maybe that KIP is
something that you are looking for? Stay tuned :)

Justine





On Wed, Apr 17, 2024 at 8:03 AM Omnia Ibrahim 
wrote:

> Hi Alieh,
> Thanks for the KIP! I have couple of comments
> - You mentioned in the KIP motivation,
> > Another example for which a production exception handler could be useful
> is if a user tries to write into a non-existing topic, which returns a
> retryable error code; with infinite retries, the producer would hang
> retrying forever. A handler could help to break the infinite retry loop.
>
> How the handler can differentiate between something that is temporary and
> it should keep retrying and something permanent like forgot to create the
> topic? temporary here could be
>  the producer get deployed before the topic creation finish (specially if
> the topic creation is handled via IaC)
>  temporary offline partitions
>  leadership changing
> Isn’t this putting the producer at risk of dropping records
> unintentionally?
> - Can you elaborate more on what is written in the compatibility /
> migration plan section, explaining in a bit more detail what the changing
> behaviour is and how it will impact clients who are upgrading?
> - In the proposed changes, can you elaborate in the KIP on where in the
> producer lifecycle ClientExceptionHandler and
> TransactionExceptionHandler will get triggered, and how the producer will be
> configured to point to a custom implementation.
>
> Thanks
> Omnia
>
> > On 17 Apr 2024, at 13:13, Alieh Saeedi 
> wrote:
> >
> > Hi all,
> > Here is the KIP-1038: Add Custom Error Handler to Producer.
> > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1038%3A+Add+Custom+Error+Handler+to+Producer
> >
> > I look forward to your feedback!
> >
> > Cheers,
> > Alieh
>
>


Re: [ANNOUNCE] New Kafka PMC Member: Greg Harris

2024-04-18 Thread Justine Olshan
Congratulations Greg!

On Thu, Apr 18, 2024 at 12:03 AM Matthias J. Sax  wrote:

> Congrats Greg!
>
> On 4/15/24 10:44 AM, Hector Geraldino (BLOOMBERG/ 919 3RD A) wrote:
> > Congrats! Well deserved
> >
> > From: dev@kafka.apache.org At: 04/13/24 14:42:22 UTC-4:00To:
> dev@kafka.apache.org
> > Subject: [ANNOUNCE] New Kafka PMC Member: Greg Harris
> >
> > Hi all,
> >
> > Greg Harris has been a Kafka committer since July 2023. He has remained
> > very active and instructive in the community since becoming a committer.
> > It's my pleasure to announce that Greg is now a member of Kafka PMC.
> >
> > Congratulations, Greg!
> >
> > Chris, on behalf of the Apache Kafka PMC
> >
> >
>


Re: [VOTE] KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission

2024-04-19 Thread Justine Olshan
Hey Nikhil,

I meant to comment on the discussion thread, but my draft took so long, you
opened the vote.

Regardless, I just wanted to say that it makes sense to me. +1 (binding)

Justine

On Fri, Apr 19, 2024 at 7:22 AM Nikhil Ramakrishnan <
ramakrishnan.nik...@gmail.com> wrote:

> Hi everyone,
>
> I would like to start a voting thread for KIP-1037: Allow
> WriteTxnMarkers API with Alter Cluster Permission
> (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1037%3A+Allow+WriteTxnMarkers+API+with+Alter+Cluster+Permission
> )
> as there have been no objections on the discussion thread.
>
> For comments or feedback please check the discussion thread here:
> https://lists.apache.org/thread/bbkyt8mrc8xp3jfyvhph7oqtjxl29xmn
>
> Thanks,
> Nikhil
>


Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

2024-04-23 Thread Justine Olshan
> >>> Thanks,
> >>> Andrew
> >>>
> >>>
> >>>
> >>>> On 17 Apr 2024, at 18:17, Chris Egerton 
> >> wrote:
> >>>>
> >>>> Hi Alieh,
> >>>>
> >>>> Thanks for the KIP! The issue with writing to non-existent topics is
> >>>> particularly frustrating for users of Kafka Connect and has been the
> >> source
> >>>> of a handful of Jira tickets over the years. My thoughts:
> >>>>
> >>>> 1. An additional detail we can add to the motivation (or possibly
> >> rejected
> >>>> alternatives) section is that this kind of custom retry logic can't be
> >>>> implemented by hand by, e.g., setting retries to 0 in the producer
> >> config
> >>>> and handling exceptions at the application level. Or rather, it can,
> >> but 1)
> >>>> it's a bit painful to have to reimplement at every call-site for
> >>>> Producer::send (and any code that awaits the returned Future) and 2)
> >> it's
> >>>> impossible to do this without losing idempotency on retries.
> >>>>
> >>>> 2. That said, I wonder if a pluggable interface is really the right
> call
> >>>> here. Documenting the interactions of a producer with
> >>>> a ClientExceptionHandler instance will be tricky, and implementing
> them
> >>>> will also be a fair amount of work. I believe that there needs to be
> >> some
> >>>> more granularity for how writes to non-existent topics (or really,
> >>>> UNKNOWN_TOPIC_OR_PARTITION and related errors from the broker) are
> >> handled,
> >>>> but I'm torn between keeping it simple with maybe one or two new
> >> producer
> >>>> config properties, or a full-blown pluggable interface. If there are
> >> more
> >>>> cases that would benefit from a pluggable interface, it would be nice
> to
> >>>> identify these and add them to the KIP to strengthen the motivation.
> >> Right
> >>>> now, I'm not sure the two that are mentioned in the motivation are
> >>>> sufficient.
> >>>>
> >>>> 3. Alternatively, a possible compromise is for this KIP to introduce
> new
> >>>> properties that dictate how to handle unknown-topic-partition and
> >>>> record-too-large errors, with the thinking that if we introduce a
> >> pluggable
> >>>> interface later on, these properties will be recognized by the default
> >>>> implementation of that interface but could be completely ignored or
> >>>> replaced by alternative implementations.
> >>>>
> >>>> 4. (Nit) You can remove the "This page is meant as a template for
> >> writing a
> >>>> KIP..." part from the KIP. It's not a template anymore :)
> >>>>
> >>>> 5. If we do go the pluggable interface route, wouldn't we want to add
> >> the
> >>>> possibility for retry logic? The simplest version of this could be to
> >> add a
> >>>> RETRY value to the ClientExceptionHandlerResponse enum.
> >>>>
> >>>> 6. I think "SKIP" or "DROP" might be clearer instead of "CONTINUE" for
> >>>> the ClientExceptionHandlerResponse enum, since they cause records to
> be
> >>>> dropped.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Chris
> >>>>
> >>>> On Wed, Apr 17, 2024 at 12:25 PM Justine Olshan
> >>>>  wrote:
> >>>>
> >>>>> Hey Alieh,
> >>>>>
> >>>>> I echo what Omnia says, I'm not sure I understand the implications of
> >> the
> >>>>> change and I think more detail is needed.
> >>>>>
> >>>>> This comment also confused me a bit:
> >>>>> * {@code ClientExceptionHandler} that continues the transaction even
> >> if a
> >>>>> record is too large.
> >>>>> * Otherwise, it makes the transaction to fail.
> >>>>>
> >>>>> Relatedly, I've been working with some folks on a KIP for
> transactions
> >>>>> errors and how they are handled. Specifically for the
> >>>>> RecordTooLargeException (and a few other errors), we want to give a
> new
> >>>>> error category for this error that allows the application to choose
> >> how it
> >>>>> is handled. Maybe this KIP is something that you are looking for?
> Stay
> >>>>> tuned :)
> >>>>>
> >>>>> Justine
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 17, 2024 at 8:03 AM Omnia Ibrahim <
> o.g.h.ibra...@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Alieh,
> >>>>>> Thanks for the KIP! I have couple of comments
> >>>>>> - You mentioned in the KIP motivation,
> >>>>>>> Another example for which a production exception handler could be
> >>>>> useful
> >>>>>> is if a user tries to write into a non-existing topic, which
> returns a
> >>>>>> retryable error code; with infinite retries, the producer would hang
> >>>>>> retrying forever. A handler could help to break the infinite retry
> >> loop.
> >>>>>>
> >>>>>> How the handler can differentiate between something that is
> temporary
> >> and
> >>>>>> it should keep retrying and something permanent like forgot to
> create
> >> the
> >>>>>> topic? temporary here could be
> >>>>>> the producer get deployed before the topic creation finish
> (specially
> >> if
> >>>>>> the topic creation is handled via IaC)
> >>>>>> temporary offline partitions
> >>>>>> leadership changing
> >>>>>>Isn’t this putting the producer at risk of dropping records
> >>>>>> unintentionally?
> >>>>>> - Can you elaborate more on what is written in the compatibility /
> >>>>>> migration plan section please by explaining in bit more details what
> >> is
> >>>>> the
> >>>>>> changing behaviour and how this will impact client who are
> upgrading?
> >>>>>> - In the proposal changes can you elaborate in the KIP where in the
> >>>>>> producer lifecycle will ClientExceptionHandler and
> >>>>>> TransactionExceptionHandler get triggered, and how will the producer
> >>>>>> configure them to point to costumed implementation.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Omnia
> >>>>>>
> >>>>>>> On 17 Apr 2024, at 13:13, Alieh Saeedi
>  >>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>> Here is the KIP-1038: Add Custom Error Handler to Producer.
> >>>>>>> <
> >>>>>>
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1038%3A+Add+Custom+Error+Handler+to+Producer
> >>>>>>>
> >>>>>>> I look forward to your feedback!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Alieh
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
>
>


Re: [VOTE]: KIP-1050: Consistent error handling for Transactions

2024-08-26 Thread Justine Olshan
+1 from me

Thanks,

Justine

On Sun, Aug 25, 2024 at 11:54 PM Kaushik Raina 
wrote:

> Hi everyone,
> I would like to start a voting thread for KIP-1050: Consistent error
> handling for Transactions
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions
>
>
> This KIP aims to improve error handling and messages for EoS and therefore
> improve developer experience.
>
> Thanks
> Kaushik Raina
>


Re: [ANNOUNCE] New committer: Lianet Magrans

2024-08-28 Thread Justine Olshan
Congratulations Lianet!!
Justine

On Wed, Aug 28, 2024 at 8:58 AM David Arthur  wrote:

> Congrats, Lianet!
>
> On Wed, Aug 28, 2024 at 11:48 AM Mickael Maison 
> wrote:
>
> > Congratulations Lianet!
> >
> > On Wed, Aug 28, 2024 at 5:40 PM Josep Prat 
> > wrote:
> > >
> > > Congrats Lianet!
> > >
> > > On Wed, Aug 28, 2024 at 5:38 PM Chia-Ping Tsai 
> > wrote:
> > >
> > > > Congratulations, Lianet!!!
> > > >
> > > > On 2024/08/28 15:35:23 David Jacot wrote:
> > > > > Hi all,
> > > > >
> > > > > The PMC of Apache Kafka is pleased to announce a new Kafka
> committer,
> > > > > Lianet Magrans.
> > > > >
> > > > > Lianet has been a Kafka contributor since June 2023. In addition to
> > > > > being a regular contributor and reviewer, she has made significant
> > > > > contributions to the next generation of the consumer rebalance
> > > > > protocol (KIP-848) and to the new consumer. She has also
> contributed
> > > > > to discussing and reviewing many KIPs.
> > > > >
> > > > > Congratulations, Lianet!
> > > > >
> > > > > Thanks,
> > > > > David (on behalf of the Apache Kafka PMC)
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > *Josep Prat*
> > > Open Source Engineering Director, *Aiven*
> > > josep.p...@aiven.io   |   +491715557497
> > > aiven.io
> > > *Aiven Deutschland GmbH*
> > > Alexanderufer 3-7, 10117 Berlin
> > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > Anna Richardson, Kenneth Chen
> > > Amtsgericht Charlottenburg, HRB 209739 B
> >
>
>
> --
> David Arthur
>


Re: Confluence/wiki account creation -- self-service fixed

2024-08-28 Thread Justine Olshan
Thanks Gavin and Matthias for making this process better!

Justine

On Wed, Aug 28, 2024 at 10:37 AM Matthias J. Sax  wrote:

> Hello,
>
> I want to share great news. The new confluence/wiki account creating
> self-service is now available!
>
> Cf https://selfserve.apache.org/
>
> I updated the KIP wiki page accordingly.
>
> Thanks a lot Gavin for pushing this over the finish line!
>
>
>
> -Matthias
>


Re: [DISCUSS] KIP-1050: Consistent error handling for Transactions

2024-08-30 Thread Justine Olshan
Hey Matthias,

102/103) Sorry for the confusion on the TransactionAbortableException.
Here is the error -- we have an error name and an exception name -- this is
the same error mentioned in KIP-890. It is also implemented so that new
clients throw this error when transaction verification fails.
https://github.com/apache/kafka/blob/70dd577286de31ef20dc4f198e95f9b9e4479b47/clients/src/main/java/org/apache/kafka/common/protocol/Errors.java#L405


I think the name can be standardized so we don't have two?
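
For anyone following along, here is a rough sketch of the error-name /
exception-name pairing being referenced. It is illustrative only, not the
actual Errors.java source; the names and message text below are made up.

import java.util.function.Function;

// Sketch: the error constant name and the exception class name are declared
// separately, which is why the two names can drift apart.
enum ErrorsSketch {
    TRANSACTION_ABORTABLE("The transaction can be aborted and then retried.",
            TransactionAbortableExceptionSketch::new);

    private final String message;
    private final Function<String, RuntimeException> builder;

    ErrorsSketch(String message, Function<String, RuntimeException> builder) {
        this.message = message;
        this.builder = builder;
    }

    RuntimeException exception() {
        return builder.apply(message);
    }
}

class TransactionAbortableExceptionSketch extends RuntimeException {
    TransactionAbortableExceptionSketch(String message) {
        super(message);
    }
}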

106) The Producer does return those errors when trying to commit
transactional offsets. See
https://github.com/apache/kafka/blob/70dd577286de31ef20dc4f198e95f9b9e4479b47/clients/src/main/java/org/apache/kafka/common/requests/TxnOffsetCommitResponse.java#L35

I will let Kaushik respond to some of the other points.

Justine

On Fri, Aug 30, 2024 at 3:57 PM Matthias J. Sax  wrote:

> Thanks for updating the KIP. It's much clearer now what you propose. I
> have a bunch of question about the proposal:
>
>
>
> (100) nit (typo / missing word?):
>
> > We would new error types
>
>
>
> (101) `TransactionAbortableException`, `ProducerFencedException`, and
> `UnknownProducerIdException` are listed twice in the tables.
>
>
>
> (102) You introduce a new exception `AbortableTransactionException`
> which will only be extended by `TransactionAbortableException`. Given
> that the existing TransactionAbortableException is not thrown by the
> producer right now (not passed into the `Callback`), it seems that if the
> producer starts to throw/return the existing
> `TransactionAbortableException` or the new
> `AbortableTransactionException`, it would be an incompatible change?
>
>
>
> (103) It's unclear which method would throw the new
> `AbortableTransactionException` and/or if this new exception might be
> passed into the producer's send `Callback`.
>
>
>
> Btw: KIP-890 does not mention `TransactionAbortableException`... Does
> KIP-890 need an update? KIP-890 only mentions a new error code
> TRANSACTION_ABORTABLE -- or is this an implicit introduction of
> TransactionAbortableException -- I am not familiar with the details how
> core KIPs are written?
>
>
>
> (104) The KIP does not explicitly say, which of the new exceptions are
> actually user facing? It seems only AbortableTransactionException,
> ApplicationRecoverableTransactionException, and
> InvalidConfiguationTransactionException are exceptions which users will be
> able to catch (or handle via the `Callback`), while
> ProducerRetriableTransactionException and
> ProducerRefreshRetriableTransactionException won't be thrown/returned by
> the producer into the app code?
>
>
>
> (105) `IllegalStateException` and `RuntimeException` which are Java
> exceptions are listed in the table of
> `ApplicationRecoverableTransactionException`: I think it is not valid to
> list them, as we cannot change their super-class.
>
>
>
> (106) `UnknownMemberIdException`, `IllegalGenerationException`, and
> `CorrelationIdMismatchException` are listed in the table of
> `ApplicationRecoverableTransactionException` but it seems they are not
> thrown/returned by the producer atm. If we start to throw/return either
> of them it seems to be a backward incompatible change.
>
>
>
> (106) Similarly to 105, `InvalidRecordException` and
> `InvalidRequiredAcksException` are listed in the table of
> `InvalidConfiguationTransactionException` but they seem not to be user
> facing.
>
>
>
>
> -Matthias
>
>
> On 7/25/24 8:50 AM, Kaushik Raina wrote:
> > Additionally,
> > - We will be deprecating KIP-691 in favour of KIP-1050.
> >
> >
> > On Fri, Jun 21, 2024 at 12:20 PM Kaushik Raina 
> wrote:
> >
> >> Thanks Matthias for feedback
> >> - We have updates KIP and grouped exceptions
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions#KIP1050:ConsistenterrorhandlingforTransactions-ExceptionTable
> >>
> >> - Regarding compatibility,  all changes in KIP are expected to be
> > *backward
> >> compatible*.  We have updated KIP to make it clear.
> >>
> >>
> >> On Tue, Jun 11, 2024 at 4:50 AM Matthias J. Sax 
> wrote:
> >>
> >>> Thanks for this KIP. Great to see it. I would assume it will make
> >>> KIP-691 unnecessary?
> >>>
> >>> I don't think I fully understand the proposal yet. It's clear, that you
> >>> propose to add new sub-classed to group existing exceptions. But it's
> >>> not clear to me, which of the existing exceptions (which implement
> >>> ApiException directly right now) will get a new parent class and go
> into
> >>> the same group. You only list `InvalidProducerEpochException` which
> gets
> >>> `AbortableTransactionException` as new parent. It would help a lot, if
> >>> you could list out explicitly, which existing exceptions are grouped
> >>> together via which sub-class.
> >>>
> >>> It should be sufficient to just add a list for each group. For the
> newly
> >>> added exception classes, I would also omit all constructors etc and
> just
> >>> add a comment about

Re: [DISCUSS] Single broker failures causing offline partitions

2024-09-10 Thread Justine Olshan
Hey Martin.

I recommend you take a look at KIP-966. I think can help the use case you
are describing.
The KIP talks about failure scenarios, but I believe it will also help when
the leader has issues and kicks its followers out of ISR.
The goal is to better handle the "last replica standing" issue

https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas

Let me know if it helps,

Justine

On Tue, Sep 10, 2024 at 9:00 AM Martin Dickson
 wrote:

> Hi all,
>
> We have a recurring issue with single broker failures causing offline
> partitions. The issue is that when a leader is degraded, follower fetches
> can fail to happen in a timely manner, and all followers can fall out of
> sync. If that leader then later fails then the partition will go offline,
> but even if it remains only partially failed then applications might still
> be impacted (for example, if the producer is using acks=all and
> min.insync.replicas=2). This can all happen because of a problem solely
> with the leader, and hence a single broker failure can cause
> unavailability, even if RF=3 or higher.
>
> We’ve seen the issue with various kinds of failures, mostly related to
> failing disks, e.g. due to pressure on request handler threads as a result
> of produce requests waiting on a slow disk. But the easiest way for us to
> reproduce it is at the outgoing network level: Setting up a cluster with
> moderate levels of ingoing throughput then injecting 50% outgoing packet
> drop on a single broker is enough to cause follower
> requests to be slow and replication to lag, but not enough for that broker
> to lose its connection to ZK. This triggers the degraded broker to become
> the only member of ISR.
>
> We have replica.lag.time.max.ms=10000 and zookeeper.session.timeout.ms=6000
> (the pre-KIP-537 values, 1/3 of the current defaults, to control produce
> latency when a follower is failing). We are also able to reproduce the
> issue in the same way on a KRaft cluster with the KRaft defaults. (Note
> that we are not very experienced with operating KRaft as we aren’t running
> it in production yet.)
>
> The last KIP I saw regarding this was KIP-501
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-501+Avoid+out-of-sync+or+offline+partitions+when+follower+fetch+requests+are+not+processed+in+time
> >,
> which describes this exact problem. The proposed solution there was in the
> first part to introduce a notion of pending requests, and second part to
> relinquish leadership if pending requests are taking too long. The
> discussion
> thread 
> for that doesn’t come to a conclusion. However it is pointed out that not
> all failure modes would be solved by the pending requests approach, and
> that whilst relinquishing leadership seems ideal there are concerns about
> this thrashing in certain failure modes.
>
> We are experimenting with a variation on KIP-501 where we add a heuristic
> for brokers failing this way: if the leader for a partition has removed
> many followers from ISR in a short period of time (including the case when
> it sends a single AlterPartition request removing all followers from ISR
> and thereby shrinking ISR only to itself), have the controller ignore this
> request and instead choose one of the followers to become leader. To avoid
> thrashing, rate-limit how often the controller can do this per
> topic-partition. We have tested that this fixes our repro, but have not
> productionised it (see rough draft PR
> ). We have only
> implemented
> ZK-mode so far. We implemented this on the controller side out of
> convenience (no API changes), but potentially the demotion decision should
> be taken at the broker level instead, which should also be possible.
>
> Whilst the code change is small, the proposed solution we’re investigating
> isn’t very clean and we’re not totally satisfied with it. We wanted to get
> some ideas from the community on:
> 1. How are other folks handling this class of issues?
> 2. Is there any interest in adding more comprehensive failing
> broker detection to Kafka (particularly how this could look in KRaft)?
> 3. Is there any interest in having a heuristic failure detection like the
> one described above?
>
> Thanks,
> Martin
>


Re: [VOTE] KIP-1052: Enable warmup in producer performance test

2024-09-11 Thread Justine Olshan
+1 (binding) from me.

Thanks,
Justine

On Wed, Sep 4, 2024 at 3:22 PM Welch, Matt  wrote:

> Hi Kafka devs,
>
> Bumping this VOTE thread again for visibility.
>
> Thanks,
> Matt
>
> -Original Message-
> From: Welch, Matt 
> Sent: Friday, August 23, 2024 4:26 PM
> To: dev@kafka.apache.org
> Subject: RE: [VOTE] KIP-1052: Enable warmup in producer performance test
>
> Hi Kafka devs,
>
> Bumping this VOTE thread for visibility.
>
> Thanks,
> Matt
>
> -Original Message-
> From: Federico Valeri 
> Sent: Monday, August 19, 2024 12:38 AM
> To: dev@kafka.apache.org
> Subject: Re: [VOTE] KIP-1052: Enable warmup in producer performance test
>
> Hi Matt, +1 (non binding) from me. Thanks!
>
> Just a suggestion: I think that the following output line does not add
> much value and could be removed.
>
> "Warmup first 10 records. Steady-state results will print after the
> complete-test summary."
>
> On Wed, Aug 14, 2024 at 8:06 PM Welch, Matt  wrote:
> >
> >
> > Hi all,
> >
> > It seems discussion has been quiet for a couple of weeks so I'd like
> > to call a vote on KIP-1052
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1052%3A+Enable+warmup+in+producer+performance+test
> >
> > Thanks,
> > Matt Welch
> >
>


Re: Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-09-16 Thread Justine Olshan
Hey there Jiunn-Yang,

Thanks for taking the time to update the KIP. I noticed the section now has
some crossed out components. I am a little confused since I think this
means we originally took a certain approach, but decided against it later?
Is it possible to explain in words, rather than by crossing out text, that we
took one approach but decided against it later? I think it will help anyone who reads
this KIP in the future understand.

Thanks again,
Justine

On Sat, Sep 14, 2024 at 6:21 PM 黃竣陽  wrote:

> Hi everyone,
>
> I make some update to KIP-1022 <
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1022%3A+Formatting+and+Updating+Features
> >
> As part of KIP-584, brokers expose supported version ranges for features,
> which may include 0. Version 0 can indicate either an incomplete feature or
> a disabled one. However, older software versions have an issue in
> SupportedVersionRange that prevents handling ranges that include 0, leading
> to deserialization problems in ApiVersionsResponse.
> One approach is to modify the minimum version from 0 to 1, which passes old
> client validation but removes the ability to disable the feature. Another
> approach is to omit features with min version 0 in requests.
> KAFKA-17011 implements the solution of modifying the minimum version from 0
> to 1 for RPC versions v0 to v3. This passes old client validation but
> removes the ability to disable the feature.
> A related issue occurs when a new broker sends a BrokerRegistrationRequest
> to an old controller, as the old controller does not support version 1. The
> solution is to omit features with version 0 in requests for versions v0 to
> v3. Therefore, we need to adjust the approach to resolve this dilemma.
> In KAFKA-17492, we omit the minimum version of 0 for RPC versions v0 to v3,
> allowing a 3.9 broker to join a 3.8 broker controller.
>
> Please let me know if you have any questions or concerns regarding this
> approach.
> Thanks you,
> Best regard,
> Jiunn-Yang
>
> On 2024/04/02 18:49:50 Justine Olshan wrote:
> > Hi José
> >
> > I originally had each on a new line and then switched to a single line
> > since it looked like a lot of space. I can switch it back.
> >
> > I don't think it makes a big difference if we implement it for both
> tools.
> > Both tools will have use for it.
> >
> > I can change the name to feature-dependencies if that makes it clearer.
> >
> > I can say that we won't allow cycles, but I'm not sure the best way to
> > enforce this.
> >
> > I just put ">" to show output. I can change the color if that makes it
> > clearer.
> >
> > Justine
> >
> > On Tue, Apr 2, 2024 at 11:41 AM José Armando García Sancio
> >  wrote:
> >
> > > Hi Justine,
> > >
> > > See my comments below.
> > >
> > > On Mon, Apr 1, 2024 at 4:43 PM Justine Olshan
> > >  wrote:
> > > > 20. I can update the KIP.
> > >
> > > I took a look at the latest KIP.
> > >
> > > * Should the output of this command "bin/kafka-features.sh
> > > version-mapping --release-version 3.6-IV1" be something like this:
> > > metadata.version=13
> > > transaction.protocol.version=0
> > > group.coordinator.version=0
> > > kraft.version=0
> > >
> > > I added the kraft.version to show the changes that are coming from
> > > KIP-853. I find this formatting much easier to parse by scripts/tools.
> > > They can even use Java's Properties parser if they want.
> > >
> > > * Maybe I missed the discussion but can you talk about why both
> > > "kafka-storage" and "kafka-features" are going to implement the
> > > "version-mapping" and dependencies"? I assume that it is sufficient
> > > for only one (kafka-features) to implement these new commands.
> > >
> > > * For the command "dependencies" in the "kafka-features" command, it
> > > is probably obvious that the dependencies listed are feature version
> > > dependencies. This is not obvious when the user uses "kafka-storage
> > > dependencies".  This reads as the dependencies of kafka-storage.
> > >
> > > * Should we state that Kafka will not allow cycles in the feature
> > > version dependency definition? Meaning the user is not allowed to
> > > define a feature version X which depends on Y which depends on Z which
> > > depends on X.
> > >
> > > * Some of the example output start with the characters "> ". Will the
> > > CLI print those characters or is that just part of the KIP's
> > > formating?
> > >
> > > Thanks,
> > > --
> > > -José
> > >
> >
>


Re: [DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-08-17 Thread Justine Olshan
Hey all,
Quick update to this KIP. While working on the PR and tests I realized that
we have a hardcoded value for how often we clean up producer IDs. Currently
this value is 10 minutes and is defined in LogManager.
I thought for better testing and ease of use of the new configuration, we
should also be able to configure the cleanup interval.

Here is the new configuration I'm hoping to add. I also added it to the KIP.
name: producer.id.expiration.check.interval.ms
description: The interval at which to remove producer IDs that have expired
due to producer.id.expiration.ms passing
default: 600000 (10 minutes)
valid values: [1,...]
priority: low
update-mode: read-only
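
For illustration, the two settings discussed in this thread would look like the
following in a broker's server.properties. The check interval shown is the
default mentioned above; the producer.id.expiration.ms value is just an example.

# How long an idle producer ID is retained before it is eligible for expiration.
producer.id.expiration.ms=86400000
# How often the broker removes producer IDs that have expired (the previously
# hardcoded 10 minute interval).
producer.id.expiration.check.interval.ms=600000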

I left the default as the current hardcoded value to avoid disruptions. If
there are any issues with this change let me know.
Thanks,
Justine

On Fri, Aug 5, 2022 at 1:40 PM Justine Olshan  wrote:

> Awesome. Thanks Tom!
>
> I plan to open this KIP vote at the start of next week.
> Thanks all for the discussion! Let me know if there is anything else. :)
>
> Justine
>
> On Wed, Aug 3, 2022 at 11:32 AM Tom Bentley  wrote:
>
>> Hi Justine,
>>
>> That all seems reasonable to me, thanks!
>>
>> On Wed, 3 Aug 2022 at 19:14, Justine Olshan > >
>> wrote:
>>
>> > Hi Tom and Ismael,
>> >
>> > 1. Yes, there are definitely many ways to improve this issue and I plan
>> to
>> > write followup KIPs to address some of the larger changes.
>> > Just wanted to get this simple fix in as a short term measure to prevent
>> > issues with too many producer IDs in the cache. Stay tuned :)
>> >
>> > 2. I did have some offline discussion about informing the client. I
>> think
>> > for this specific KIP the default behavior in practice should not change
>> > enough to require this information to go back to the client. In other
>> > words, a reasonable configuration should not regress behavior. However,
>> > with the further changes I mention in 1, perhaps this is something we
>> want
>> > to do. And yes -- unfortunately the current state of Kafka is no longer
>> > totally consistent with KIP-98. This is something we probably want to
>> > clarify in the future.
>> >
>> > 3. I will update the config to mention it is not dynamic. I think since
>> the
>> > transactional id configuration is read-only, this should be too.
>> >
>> > 4. I can update this wording.
>> >
>> > 5. I think there are definitely benefits to the name `
>> > idempotent.pid.expiration.ms` but there are other ways this could cause
>> > confusion. And to be clear -- the configuration can expire a producer ID
>> > for a transactional producer as long as there isn't an ongoing
>> transaction.
>> >
>> > Let me know if you have any questions and thanks for taking a look!
>> >
>> > Justine
>> >
>> > On Wed, Aug 3, 2022 at 9:30 AM Ismael Juma  wrote:
>> >
>> > > Regarding 1, more can certainly be done, but I think it would be
>> > > complementary. As such, I think this KIP stands on its own and
>> additional
>> > > improvements can be handled via future KIPs (unless Justine wants to
>> > > combine things, of course).
>> > >
>> > > Ismael
>> > >
>> > > On Wed, Aug 3, 2022 at 9:12 AM Tom Bentley 
>> wrote:
>> > >
>> > > > Hi Justine,
>> > > >
>> > > > Thanks for the KIP! I can see that this is a pragmatic attempt to
>> > > address a
>> > > > nasty problem. I have a few questions:
>> > > >
>> > > > 1. The KIP makes the problem significantly harder to trigger, but
>> > doesn't
>> > > > eliminate it entirely. How confident are you that it will be
>> sufficient
>> > > in
>> > > > practice? We can point to applications which are creating idempotent
>> > > > producers at a high rate and say they're broken, but that doesn't do
>> > > > anything to defend the broker from an interaction pattern that
>> differs
>> > > only
>> > > > in rate from a "good application". Did you consider a new quota to
>> > limit
>> > > > the rate at which a (principal, clientId) can allocate new PIDs?
>> > > >
>> > > > 2. The KIP contains this sentence: "when an idempotent producer’s ID
>> > > > expires, it silently loses its idempotency guarantees." That's at
>> odds
>> > > wi

Re: [DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-08-18 Thread Justine Olshan
Hi Ismael,

We can do this if we think that an external configuration is not necessary.
Just wondering, is there a reason we don't want an external configuration
here?

Thanks,
Justine

On Wed, Aug 17, 2022 at 12:30 PM Ismael Juma  wrote:

> Why does this have to be an external config? We can provide an internal
> mechanism to configure this, no?
>
> Ismael
>
> On Wed, Aug 17, 2022 at 9:22 AM Justine Olshan
> 
> wrote:
>
> > Hey all,
> > Quick update to this KIP. While working on the PR and tests I realized
> that
> > we have a hardcoded value for how often we clean up producer IDs.
> Currently
> > this value is 10 minutes and is defined in LogManager.
> > I thought for better testing and ease of use of the new configuration, we
> > should also be able to configure the cleanup interval.
> >
> > Here is the new configuration I'm hoping to add. I also added it to the
> > KIP.
> > name: producer.id.expiration.check.interval.ms
> > description: The interval at which to remove producer IDs that have
> expired
> > due to producer.id.expiration.ms passing
> > default: 600000 (10 minutes)
> > valid values: [1,...]
> > priority: low
> > update-mode: read-only
> >
> > I left the default as the current hardcoded value to avoid disruptions.
> If
> > there are any issues with this change let me know.
> > Thanks,
> > Justine
> >
> > On Fri, Aug 5, 2022 at 1:40 PM Justine Olshan 
> > wrote:
> >
> > > Awesome. Thanks Tom!
> > >
> > > I plan to open this KIP vote at the start of next week.
> > > Thanks all for the discussion! Let me know if there is anything else.
> :)
> > >
> > > Justine
> > >
> > > On Wed, Aug 3, 2022 at 11:32 AM Tom Bentley 
> wrote:
> > >
> > >> Hi Justine,
> > >>
> > >> That all seems reasonable to me, thanks!
> > >>
> > >> On Wed, 3 Aug 2022 at 19:14, Justine Olshan
> >  > >> >
> > >> wrote:
> > >>
> > >> > Hi Tom and Ismael,
> > >> >
> > >> > 1. Yes, there are definitely many ways to improve this issue and I
> > plan
> > >> to
> > >> > write followup KIPs to address some of the larger changes.
> > >> > Just wanted to get this simple fix in as a short term measure to
> > prevent
> > >> > issues with too many producer IDs in the cache. Stay tuned :)
> > >> >
> > >> > 2. I did have some offline discussion about informing the client. I
> > >> think
> > >> > for this specific KIP the default behavior in practice should not
> > change
> > >> > enough to require this information to go back to the client. In
> other
> > >> > words, a reasonable configuration should not regress behavior.
> > However,
> > >> > with the further changes I mention in 1, perhaps this is something
> we
> > >> want
> > >> > to do. And yes -- unfortunately the current state of Kafka is no
> > longer
> > >> > totally consistent with KIP-98. This is something we probably want
> to
> > >> > clarify in the future.
> > >> >
> > >> > 3. I will update the config to mention it is not dynamic. I think
> > since
> > >> the
> > >> > transactional id configuration is read-only, this should be too.
> > >> >
> > >> > 4. I can update this wording.
> > >> >
> > >> > 5. I think there are definitely benefits to the name `
> > >> > idempotent.pid.expiration.ms` but there are other ways this could
> > cause
> > >> > confusion. And to be clear -- the configuration can expire a
> producer
> > ID
> > >> > for a transactional producer as long as there isn't an ongoing
> > >> transaction.
> > >> >
> > >> > Let me know if you have any questions and thanks for taking a look!
> > >> >
> > >> > Justine
> > >> >
> > >> > On Wed, Aug 3, 2022 at 9:30 AM Ismael Juma 
> wrote:
> > >> >
> > >> > > Regarding 1, more can certainly be done, but I think it would be
> > >> > > complementary. As such, I think this KIP stands on its own and
> > >> additional
> > >> > > improvements can be handled via future KIPs (unless Justine wants
> to
> > >> > > combine things, of course).
> > >> > >

Re: [DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-08-18 Thread Justine Olshan
Hi David,

I chose the name to match the variable name of the existing hardcoded
value. I also thought it was clearer what was happening.

I'm not sure how to follow the pattern above in a way that is not overly
verbose. It would have to be something like
producer.id.remove.expired.producer.id.cleanup.interval.ms (which matches
the "producer.id" prefix of the other added config) or
pid.remove.expired.pid.cleanup.interval.ms.

Both of these seemed repetitive. So there's something like
remove.expired.producer.id.interval.ms? This is starting to get away from
the existing configs though.
Let me know what you think.

Justine

On Thu, Aug 18, 2022 at 8:40 AM David Jacot 
wrote:

> Given that we already have
> `transaction.abort.timed.out.transaction.cleanup.interval.ms` and
> `transaction.remove.expired.transaction.cleanup.interval.ms`, it seems
> OK to add another one for our case here. Regarding the name, I would
> follow the pattern that we use for those two existing configs. We can
> define it as an internal config as well. This allows you to change it
> for your integration tests but does not advertise it.
>
> Best,
> David
>
> On Thu, Aug 18, 2022 at 5:34 PM Justine Olshan
>  wrote:
> >
> > Hi Ismael,
> >
> > We can do this if we think that an external configuration is not
> necessary.
> > Just wondering, is there a reason we don't want an external configuration
> > here?
> >
> > Thanks,
> > Justine
> >
> > On Wed, Aug 17, 2022 at 12:30 PM Ismael Juma  wrote:
> >
> > > Why does this have to be an external config? We can provide an internal
> > > mechanism to configure this, no?
> > >
> > > Ismael
> > >
> > > On Wed, Aug 17, 2022 at 9:22 AM Justine Olshan
> > > 
> > > wrote:
> > >
> > > > Hey all,
> > > > Quick update to this KIP. While working on the PR and tests I
> realized
> > > that
> > > > we have a hardcoded value for how often we clean up producer IDs.
> > > Currently
> > > > this value is 10 minutes and is defined in LogManager.
> > > > I thought for better testing and ease of use of the new
> configuration, we
> > > > should also be able to configure the cleanup interval.
> > > >
> > > > Here is the new configuration I'm hoping to add. I also added it to
> the
> > > > KIP.
> > > > name: producer.id.expiration.check.interval.ms
> > > > description: The interval at which to remove producer IDs that have
> > > expired
> > > > due to producer.id.expiration.ms passing
> > > > default: 600000 (10 minutes)
> > > > valid values: [1,...]
> > > > priority: low
> > > > update-mode: read-only
> > > >
> > > > I left the default as the current hardcoded value to avoid
> disruptions.
> > > If
> > > > there are any issues with this change let me know.
> > > > Thanks,
> > > > Justine
> > > >
> > > > On Fri, Aug 5, 2022 at 1:40 PM Justine Olshan 
> > > > wrote:
> > > >
> > > > > Awesome. Thanks Tom!
> > > > >
> > > > > I plan to open this KIP vote at the start of next week.
> > > > > Thanks all for the discussion! Let me know if there is anything
> else.
> > > :)
> > > > >
> > > > > Justine
> > > > >
> > > > > On Wed, Aug 3, 2022 at 11:32 AM Tom Bentley 
> > > wrote:
> > > > >
> > > > >> Hi Justine,
> > > > >>
> > > > >> That all seems reasonable to me, thanks!
> > > > >>
> > > > >> On Wed, 3 Aug 2022 at 19:14, Justine Olshan
> > > >  > > > >> >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Tom and Ismael,
> > > > >> >
> > > > >> > 1. Yes, there are definitely many ways to improve this issue
> and I
> > > > plan
> > > > >> to
> > > > >> > write followup KIPs to address some of the larger changes.
> > > > >> > Just wanted to get this simple fix in as a short term measure to
> > > > prevent
> > > > >> > issues with too many producer IDs in the cache. Stay tuned :)
> > > > >> >
> > > > >> > 2. I did have some offline discussion about informing the
> client. I
> > > > >> think
> > > > >> > for

Re: [DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-08-19 Thread Justine Olshan
Hi all,

Followed up with David and Ismael offline.
Ismael explained that we probably don't want to increase complexity and
didn't think the value needed to be modified beyond tests. I agree with
this so I've decided to go with an internal configuration (with the same
default as mentioned above).

From a user's perspective, there will be no change. Because this is
internal, we decided to keep the name that the internal variable used.

Let me know if anyone has any further concerns.
Thanks,
Justine


Re: [VOTE] KIP-854 Separate configuration for producer ID expiry

2022-09-08 Thread Justine Olshan
Hi all,

Thanks for voting on this KIP!

Taking a look at this change, I realized that rolling the cluster to apply
the configuration may be difficult when the issue is ongoing. The brokers
will become overloaded and it will be hard to roll. Since the configuration
can be dynamic, it makes sense to make it so. This way, we can minimize
impact by quickly changing the expiration (and potentially reverting it
back once the issue is resolved). I've updated the KIP to reflect that the
configuration should be dynamic rather than static, and that it will be
updated cluster-wide.

Let me know if there are any issues with this change.

Thanks,
Justine

On Thu, Aug 11, 2022 at 4:07 PM Justine Olshan  wrote:

> Hey all,
> Thanks for the votes!
> To summarize, we have 4 binding votes:
>
>- David Jacot
>- Jason Gustafson
>- Tom Bentley
>- Luke Chen
>
> and one non-binding vote:
>
>- Sagar
>
> This meets the requirements for the KIP to be accepted! Thanks again!
> Justine
>
> On Thu, Aug 11, 2022 at 2:13 AM Luke Chen  wrote:
>
>> Hi Justine,
>>
>> Thanks for the KIP.
>> +1 (binding) from me.
>>
>> Luke
>>
>> On Wed, Aug 10, 2022 at 1:44 AM Tom Bentley  wrote:
>>
>> > Hi Justine,
>> >
>> > Thanks again for the KIP, +1 (binding).
>> >
>> >
>> >
>> > On Tue, 9 Aug 2022 at 18:09, Jason Gustafson > >
>> > wrote:
>> >
>> > > Thanks Justine, +1 from me.
>> > >
>> > > On Tue, Aug 9, 2022 at 1:12 AM Sagar 
>> wrote:
>> > >
>> > > > Thanks for the KIP.
>> > > >
>> > > > +1(non-binding)
>> > > >
>> > > > Sagar.
>> > > >
>> > > > On Tue, Aug 9, 2022 at 1:13 PM David Jacot
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Thanks for the KIP, Justine. The proposal makes sense to me. I am
>> +1
>> > > > > (binding).
>> > > > >
>> > > > > Cheers,
>> > > > > David
>> > > > >
>> > > > > On Mon, Aug 8, 2022 at 6:18 PM Justine Olshan
>> > > > >  wrote:
>> > > > > >
>> > > > > > Hi all,
>> > > > > > I'd like to start a vote for KIP-854: Separate configuration for
>> > > > producer
>> > > > > > ID expiry.
>> > > > > >
>> > > > > > KIP:
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
>> > > > > > JIRA: https://issues.apache.org/jira/browse/KAFKA-14097
>> > > > > > Discussion thread:
>> > > > > >
>> https://lists.apache.org/thread/cz9x90883on98k082qd0tskj6yjhox1t
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Justine
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [ANNOUNCE] New committer: Deng Ziming

2022-10-10 Thread Justine Olshan
Congratulations Ziming! I'll always remember the help you provided with
topic IDs.
Very well deserved!

On Mon, Oct 10, 2022 at 9:53 AM Matthew Benedict de Detrich
 wrote:

> Congratulations!
>
> On Mon, 10 Oct 2022, 11:30 Jason Gustafson, 
> wrote:
>
> > Hi All
> >
> > The PMC for Apache Kafka has invited Deng Ziming to become a committer,
> > and we are excited to announce that he has accepted!
> >
> > Ziming has been contributing to Kafka for about three years. He has
> > authored
> > more than 100 patches and helped to review nearly as many. In particular,
> > he made significant contributions to the KRaft project which had a big
> part
> > in reaching our production readiness goal in the 3.3 release:
> > https://blogs.apache.org/kafka/entry/what-rsquo-s-new-in.
> >
> > Please join me in congratulating Ziming! Thanks for all of your
> > contributions!
> >
> > -- Jason, on behalf of the Apache Kafka PMC
> >
>


Re: [DISCUSS] solutions for broker OOM caused by many producer IDs

2022-10-18 Thread Justine Olshan
Oops.  I realized I just replied to Omnia 🤦‍♀️

Here was my response for the mailing thread:

Hey Omnia,
Sorry to hear this is a problem for you as well. :(
> * I have some concerns about the impact of this option on the
transactional producers, for example, what will happen to an ongoing
transaction associated with an expired PID? Would this leave the
transactions in a "hanging" state?*
We currently check if a transaction is ongoing and do not expire the
producer ID if it has an ongoing transaction. I suspect we will continue to
do this with any solution we pick.
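
To illustrate the guard described above, here is a rough sketch (made-up names,
not broker code) of an LRU-style producer ID cache that never evicts an entry
while it has an ongoing transaction.

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class ProducerIdCacheSketch {
    static final class Entry {
        final long producerId;
        volatile boolean hasOngoingTransaction;
        Entry(long producerId) { this.producerId = producerId; }
    }

    private final int maxEntries;
    // Access-ordered map: iteration starts from the least recently used entry.
    private final LinkedHashMap<Long, Entry> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    ProducerIdCacheSketch(int maxEntries) { this.maxEntries = maxEntries; }

    synchronized void track(Entry entry) {
        entries.put(entry.producerId, entry);
        maybeEvict();
    }

    private void maybeEvict() {
        Iterator<Map.Entry<Long, Entry>> it = entries.entrySet().iterator();
        while (entries.size() > maxEntries && it.hasNext()) {
            Entry candidate = it.next().getValue();
            // The guard: never expire a producer ID with an ongoing transaction.
            if (!candidate.hasOngoingTransaction) {
                it.remove();
            }
        }
    }
}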

My team members and I looked a bit into the throttling case and it can get
a bit tricky since it means we need to throttle the produce request before
it is processed. This means we either block all requests or have to store
the request in memory (which is not great if we are trying to save memory).

I recently had another discussion with my team and it does seem like there
should be a way to make it more clear to the clients what is going on. A
lot of this protocol is implicit. I'm wondering if maybe there is a way to
improve the story for newer clients. (Ie if we choose to expire based on a
size limit, we should include a response indicating the ID has expired.) We
also discussed ways to redefine the guarantees so that users who have
stronger idempotency requirements can ensure them (over availability/memory
concerns). Let me know if you have any ideas here.

Thanks again for commenting here, hopefully we can come to a good solution.

On Tue, Oct 18, 2022 at 1:11 AM Luke Chen  wrote:

> Hi Omnia,
>
> Thanks for your reply.
>
> For (3), you said:
> > - I have some concerns about the impact of this option on the
> transactional
> producers, for example, what will happen to an ongoing transaction
> associated with an expired PID? Would this leave the transactions in a
> "hanging" state?
>
> - How will we notify the client that the transaction can't continue due to
> an expired PID?
>
> - If PID got marked as `expired` this will mean that
> `admin.DescribeProducers` will not list them which will make
> *`kafka-transactions.sh
> --list`* a bit tricky as we can't identify if there are transactions linked
> to this expired PID or not. The same concern applies to
> *`kafka-transactions.sh
> --find-hanging`*.
>
> --> Yes, you're right. Those are also concerns for this solution.
> Currently, there's no way to notify clients about the expiration.
> Also, the ongoing transactions will be hanging. For the admin cli, we've
> never thought about it. Good point.
> In summary, to adopt this solution, there are many issues that need to be
> fixed.
>
>
> For (5), you said:
> > I am assuming you mean KafkaPrincipal here! If so is your concern here
> that
> those good clients that use the same principal as a rogue one will get
> throttled?
>
> If this is the case, then I believe it should be okay as other throttling
> in Kafka on *`/config/users/`* has the same behaviour.
>
>
> What about applying limit/throttling to
> *`/config/users//clients/`
> *similar to what we have with client quota? This should reduce the concern
> about throttling good clients, right?
>
> --> Yes, I mean KafkaPrinciple (sorry, I didn't make it clear)
> Yes, We were thinking about throttling by KafkaPrinciple. Client Id is
> also workable.
> It's just these 2 attributes are not required.
> That is, it's possible we take all clients as the same one: {default
> KafkaPrinciple + default clientID}, and apply throttling on it.
> Do you have any thoughts about it?
> Maybe skip throttling for {default KafkaPrinciple + default clientID} ?
>
> Luke
>
>
>
> On Sat, Oct 15, 2022 at 2:33 AM Omnia Ibrahim 
> wrote:
>
>> Hi Luke & Justine,
>> Thanks for looking into this issue, we have been experiencing this because
>> of rouge clients as well.
>>
>> > 3. Having a limit to the number of active producer IDs (sort of like an
>> LRU
>> >cache)
>> >-> The idea here is that if we hit a misconfigured client, we will expire
>> >the older entries. The concern here is we have risks to lose idempotency
>> >guarantees, and currently, we don't have a way to notify clients about
>> >losing idempotency guarantees. Besides, the least  recently used entries
>> >got removed are not always from the "bad" clients.
>>
>> - I have some concerns about the impact of this option on the
>> transactional
>> producers, for example, what will happen to an ongoing transaction
>> associated with an expired PID? Would this leave the transactions in a
>> "hanging" state?
>>
>> - How will we notify the client that the transaction can't continue due to
>> an expired PID?
>>
>> - If PID got marked as `expired` this will mean that
>> `admin.DescribeProducers` will not list them which will make
>> *`kafka-transactions.sh
>> --list`* a bit tricky as we can't identify if there are transactions
>> linked
>> to this expired PID or not. The same concern applies to
>> *`kafka-transactions.sh
>> --find-hanging`*.
>>
>>
>> >5. limit/throttling the pr

[DISCUSS] KIP-890 Server Side Defense

2022-11-18 Thread Justine Olshan
Hey all!

I'd like to start a discussion on my proposal to add some server-side
checks on transactions to avoid hanging transactions. I know this has been
an issue for some time, so I really hope this KIP will be helpful for many
users of EOS.

The KIP includes changes that will be compatible with old clients and
changes to improve performance and correctness on new clients.

Please take a look and leave any comments you may have!

KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
JIRA: https://issues.apache.org/jira/browse/KAFKA-14402

Thanks!
Justine


Re: [DISCUSS] KIP-890 Server Side Defense

2022-11-22 Thread Justine Olshan
's an idempotent operation?
>
>
> (30)
>
> > To cover older clients, we will ensure a transaction is ongoing before
> we write to a transaction
>
> Not sure what you mean by this? Can you elaborate?
>
>
> (40)
>
> > [the TX-coordinator] will write the prepare commit message with a bumped
> epoch and send WriteTxnMarkerRequests with the bumped epoch.
>
> Why do we use the bumped epoch for both? It seems more intuitive to use
> the current epoch, and only return the bumped epoch to the producer?
>
>
> (50) "Implicit AddPartitionToTransaction"
>
> Why does the implicitly sent request need to be synchronous? The KIP
> also says
>
> > in case we need to abort and need to know which partitions
>
> What do you mean by this?
>
>
> > we don’t want to write to it before we store in the transaction manager
>
> Do you mean TX-coordinator instead of "manager"?
>
>
> (60)
>
> For older clients and ensuring that the TX is ongoing, you describe a
> race condition. I am not sure if I can follow here. Can you elaborate?
>
>
>
> -Matthias
>
>
>
> On 11/18/22 1:21 PM, Justine Olshan wrote:
> > Hey all!
> >
> > I'd like to start a discussion on my proposal to add some server-side
> > checks on transactions to avoid hanging transactions. I know this has
> been
> > an issue for some time, so I really hope this KIP will be helpful for
> many
> > users of EOS.
> >
> > The KIP includes changes that will be compatible with old clients and
> > changes to improve performance and correctness on new clients.
> >
> > Please take a look and leave any comments you may have!
> >
> > KIP:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> > JIRA: https://issues.apache.org/jira/browse/KAFKA-14402
> >
> > Thanks!
> > Justine
> >
>


Re: [DISCUSS] KIP-890 Server Side Defense

2022-11-29 Thread Justine Olshan
r to handle older clients,
> and if there are any changes we are making. Maybe I'm missing something,
> but we would want to identify whether we missed the 0 sequence for older
> clients, no?
>
> 2) Upon returning from the transaction coordinator, we can set the
> transaction
> as ongoing on the leader by populating currentTxnFirstOffset
> through the typical produce request handling.
>
> does the typical produce request path append records to local log along
> with the currentTxnFirstOffset information? I would like to understand
> when the field is written to disk.
>
> Thanks,
> Jeff
>
>
> On Tue, Nov 22, 2022 at 4:44 PM Artem Livshits
>  wrote:
>
> > Hi Justine,
> >
> > Thank you for the KIP.  I have one question.
> >
> > 5) For new clients, we can once again return an error UNKNOWN_PRODUCER_ID
> >
> > I believe we had problems in the past with returning UNKNOWN_PRODUCER_ID
> > because it was considered fatal and required client restart.  It would be
> > good to spell out the new client behavior when it receives the error.
> >
> > -Artem
> >
> > On Tue, Nov 22, 2022 at 10:00 AM Justine Olshan
> >  wrote:
> >
> > > Thanks for taking a look Matthias. I've tried to answer your questions
> > > below:
> > >
> > > 10)
> > >
> > > Right — so the hanging transaction only occurs when we have a late
> > message
> > > come in and the partition is never added to a transaction again. If we
> > > never add the partition to a transaction, we will never write a marker
> > and
> > > never advance the LSO.
> > >
> > > If we do end up adding the partition to the transaction (I suppose this
> > can
> > > happen before or after the late message comes in) then we will include
> > the
> > > late message in the next (incorrect) transaction.
> > >
> > > So perhaps it is clearer to make the distinction between messages that
> > > eventually get added to the transaction (but the wrong one) or messages
> > > that never get added and become hanging.
> > >
> > >
> > > 20)
> > >
> > > The client side change for 2 is removing the addPartitions to
> transaction
> > > call. We don't need to make this from the producer to the txn
> > coordinator,
> > > only server side.
> > >
> > >
> > > In my opinion, the issue with the addPartitionsToTxn call for older
> > clients
> > > is that we don't have the epoch bump, so we don't know if the message
> > > belongs to the previous transaction or this one. We need to check if
> the
> > > partition has been added to this transaction. Of course, this means we
> > > won't completely cover the case where we have a really late message and
> > we
> > > have added the partition to the new transaction, but that's
> unfortunately
> > > something we will need the new clients to cover.
> > >
> > >
> > > 30)
> > >
> > > Transaction is ongoing = partition was added to transaction via
> > > addPartitionsToTxn. We check this with the DescribeTransactions call.
> Let
> > > me know if this wasn't sufficiently explained here:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense#KIP890:TransactionsServerSideDefense-EnsureOngoingTransactionforOlderClients(3)
> > >
> > >
> > > 40)
> > >
> > > The idea here is that if any messages somehow come in before we get the
> > new
> > > epoch to the producer, they will be fenced. However, if we don't think
> > this
> > > is necessary, it can be discussed
> > >
> > >
> > > 50)
> > >
> > > It should be synchronous because if we have an event (ie, an error)
> that
> > > causes us to need to abort the transaction, we need to know which
> > > partitions to send transaction markers to. We know the partitions
> because
> > > we added them to the coordinator via the addPartitionsToTxn call.
> > > Previously we have had asynchronous calls in the past (ie, writing the
> > > commit markers when the transaction is completed) but often this just
> > > causes confusion as we need to wait for some operations to complete. In
> > the
> > > writing commit markers case, clients often see CONCURRENT_TRANSACTIONs
> > > error messages and that can be confusing. For that reason, it may be
> > > simpler to just have s

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-02 Thread Justine Olshan
 the transaction is ongoing, we need to make a
> round
> > trip from the leader partition to the transaction coordinator. In the
> time
> > we are waiting for this message to come back, in theory we could have
> sent
> > a commit/abort call that would make the original result of the check out
> of
> > date. That is why we can check the leader state before we write to the
> log.
>
> Thanks. Got it.
>
> However, is this really an issue? We put the produce request in
> purgatory, so how could we process the `WriteTxnMarkerRequest` first?
> Don't we need to put the `WriteTxnMarkerRequest` into purgatory, too,
> for this case, and process both request in-order? (Again, my broker
> knowledge is limited and maybe we don't maintain request order for this
> case, what seems to be an issue IMHO, and I am wondering if changing
> request handling to preserve order for this case might be the cleaner
> solution?)
>
>
>
> -Matthias
>
>
>
>
> On 11/30/22 3:28 PM, Artem Livshits wrote:
> > Hi Justine,
> >
> > I think the interesting part is not in this logic (because it tries to
> > figure out when UNKNOWN_PRODUCER_ID is retriable and if it's retryable,
> > it's definitely not fatal), but what happens when this logic doesn't
> return
> > 'true' and falls through.  In the old clients it seems to be fatal, if we
> > keep the behavior in the new clients, I'd expect it would be fatal as
> well.
> >
> > -Artem
> >
> > On Tue, Nov 29, 2022 at 11:57 AM Justine Olshan
> >  wrote:
> >
> >> Hi Artem and Jeff,
> >>
> >>
> >> Thanks for taking a look and sorry for the slow response.
> >>
> >> You both mentioned the change to handle UNKNOWN_PRODUCER_ID errors. To
> be
> >> clear — this error code will only be sent again when the client's
> request
> >> version is high enough to ensure we handle it correctly.
> >> The current (Java) client handles this by the following (somewhat long)
> >> code snippet:
> >>
> >> // An UNKNOWN_PRODUCER_ID means that we have lost the producer state on the
> >> // broker. Depending on the log start offset, we may want to retry these, as
> >> // described for each case below. If none of those apply, then for the
> >> // idempotent producer, we will locally bump the epoch and reset the sequence
> >> // numbers of in-flight batches from sequence 0, then retry the failed batch,
> >> // which should now succeed. For the transactional producer, allow the batch
> >> // to fail. When processing the failed batch, we will transition to an
> >> // abortable error and set a flag indicating that we need to bump the epoch
> >> // (if supported by the broker).
> >> if (error == Errors.UNKNOWN_PRODUCER_ID) {
> >>     if (response.logStartOffset == -1) {
> >>         // We don't know the log start offset with this response. We should
> >>         // just retry the request until we get it.
> >>         // The UNKNOWN_PRODUCER_ID error code was added along with the new
> >>         // ProduceResponse which includes the logStartOffset. So the '-1'
> >>         // sentinel is not for backward compatibility. Instead, it is possible
> >>         // for a broker to not know the logStartOffset at when it is returning
> >>         // the response because the partition may have moved away from the
> >>         // broker from the time the error was initially raised to the time the
> >>         // response was being constructed. In these cases, we should just retry
> >>         // the request: we are guaranteed to eventually get a logStartOffset
> >>         // once things settle down.
> >>         return true;
> >>     }
> >>
> >>     if (batch.sequenceHasBeenReset()) {
> >>         // When the first inflight batch fails due to the truncation case, then
> >>         // the sequences of all the other in flight batches would have been
> >>         // restarted from the beginning. However, when those responses come
> >>         // back from the broker, they would also come with an
> >>         // UNKNOWN_PRODUCER_ID error. In this case, we should not

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-06 Thread Justine Olshan
Hi all,
After Artem's questions about error behavior, I've re-evaluated the
unknown producer ID exception and had some discussions offline.

I think generally it makes sense to simplify error handling in cases like
this and the UNKNOWN_PRODUCER_ID error has a pretty long and complicated
history. Because of this, I propose adding a new error code, ABORTABLE_ERROR,
that, when encountered by new clients (gated by the produce request version),
will simply abort the transaction. This allows the server to have some say
in whether the client aborts and makes handling much simpler. In the
future, we can also use this error in other situations where we want to
abort the transaction. We can even use it on other APIs.
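
A minimal sketch of the intended client-side effect, using the existing producer
API (how ABORTABLE_ERROR would surface in the client is an assumption here, not
something settled in this thread): the application catches a non-fatal exception
and aborts, rather than treating the error as fatal.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class AbortableErrorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "example-txn-id");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // Assumption: an ABORTABLE_ERROR from the broker would surface here as a
                // non-fatal exception, so the application aborts and may retry with the
                // same producer. Truly fatal errors (e.g. fencing) are omitted for brevity.
                producer.abortTransaction();
            }
        }
    }
}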

I've added this to the KIP. Let me know if there are any questions or
issues.

Justine

On Fri, Dec 2, 2022 at 10:22 AM Justine Olshan  wrote:

> Hey Matthias,
>
>
> 20/30 — Maybe I also didn't express myself clearly. For older clients we
> don't have a way to distinguish between a previous and the current
> transaction since we don't have the epoch bump. This means that a late
> message from the previous transaction may be added to the new one. With
> older clients — we can't guarantee this won't happen if we already sent the
> addPartitionsToTxn call (why we make changes for the newer client) but we
> can at least gate some by ensuring that the partition has been added to the
> transaction. The rationale here is that there are likely LESS late arrivals
> as time goes on, so hopefully most late arrivals will come in BEFORE the
> addPartitionsToTxn call. Those that arrive before will be properly gated
> with the describeTransactions approach.
>
> If we take the approach you suggested, ANY late arrival from a previous
> transaction will be added. And we don't want that. I also don't see any
> benefit in sending addPartitionsToTxn over the describeTxns call. They will
> both be one extra RPC to the Txn coordinator.
>
>
> To be clear — newer clients will use addPartitionsToTxn instead of the
> DescribeTxns.
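
As a rough sketch of the two verification paths being compared at this point in
the thread (all names below are made up for illustration, not actual broker
internals): old clients are checked against the transaction's current partition
set, while for new clients the leader adds the partition itself.

import java.util.Set;

// Hypothetical stand-in for the partition leader's handle to the transaction coordinator.
interface TxnCoordinatorClient {
    // Old-client path: ask which partitions the ongoing transaction already contains.
    Set<String> describeTransactionPartitions(String transactionalId);
    // New-client path: add (and thereby verify) the partition while handling the produce.
    boolean addPartitionToTransaction(String transactionalId, String topicPartition);
}

class VerificationSketch {
    static boolean okToAppend(String transactionalId, String topicPartition,
                              boolean clientSupportsNewProtocol, TxnCoordinatorClient coordinator) {
        if (clientSupportsNewProtocol) {
            // Newer clients stop sending AddPartitionsToTxn themselves; the partition leader
            // adds the partition on their behalf, and epoch bumps fence truly late messages.
            return coordinator.addPartitionToTransaction(transactionalId, topicPartition);
        }
        // Older clients: append only if the partition was already added to the transaction,
        // which gates late arrivals that show up before the client's AddPartitionsToTxn call.
        return coordinator.describeTransactionPartitions(transactionalId).contains(topicPartition);
    }
}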
>
>
> 40)
> My concern is that if we have some delay in the client to bump the epoch,
> it could continue to send epoch 73 and those records would not be fenced.
> Perhaps this is not an issue if we don't allow the next produce to go
> through before the EndTxn request returns. I'm also thinking about cases of
> failure. I will need to think on this a bit.
>
> I wasn't sure if it was that confusing. But if we think it is, we can
> investigate other ways.
>
>
> 60)
>
> I'm not sure these are the same purgatories since one is a produce
> purgatory (I was planning on using a callback rather than purgatory) and
> the other is simply a request to append to the log. Not sure we have any
> structure here for ordering, but my understanding is that the broker could
> handle the write request before it hears back from the Txn Coordinator.
>
> Let me know if I misunderstood something or something was unclear.
>
> Justine
>
> On Thu, Dec 1, 2022 at 12:15 PM Matthias J. Sax  wrote:
>
>> Thanks for the details Justine!
>>
>> > 20)
>> >
>> > The client side change for 2 is removing the addPartitions to
>> transaction
>> > call. We don't need to make this from the producer to the txn
>> coordinator,
>> > only server side.
>>
>> I think I did not express myself clearly. I understand that we can (and
>> should) change the producer to not send the `addPartitions` request any
>> longer. But I don't thinks it's requirement to change the broker?
>>
>> What I am trying to say is: as a safe-guard and improvement for older
>> producers, the partition leader can just send the `addPartitions`
>> request to the TX-coordinator in any case -- if the old producer
>> correctly did send the `addPartition` request to the TX-coordinator
>> already, the TX-coordinator can just "ignore" is as idempotent. However,
>> if the old producer has a bug and did forget to sent the `addPartition`
>> request, we would now ensure that the partition is indeed added to the
>> TX and thus fix a potential producer bug (even if we don't get the
>> fencing via the bump epoch). -- It seems to be a good improvement? Or is
>> there a reason to not do this?
>>
>>
>>
>> > 30)
>> >
>> > Transaction is ongoing = partition was added to transaction via
>> > addPartitionsToTxn. We check this with the DescribeTransactions call.
>> Let
>> > me know if this wasn't sufficiently explained here:
>>
>> If we do what I propose in (20), we don't really need to make this
>> `DescribeT

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-14 Thread Justine Olshan
Matthias — thanks again for taking the time to look at this. You said:

> My proposal was only focusing to avoid dangling

transactions if records are added without registered partition. -- Maybe

you can add a few more details to the KIP about this scenario for better

documentation purpose?


I'm not sure I understand what you mean here. The motivation section
describes two scenarios about how the record can be added without a
registered partition:


> This can happen when a message gets stuck or delayed due to networking
issues or a network partition, the transaction aborts, and then the delayed
message finally comes in.


> Another way hanging transactions can occur is that a client is buggy and
may somehow try to write to a partition before it adds the partition to the
transaction.



For the first example, would it be helpful to say that this message
comes in after the abort, but before the partition is added to the next
transaction, so it becomes "hanging"? Perhaps the next sentence, describing
the message becoming part of the next transaction (a different case), was
not properly differentiated.



Jun — thanks for reading the KIP.

70. The int typing was a concern. Currently we have a mechanism in place to
fence the final epoch when the epoch is about to overflow and assign a new
producer ID with epoch 0. Of course, this is a bit tricky when it comes to
the response back to the client.
Making this a long could be another option, but I wonder whether there are any
implications to changing this field if the epoch is persisted to disk. I'd
need to check the usages.
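
For readers unfamiliar with that mechanism, a simplified sketch of the idea (the
types and allocator below are illustrative only, not the coordinator's real
code): once the 16-bit epoch reaches its maximum, the old producer ID is retired
at its final epoch and a new ID starts over at epoch 0.

import java.util.concurrent.atomic.AtomicLong;

class EpochOverflowSketch {
    // Stand-in for the coordinator's producer-id allocator.
    private static final AtomicLong NEXT_PRODUCER_ID = new AtomicLong(1000L);

    record ProducerIdAndEpoch(long producerId, short epoch) {}

    static ProducerIdAndEpoch bump(ProducerIdAndEpoch current) {
        if (current.epoch() == Short.MAX_VALUE) {
            // The old id stays "fenced" at its final epoch; any late request carrying it is
            // rejected. A brand new producer id is handed out, starting again at epoch 0.
            return new ProducerIdAndEpoch(NEXT_PRODUCER_ID.getAndIncrement(), (short) 0);
        }
        return new ProducerIdAndEpoch(current.producerId(), (short) (current.epoch() + 1));
    }
}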

71. This was something Matthias asked about as well. I was considering a
possible edge case where a produce request from a new transaction somehow
gets sent right after the marker is written, but before the producer is
alerted of the newly bumped epoch. In this case, we may include this record
when we don't want to. I suppose we could try to do something client side
to bump the epoch after sending an endTxn as well in this scenario — but I
wonder how it would work when the server is aborting based on a server-side
error. I could also be missing something and this scenario is actually not
possible.

Thanks again to everyone reading and commenting. Let me know about any
further questions or comments.

Justine

On Wed, Dec 14, 2022 at 9:41 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the KIP. A couple of comments.
>
> 70. Currently, the producer epoch is an int. I am not sure if it's enough
> to accommodate all transactions in the lifetime of a producer. Should we
> change that to a long or add a new long field like txnId?
>
> 71. "it will write the prepare commit message with a bumped epoch and send
> WriteTxnMarkerRequests with the bumped epoch." Hmm, the epoch is associated
> with the current txn right? So, it seems weird to write a commit message
> with a bumped epoch. Should we only bump up the epoch in EndTxnResponse and
> rename the field to sth like nextProducerEpoch?
>
> Thanks,
>
> Jun
>
>
>
> On Mon, Dec 12, 2022 at 8:54 PM Matthias J. Sax  wrote:
>
> > Thanks for the background.
> >
> > 20/30: SGTM. My proposal was only focusing to avoid dangling
> > transactions if records are added without registered partition. -- Maybe
> > you can add a few more details to the KIP about this scenario for better
> > documentation purpose?
> >
> > 40: I think you hit a fair point about race conditions or client bugs
> > (incorrectly not bumping the epoch). The complexity/confusion for using
> > the bumped epoch I see, is mainly for internal debugging, ie, inspecting
> > log segment dumps -- it seems harder to reason about the system for us
> > humans. But if we get better guarantees, it would be worth to use the
> > bumped epoch.
> >
> > 60: as I mentioned already, I don't know the broker internals to provide
> > more input. So if nobody else chimes in, we should just move forward
> > with your proposal.
> >
> >
> > -Matthias
> >
> >
> > On 12/6/22 4:22 PM, Justine Olshan wrote:
> > > Hi all,
> > > After Artem's questions about error behavior, I've re-evaluated the
> > > unknown producer ID exception and had some discussions offline.
> > >
> > > I think generally it makes sense to simplify error handling in cases
> like
> > > this and the UNKNOWN_PRODUCER_ID error has a pretty long and
> complicated
> > > history. Because of this, I propose adding a new error code
> > ABORTABLE_ERROR
> > > that when encountered by new clients (gated by the produce request
> > version)
> > > will simply abort the transaction. This allows the server to have some
> > say
> > > in whether 

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-14 Thread Justine Olshan
Thanks Matthias, I think we are on the same page.
The concern I had about your solution for old clients is that we can't
distinguish between a late message and a message intended for the new
transaction -- basically any late message can turn into case 1.
I chose to rely on adding partitions to transaction as the determining
factor of whether the record belonged to the transaction to rule out some
of these late message cases. Of course, in some cases we will still run
into case 1 unfortunately if the message is really late (also as you
mention in the last paragraph), but I believed that is ok if we avoid it in
most cases (best effort).

Let me know if you think some part of this is not clear on the original
KIP, and I can better address it there as well.

Thanks again for taking time to think through this with me,
Justine

On Wed, Dec 14, 2022 at 12:24 PM Matthias J. Sax  wrote:

> What I mean is the following:
>
> For both scenarios, late message or missing addPartitionToTxnRequest, a
> record r is written to partition X, but X is not registered at the
> TX-coordinator. Now there are two cases:
>
> (1) A follow up transaction writes more data to the same partition X,
> and r becomes part of the follow up transaction. This is an error
> obviously, but we don't get a hanging transaction.
>
> (2) X is not part of any follow up transaction and thus X starts to
> block consumer reading data.
>
> If we let the partition leader send the addPartitionToTxnRequest to the
> TX-coordinator, scenario (2) always turns into scenario (1) -- at least,
> if there is one more transaction for this producer (what I think we can
> assume). Even if the follow up transaction doesn't write data to X, X
> still becomes part of the TX and X won't hang and won't block consumers
> any longer.
>
> We still end up with not fixing (1) though... Your proposal seems to
> address case (1) in addition to case (2), at least for most cases. There
> is still the race condition (that we cannot fix without the epoch bump)
> that r comes in _very_ late, and the follow up transaction would have
> written more data to X already, and thus X is indeed already registered
> and r would just be added successfully. Of course, the race condition
> window is much smaller, so your proposal is much better than what I had
> in mind.
>
>
> -Matthias
>
> On 12/14/22 10:43 AM, Justine Olshan wrote:
> > Matthias — thanks again for taking time to look a this. You said:
> >
> >> My proposal was only focusing to avoid dangling
> >
> > transactions if records are added without registered partition. -- Maybe
> >
> > you can add a few more details to the KIP about this scenario for better
> >
> > documentation purpose?
> >
> >
> > I'm not sure I understand what you mean here. The motivation section
> > describes two scenarios about how the record can be added without a
> > registered partition:
> >
> >
> >> This can happen when a message gets stuck or delayed due to networking
> > issues or a network partition, the transaction aborts, and then the
> delayed
> > message finally comes in.
> >
> >
> >> Another way hanging transactions can occur is that a client is buggy and
> > may somehow try to write to a partition before it adds the partition to
> the
> > transaction.
> >
> >
> >
> > For the first example of this would it be helpful to say that this
> message
> > comes in after the abort, but before the partition is added to the next
> > transaction so it becomes "hanging." Perhaps the next sentence describing
> > the message becoming part of the next transaction (a different case) was
> > not properly differentiated.
> >
> >
> >
> > Jun — thanks for reading the KIP.
> >
> > 70. The int typing was a concern. Currently we have a mechanism in place
> to
> > fence the final epoch when the epoch is about to overflow and assign a
> new
> > producer ID with epoch 0. Of course, this is a bit tricky when it comes
> to
> > the response back to the client.
> > Making this a long could be another option, but I wonder are there any
> > implications on changing this field if the epoch is persisted to disk?
> I'd
> > need to check the usages.
> >
> > 71.This was something Matthias asked about as well. I was considering a
> > possible edge case where a produce request from a new transaction somehow
> > gets sent right after the marker is written, but before the producer is
> > alerted of the newly bumped epoch. In this case, we may include this
> record
> > when we don't want to. I su

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-16 Thread Justine Olshan
Hi Jun,

Thanks for replying!

70. We already have the overflow mechanism, so my change would just make it
happen more often.
I was also not suggesting a new field in the log, but in the response,
which would be gated by the client version. Sorry if something there is
unclear. I think we are starting to diverge.
The goal of this KIP is to not change the marker format at all.

71. Yes, I guess I was going under the assumption that the log would just
look at its last epoch and treat it as the current epoch. I suppose we could
have some special logic such that, if the last epoch was written by a marker,
we actually expect the next epoch, or something like that. We just need to
distinguish based on whether we had a commit/abort marker.
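
The special logic being discussed might look roughly like this (purely a sketch
with made-up names; the real partition-side state is richer than a single flag):

class EpochCheckSketch {
    // lastEpoch           the epoch of the last entry the partition has for this producer id
    // lastEntryWasMarker  true if that entry was a commit/abort control marker
    static boolean expectedEpoch(short incomingEpoch, short lastEpoch, boolean lastEntryWasMarker) {
        if (lastEntryWasMarker) {
            // The marker ended the previous transaction, so the next data append
            // should arrive with the following (bumped) epoch.
            return incomingEpoch == lastEpoch + 1;
        }
        // Otherwise the producer is still in the same transaction and keeps its current epoch.
        return incomingEpoch == lastEpoch;
    }
}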

72.
> if the producer epoch hasn't been bumped on the
broker, it seems that the stucked message will fail the sequence validation
and will be ignored. If the producer epoch has been bumped, we ignore the
sequence check and the stuck message could be appended to the log. So, is
the latter case that we want to guard?

I'm not sure I follow that "the message will fail the sequence validation".
In some of these cases, we had an abort marker (due to an error) and then
the late message comes in with the correct sequence number. This is a case
covered by the KIP.
The latter case is actually not something we've considered here. I think
generally when we bump the epoch, we are accepting that the sequence does
not need to be checked anymore. My understanding is also that we don't
typically bump epoch mid transaction (based on a quick look at the code)
but let me know if that is the case.

Thanks,
Justine

On Thu, Dec 15, 2022 at 12:23 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> 70. Assigning a new pid on int overflow seems a bit hacky. If we need a txn
> level id, it will be better to model this explicitly. Adding a new field
> would require a bit more work since it requires a new txn marker format in
> the log. So, we probably need to guard it with an IBP or metadata version
> and document the impact on downgrade once the new format is written to the
> log.
>
> 71. Hmm, once the marker is written, the partition will expect the next
> append to be on the next epoch. Does that cover the case you mentioned?
>
> 72. Also, just to be clear on the stucked message issue described in the
> motivation. With EoS, we also validate the sequence id for idempotency. So,
> with the current logic, if the producer epoch hasn't been bumped on the
> broker, it seems that the stucked message will fail the sequence validation
> and will be ignored. If the producer epoch has been bumped, we ignore the
> sequence check and the stuck message could be appended to the log. So, is
> the latter case that we want to guard?
>
> Thanks,
>
> Jun
>
> On Wed, Dec 14, 2022 at 10:44 AM Justine Olshan
>  wrote:
>
> > Matthias — thanks again for taking time to look a this. You said:
> >
> > > My proposal was only focusing to avoid dangling
> >
> > transactions if records are added without registered partition. -- Maybe
> >
> > you can add a few more details to the KIP about this scenario for better
> >
> > documentation purpose?
> >
> >
> > I'm not sure I understand what you mean here. The motivation section
> > describes two scenarios about how the record can be added without a
> > registered partition:
> >
> >
> > > This can happen when a message gets stuck or delayed due to networking
> > issues or a network partition, the transaction aborts, and then the
> delayed
> > message finally comes in.
> >
> >
> > > Another way hanging transactions can occur is that a client is buggy
> and
> > may somehow try to write to a partition before it adds the partition to
> the
> > transaction.
> >
> >
> >
> > For the first example of this would it be helpful to say that this
> message
> > comes in after the abort, but before the partition is added to the next
> > transaction so it becomes "hanging." Perhaps the next sentence describing
> > the message becoming part of the next transaction (a different case) was
> > not properly differentiated.
> >
> >
> >
> > Jun — thanks for reading the KIP.
> >
> > 70. The int typing was a concern. Currently we have a mechanism in place
> to
> > fence the final epoch when the epoch is about to overflow and assign a
> new
> > producer ID with epoch 0. Of course, this is a bit tricky when it comes
> to
> > the response back to the client.
> > Making this a long could be another option, but I wonder are there any
> > implications on changing this field if the epoch is persiste

Re: [DISCUSS] KIP-890 Server Side Defense

2022-12-19 Thread Justine Olshan
Hi Jun,

My understanding of the mechanism is that when we get to the last epoch, we
increment to the fencing/last epoch and if any further requests come in for
this producer ID they are fenced. Then the producer gets a new ID and
restarts with epoch/sequence 0. The fenced epoch sticks around for the
duration of producer.id.expiration.ms and blocks any late messages there.
The new ID will get to take advantage of the improved semantics around
non-zero start sequences. So I think we are covered.

The only potential issue is overloading the cache, but hopefully the
improvements (lowered producer.id.expiration.ms) will help with that. Let
me know if you still have concerns.
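
For reference, the expiration window mentioned above is the broker-side
producer.id.expiration.ms config from KIP-854. A small sketch of tuning it (the
value shown is illustrative, not a recommendation; I believe the default is
86400000 ms, i.e. one day):

import java.util.Properties;

class ProducerIdExpirationSketch {
    public static void main(String[] args) {
        // Broker-side settings (normally placed in server.properties); values here are illustrative.
        Properties brokerConfig = new Properties();
        // Lowering the expiration lets fenced/stale producer ids age out of the broker's
        // producer state cache sooner, at the cost of expiring idle idempotent producers earlier.
        brokerConfig.put("producer.id.expiration.ms", "3600000"); // one hour, for example
        brokerConfig.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}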

Thanks,
Justine

On Mon, Dec 19, 2022 at 10:24 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the explanation.
>
> 70. The proposed fencing logic doesn't apply when pid changes, is that
> right? If so, I am not sure how complete we are addressing this issue if
> the pid changes more frequently.
>
> Thanks,
>
> Jun
>
>
>
> On Fri, Dec 16, 2022 at 9:16 AM Justine Olshan
> 
> wrote:
>
> > Hi Jun,
> >
> > Thanks for replying!
> >
> > 70.We already do the overflow mechanism, so my change would just make it
> > happen more often.
> > I was also not suggesting a new field in the log, but in the response,
> > which would be gated by the client version. Sorry if something there is
> > unclear. I think we are starting to diverge.
> > The goal of this KIP is to not change to the marker format at all.
> >
> > 71. Yes, I guess I was going under the assumption that the log would just
> > look at its last epoch and treat it as the current epoch. I suppose we
> can
> > have some special logic that if the last epoch was on a marker we
> actually
> > expect the next epoch or something like that. We just need to distinguish
> > based on whether we had a commit/abort marker.
> >
> > 72.
> > > if the producer epoch hasn't been bumped on the
> > broker, it seems that the stucked message will fail the sequence
> validation
> > and will be ignored. If the producer epoch has been bumped, we ignore the
> > sequence check and the stuck message could be appended to the log. So, is
> > the latter case that we want to guard?
> >
> > I'm not sure I follow that "the message will fail the sequence
> validation".
> > In some of these cases, we had an abort marker (due to an error) and then
> > the late message comes in with the correct sequence number. This is a
> case
> > covered by the KIP.
> > The latter case is actually not something we've considered here. I think
> > generally when we bump the epoch, we are accepting that the sequence does
> > not need to be checked anymore. My understanding is also that we don't
> > typically bump epoch mid transaction (based on a quick look at the code)
> > but let me know if that is the case.
> >
> > Thanks,
> > Justine
> >
> > On Thu, Dec 15, 2022 at 12:23 PM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the reply.
> > >
> > > 70. Assigning a new pid on int overflow seems a bit hacky. If we need a
> > txn
> > > level id, it will be better to model this explicitly. Adding a new
> field
> > > would require a bit more work since it requires a new txn marker format
> > in
> > > the log. So, we probably need to guard it with an IBP or metadata
> version
> > > and document the impact on downgrade once the new format is written to
> > the
> > > log.
> > >
> > > 71. Hmm, once the marker is written, the partition will expect the next
> > > append to be on the next epoch. Does that cover the case you mentioned?
> > >
> > > 72. Also, just to be clear on the stucked message issue described in
> the
> > > motivation. With EoS, we also validate the sequence id for idempotency.
> > So,
> > > with the current logic, if the producer epoch hasn't been bumped on the
> > > broker, it seems that the stucked message will fail the sequence
> > validation
> > > and will be ignored. If the producer epoch has been bumped, we ignore
> the
> > > sequence check and the stuck message could be appended to the log. So,
> is
> > > the latter case that we want to guard?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Dec 14, 2022 at 10:44 AM Justine Olshan
> > >  wrote:
> > >
> > > > Matthias — thanks again for taking time to look a this. You said:
> > > >
> >

Re: [ANNOUNCE] New committer: Justine Olshan

2022-12-30 Thread Justine Olshan
Thanks all!

Justine

On Fri, Dec 30, 2022 at 8:03 AM Mickael Maison 
wrote:

> Congratulations Justine!
>
> On Fri, Dec 30, 2022 at 6:14 AM Yash Mayya  wrote:
> >
> > Congratulations Justine!
> >
> > On Fri, Dec 30, 2022, 02:28 David Jacot  wrote:
> >
> > > Hi all,
> > >
> > > The PMC of Apache Kafka is pleased to announce a new Kafka committer
> > > Justine
> > > Olshan.
> > >
> > > Justine has been contributing to Kafka since June 2019. She
> contributed 53
> > > PRs including the following KIPs.
> > >
> > > KIP-480: Sticky Partitioner
> > > KIP-516: Topic Identifiers & Topic Deletion State Improvements
> > > KIP-854: Separate configuration for producer ID expiry
> > > KIP-890: Transactions Server-Side Defense (in progress)
> > >
> > > Congratulations, Justine!
> > >
> > > Thanks,
> > >
> > > David (on behalf of the Apache Kafka PMC)
> > >
>


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-06 Thread Justine Olshan
Hey Jason -- thanks for the input -- I was just digging a bit deeper into
the design + implementation of the validation calls here and what you say
makes sense.

I was wondering about compatibility here. When we send requests
between brokers, we want to ensure that the receiving broker understands
the request (specifically the new fields). Typically this is done via
IBP/metadata version.
I'm trying to think if there is a way around it but I'm not sure there is.

As for the improvements -- can you clarify how the multiple transactional
IDs would help here? Were you thinking of a case where we wait/batch
multiple produce requests together? My understanding for now was one
transactional ID and one validation per produce request.

Finally with respect to the authorizations, I think it makes sense to skip
topic authorizations, but I'm a bit confused by the "leader ID" field.
Wouldn't we just want to flag the request as from a broker (does it matter
which one?).

I think I want to adopt these suggestions, just had a few questions on the
details.

Thanks,
Justine

On Thu, Jan 5, 2023 at 5:05 PM Jason Gustafson 
wrote:

> Hi Justine,
>
> Thanks for the proposal.
>
> I was thinking about the implementation a little bit. In the current
> proposal, the behavior depends on whether we have an old or new client. For
> old clients, we send `DescribeTransactions` and verify the result and for
> new clients, we send `AddPartitionsToTxn`. We might be able to simplify the
> implementation if we can use the same request type. For example, what if we
> bump the protocol version for `AddPartitionsToTxn` and add a `validateOnly`
> flag? For older versions, we can set `validateOnly=true` so that the
> request only returns successfully if the partition had already been added.
> For new versions, we can set `validateOnly=false` and the partition will be
> added to the transaction. The other slightly annoying thing that this would
> get around is the need to collect the transaction state for all partitions
> even when we only care about a subset.
>
> Some additional improvements to consider:
>
> - We can give `AddPartitionsToTxn` better batch support for inter-broker
> usage. Currently we only allow one `TransactionalId` to be specified, but
> the broker may get some benefit being able to batch across multiple
> transactions.
> - Another small improvement is skipping topic authorization checks for
> `AddPartitionsToTxn` when the request is from a broker. Perhaps we can add
> a field for the `LeaderId` or something like that and require CLUSTER
> permission when set.
>
> Best,
> Jason
>
>
>
> On Mon, Dec 19, 2022 at 3:56 PM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > Thanks for the explanation. It makes sense to me now.
> >
> > Jun
> >
> > On Mon, Dec 19, 2022 at 1:42 PM Justine Olshan
> > 
> > wrote:
> >
> > > Hi Jun,
> > >
> > > My understanding of the mechanism is that when we get to the last
> epoch,
> > we
> > > increment to the fencing/last epoch and if any further requests come in
> > for
> > > this producer ID they are fenced. Then the producer gets a new ID and
> > > restarts with epoch/sequence 0. The fenced epoch sticks around for the
> > > duration of producer.id.expiration.ms and blocks any late messages
> > there.
> > > The new ID will get to take advantage of the improved semantics around
> > > non-zero start sequences. So I think we are covered.
> > >
> > > The only potential issue is overloading the cache, but hopefully the
> > > improvements (lowered producer.id.expiration.ms) will help with that.
> > Let
> > > me know if you still have concerns.
> > >
> > > Thanks,
> > > Justine
> > >
> > > On Mon, Dec 19, 2022 at 10:24 AM Jun Rao 
> > wrote:
> > >
> > > > Hi, Justine,
> > > >
> > > > Thanks for the explanation.
> > > >
> > > > 70. The proposed fencing logic doesn't apply when pid changes, is
> that
> > > > right? If so, I am not sure how complete we are addressing this issue
> > if
> > > > the pid changes more frequently.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Fri, Dec 16, 2022 at 9:16 AM Justine Olshan
> > > > 
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thanks for replying!
> > > > >
> > > > > 70.We already do the overflow mechanism, so my change would just
> m

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-06 Thread Justine Olshan
As a follow up, I was just thinking about the batching a bit more.
I suppose if we have one request in flight and we queue up the other
produce requests in some sort of purgatory, we could send information out
for all of them rather than one by one. So that would be a benefit of
batching partitions to add per transaction.
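
A rough sketch of the queueing idea (hypothetical helper, not broker code):
while one verification round trip is in flight, later produce requests park
their partitions, and everything queued is drained into a single batched
AddPartitionsToTxn request per transactional ID.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.common.TopicPartition;

class PendingAddPartitionsSketch {
    // Partitions waiting to be verified/added, keyed by transactional id.
    private final Map<String, Set<TopicPartition>> pending = new HashMap<>();

    synchronized void park(String transactionalId, TopicPartition partition) {
        pending.computeIfAbsent(transactionalId, id -> new HashSet<>()).add(partition);
    }

    // Called when the in-flight request completes: drain everything queued so far so it can
    // be sent to the transaction coordinator as one batched AddPartitionsToTxn request.
    synchronized Map<String, Set<TopicPartition>> drain() {
        Map<String, Set<TopicPartition>> batch = new HashMap<>(pending);
        pending.clear();
        return batch;
    }
}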

I'll need to think a bit more on the design of this part of the KIP, and
will update the KIP in the next few days.

Thanks,
Justine

On Fri, Jan 6, 2023 at 10:22 AM Justine Olshan  wrote:

> Hey Jason -- thanks for the input -- I was just digging a bit deeper into
> the design + implementation of the validation calls here and what you say
> makes sense.
>
> I was wondering about compatibility here. When we send requests
> between brokers, we want to ensure that the receiving broker understands
> the request (specifically the new fields). Typically this is done via
> IBP/metadata version.
> I'm trying to think if there is a way around it but I'm not sure there is.
>
> As for the improvements -- can you clarify how the multiple transactional
> IDs would help here? Were you thinking of a case where we wait/batch
> multiple produce requests together? My understanding for now was 1
> transactional ID and one validation per 1 produce request.
>
> Finally with respect to the authorizations, I think it makes sense to skip
> topic authorizations, but I'm a bit confused by the "leader ID" field.
> Wouldn't we just want to flag the request as from a broker (does it matter
> which one?).
>
> I think I want to adopt these suggestions, just had a few questions on the
> details.
>
> Thanks,
> Justine
>
> On Thu, Jan 5, 2023 at 5:05 PM Jason Gustafson 
> wrote:
>
>> Hi Justine,
>>
>> Thanks for the proposal.
>>
>> I was thinking about the implementation a little bit. In the current
>> proposal, the behavior depends on whether we have an old or new client.
>> For
>> old clients, we send `DescribeTransactions` and verify the result and for
>> new clients, we send `AddPartitionsToTxn`. We might be able to simplify
>> the
>> implementation if we can use the same request type. For example, what if
>> we
>> bump the protocol version for `AddPartitionsToTxn` and add a
>> `validateOnly`
>> flag? For older versions, we can set `validateOnly=true` so that the
>> request only returns successfully if the partition had already been added.
>> For new versions, we can set `validateOnly=false` and the partition will
>> be
>> added to the transaction. The other slightly annoying thing that this
>> would
>> get around is the need to collect the transaction state for all partitions
>> even when we only care about a subset.
>>
>> Some additional improvements to consider:
>>
>> - We can give `AddPartitionsToTxn` better batch support for inter-broker
>> usage. Currently we only allow one `TransactionalId` to be specified, but
>> the broker may get some benefit being able to batch across multiple
>> transactions.
>> - Another small improvement is skipping topic authorization checks for
>> `AddPartitionsToTxn` when the request is from a broker. Perhaps we can add
>> a field for the `LeaderId` or something like that and require CLUSTER
>> permission when set.
>>
>> Best,
>> Jason
>>
>>
>>
>> On Mon, Dec 19, 2022 at 3:56 PM Jun Rao  wrote:
>>
>> > Hi, Justine,
>> >
>> > Thanks for the explanation. It makes sense to me now.
>> >
>> > Jun
>> >
>> > On Mon, Dec 19, 2022 at 1:42 PM Justine Olshan
>> > 
>> > wrote:
>> >
>> > > Hi Jun,
>> > >
>> > > My understanding of the mechanism is that when we get to the last
>> epoch,
>> > we
>> > > increment to the fencing/last epoch and if any further requests come
>> in
>> > for
>> > > this producer ID they are fenced. Then the producer gets a new ID and
>> > > restarts with epoch/sequence 0. The fenced epoch sticks around for the
>> > > duration of producer.id.expiration.ms and blocks any late messages
>> > there.
>> > > The new ID will get to take advantage of the improved semantics around
>> > > non-zero start sequences. So I think we are covered.
>> > >
>> > > The only potential issue is overloading the cache, but hopefully the
>> > > improvements (lowered producer.id.expiration.ms) will help with that.
>> > Let
>> > > me know if you still have concerns.
>> > >
>> > > Thanks,
>> > > Justine
&

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-06 Thread Justine Olshan
Thanks Jason. Those changes make sense to me. I will update the KIP.



On Fri, Jan 6, 2023 at 3:31 PM Jason Gustafson 
wrote:

> Hey Justine,
>
> > I was wondering about compatibility here. When we send requests
> between brokers, we want to ensure that the receiving broker understands
> the request (specifically the new fields). Typically this is done via
> IBP/metadata version.
> I'm trying to think if there is a way around it but I'm not sure there is.
>
> Yes. I think we would gate usage of this behind an IBP bump. Does that seem
> reasonable?
>
> > As for the improvements -- can you clarify how the multiple transactional
> IDs would help here? Were you thinking of a case where we wait/batch
> multiple produce requests together? My understanding for now was 1
> transactional ID and one validation per 1 produce request.
>
> Each call to `AddPartitionsToTxn` is essentially a write to the transaction
> log and must block on replication. The more we can fit into a single
> request, the more writes we can do in parallel. The alternative is to make
> use of more connections, but usually we prefer batching since the network
> stack is not really optimized for high connection/request loads.
>
> > Finally with respect to the authorizations, I think it makes sense to
> skip
> topic authorizations, but I'm a bit confused by the "leader ID" field.
> Wouldn't we just want to flag the request as from a broker (does it matter
> which one?).
>
> We could also make it version-based. For the next version, we could require
> CLUSTER auth. So clients would not be able to use the API anymore, which is
> probably what we want.
>
> -Jason
>
> On Fri, Jan 6, 2023 at 10:43 AM Justine Olshan
> 
> wrote:
>
> > As a follow up, I was just thinking about the batching a bit more.
> > I suppose if we have one request in flight and we queue up the other
> > produce requests in some sort of purgatory, we could send information out
> > for all of them rather than one by one. So that would be a benefit of
> > batching partitions to add per transaction.
> >
> > I'll need to think a bit more on the design of this part of the KIP, and
> > will update the KIP in the next few days.
> >
> > Thanks,
> > Justine
> >
> > On Fri, Jan 6, 2023 at 10:22 AM Justine Olshan 
> > wrote:
> >
> > > Hey Jason -- thanks for the input -- I was just digging a bit deeper
> into
> > > the design + implementation of the validation calls here and what you
> say
> > > makes sense.
> > >
> > > I was wondering about compatibility here. When we send requests
> > > between brokers, we want to ensure that the receiving broker
> understands
> > > the request (specifically the new fields). Typically this is done via
> > > IBP/metadata version.
> > > I'm trying to think if there is a way around it but I'm not sure there
> > is.
> > >
> > > As for the improvements -- can you clarify how the multiple
> transactional
> > > IDs would help here? Were you thinking of a case where we wait/batch
> > > multiple produce requests together? My understanding for now was 1
> > > transactional ID and one validation per 1 produce request.
> > >
> > > Finally with respect to the authorizations, I think it makes sense to
> > skip
> > > topic authorizations, but I'm a bit confused by the "leader ID" field.
> > > Wouldn't we just want to flag the request as from a broker (does it
> > matter
> > > which one?).
> > >
> > > I think I want to adopt these suggestions, just had a few questions on
> > the
> > > details.
> > >
> > > Thanks,
> > > Justine
> > >
> > > On Thu, Jan 5, 2023 at 5:05 PM Jason Gustafson
> > 
> > > wrote:
> > >
> > >> Hi Justine,
> > >>
> > >> Thanks for the proposal.
> > >>
> > >> I was thinking about the implementation a little bit. In the current
> > >> proposal, the behavior depends on whether we have an old or new
> client.
> > >> For
> > >> old clients, we send `DescribeTransactions` and verify the result and
> > for
> > >> new clients, we send `AddPartitionsToTxn`. We might be able to
> simplify
> > >> the
> > >> implementation if we can use the same request type. For example, what
> if
> > >> we
> > >> bump the protocol version for `AddPartitionsToTxn` and add a
> > >> `validateOnly`
> > >&g

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-09 Thread Justine Olshan
Hey all, I've updated the KIP to incorporate Jason's suggestions.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense


1. Use AddPartitionsToTxn + verify flag to check on old clients
2. Updated AddPartitionsToTxn API to support transaction batching
3. Mention IBP bump
4. Mention auth change on new AddPartitionsToTxn version.

I'm planning on opening a vote soon.
Thanks,
Justine

On Fri, Jan 6, 2023 at 3:32 PM Justine Olshan  wrote:

> Thanks Jason. Those changes make sense to me. I will update the KIP.
>
>
>
> On Fri, Jan 6, 2023 at 3:31 PM Jason Gustafson 
> wrote:
>
>> Hey Justine,
>>
>> > I was wondering about compatibility here. When we send requests
>> between brokers, we want to ensure that the receiving broker understands
>> the request (specifically the new fields). Typically this is done via
>> IBP/metadata version.
>> I'm trying to think if there is a way around it but I'm not sure there is.
>>
>> Yes. I think we would gate usage of this behind an IBP bump. Does that
>> seem
>> reasonable?
>>
>> > As for the improvements -- can you clarify how the multiple
>> transactional
>> IDs would help here? Were you thinking of a case where we wait/batch
>> multiple produce requests together? My understanding for now was 1
>> transactional ID and one validation per 1 produce request.
>>
>> Each call to `AddPartitionsToTxn` is essentially a write to the
>> transaction
>> log and must block on replication. The more we can fit into a single
>> request, the more writes we can do in parallel. The alternative is to make
>> use of more connections, but usually we prefer batching since the network
>> stack is not really optimized for high connection/request loads.
>>
>> > Finally with respect to the authorizations, I think it makes sense to
>> skip
>> topic authorizations, but I'm a bit confused by the "leader ID" field.
>> Wouldn't we just want to flag the request as from a broker (does it matter
>> which one?).
>>
>> We could also make it version-based. For the next version, we could
>> require
>> CLUSTER auth. So clients would not be able to use the API anymore, which
>> is
>> probably what we want.
>>
>> -Jason
>>
>> On Fri, Jan 6, 2023 at 10:43 AM Justine Olshan
>> 
>> wrote:
>>
>> > As a follow up, I was just thinking about the batching a bit more.
>> > I suppose if we have one request in flight and we queue up the other
>> > produce requests in some sort of purgatory, we could send information
>> out
>> > for all of them rather than one by one. So that would be a benefit of
>> > batching partitions to add per transaction.
>> >
>> > I'll need to think a bit more on the design of this part of the KIP, and
>> > will update the KIP in the next few days.
>> >
>> > Thanks,
>> > Justine
>> >
>> > On Fri, Jan 6, 2023 at 10:22 AM Justine Olshan 
>> > wrote:
>> >
>> > > Hey Jason -- thanks for the input -- I was just digging a bit deeper
>> into
>> > > the design + implementation of the validation calls here and what you
>> say
>> > > makes sense.
>> > >
>> > > I was wondering about compatibility here. When we send requests
>> > > between brokers, we want to ensure that the receiving broker
>> understands
>> > > the request (specifically the new fields). Typically this is done via
>> > > IBP/metadata version.
>> > > I'm trying to think if there is a way around it but I'm not sure there
>> > is.
>> > >
>> > > As for the improvements -- can you clarify how the multiple
>> transactional
>> > > IDs would help here? Were you thinking of a case where we wait/batch
>> > > multiple produce requests together? My understanding for now was 1
>> > > transactional ID and one validation per 1 produce request.
>> > >
>> > > Finally with respect to the authorizations, I think it makes sense to
>> > skip
>> > > topic authorizations, but I'm a bit confused by the "leader ID" field.
>> > > Wouldn't we just want to flag the request as from a broker (does it
>> > matter
>> > > which one?).
>> > >
>> > > I think I want to adopt these suggestions, just had a few questions on
>> > the
>> > > details.
>> > >
>> > > Thanks,
>> > > Justine
>&

[VOTE] KIP-890: Transactions Server Side Defense

2023-01-10 Thread Justine Olshan
Hi everyone,

I would like to start a vote on KIP-890 which aims to prevent some of the
common causes of hanging transactions and make other general improvements
to transactions in Kafka.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense

Please take a look if you haven't already and vote!

Justine


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-12 Thread Justine Olshan
Thanks for the discussion Artem.

With respect to the handling of fenced producers, we have some behavior
already in place. As of KIP-588:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts,
we handle timeouts more gracefully. The producer can recover.

Produce requests can also recover from epoch fencing by aborting the
transaction and starting over.

What other cases were you considering that would cause us to have a fenced
epoch but we'd want to recover?

The first point about handling epoch overflows is fair. I think there is
some logic we'd need to consider. (ie, if we are one away from the max
epoch, we need to reset the producer ID.) I'm still wondering if there is a
way to direct this from the response, or if everything should be done on
the client side. Let me know if you have any thoughts here.

Thanks,
Justine

On Tue, Jan 10, 2023 at 4:06 PM Artem Livshits
 wrote:

> There are some workflows in the client that are implied by protocol
> changes, e.g.:
>
> - for new clients, epoch changes with every transaction and can overflow,
> in old clients this condition was handled transparently, because epoch was
> bumped in InitProducerId and it would return a new producer id if epoch
> overflows, the new clients would need to implement some workflow to refresh
> producer id
> - how to handle fenced producers, for new clients epoch changes with every
> transaction, so in presence of failures during commits / aborts, the
> producer could get easily fenced, old clients would pretty much would get
> fenced when a new incarnation of the producer was initialized with
> InitProducerId so it's ok to treat as a fatal error, the new clients would
> need to implement some workflow to handle that error, otherwise they could
> get fenced by themselves
> - in particular (as a subset of the previous issue), what would the client
> do if it got a timeout during commit?  commit could've succeeded or failed
>
> Not sure if this has to be defined in the KIP as implementing those
> probably wouldn't require protocol changes, but we have multiple
> implementations of Kafka clients, so probably would be good to have some
> client implementation guidance.  Could also be done as a separate doc.
>
> -Artem
>
> On Mon, Jan 9, 2023 at 3:38 PM Justine Olshan  >
> wrote:
>
> > Hey all, I've updated the KIP to incorporate Jason's suggestions.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> >
> >
> > 1. Use AddPartitionsToTxn + verify flag to check on old clients
> > 2. Updated AddPartitionsToTxn API to support transaction batching
> > 3. Mention IBP bump
> > 4. Mention auth change on new AddPartitionsToTxn version.
> >
> > I'm planning on opening a vote soon.
> > Thanks,
> > Justine
> >
> > On Fri, Jan 6, 2023 at 3:32 PM Justine Olshan 
> > wrote:
> >
> > > Thanks Jason. Those changes make sense to me. I will update the KIP.
> > >
> > >
> > >
> > > On Fri, Jan 6, 2023 at 3:31 PM Jason Gustafson
> > 
> > > wrote:
> > >
> > >> Hey Justine,
> > >>
> > >> > I was wondering about compatibility here. When we send requests
> > >> between brokers, we want to ensure that the receiving broker
> understands
> > >> the request (specifically the new fields). Typically this is done via
> > >> IBP/metadata version.
> > >> I'm trying to think if there is a way around it but I'm not sure there
> > is.
> > >>
> > >> Yes. I think we would gate usage of this behind an IBP bump. Does that
> > >> seem
> > >> reasonable?
> > >>
> > >> > As for the improvements -- can you clarify how the multiple
> > >> transactional
> > >> IDs would help here? Were you thinking of a case where we wait/batch
> > >> multiple produce requests together? My understanding for now was 1
> > >> transactional ID and one validation per 1 produce request.
> > >>
> > >> Each call to `AddPartitionsToTxn` is essentially a write to the
> > >> transaction
> > >> log and must block on replication. The more we can fit into a single
> > >> request, the more writes we can do in parallel. The alternative is to
> > make
> > >> use of more connections, but usually we prefer batching since the
> > network
> > >> stack is not really optimized for high connection/request loads.
> > >>
> > >> > Finally 

Re: [ANNOUNCE] New committer: Stanislav Kozlovski

2023-01-17 Thread Justine Olshan
Congrats Stan! Well deserved!

On Tue, Jan 17, 2023 at 7:50 AM Jun Rao  wrote:

> Hi, Everyone,
>
> The PMC of Apache Kafka is pleased to announce a new Kafka committer
> Stanislav Kozlovski.
>
> Stan has been contributing to Apache Kafka since June 2018. He made various
> contributions including the following KIPs.
>
> KIP-455: Create an Administrative API for Replica Reassignment
> KIP-412: Extend Admin API to support dynamic application log levels
>
> Congratulations, Stan!
>
> Thanks,
>
> Jun (on behalf of the Apache Kafka PMC)
>


Re: Reviews for KAFKA-13668

2023-01-17 Thread Justine Olshan
Hey Philip,
Thanks for the PR. I had left some comments a little while back and I think
most were resolved but I was curious if there was any further thought on
this one:
https://github.com/apache/kafka/pull/12149#discussion_r1007207887

I can take another look at the PR soon. :)
Justine

On Tue, Jan 17, 2023 at 10:11 AM Philip Nee  wrote:

> Hey committers,
>
> Trying to get some attention to the KAFKA-13668. Is anyone familiar with
> the producer code able to provide feedback for the issue KAFKA-13668
> ?  In a nutshell, it allows
> producers to retry upon receiving authorization errors w/o needing to be
> instantiated.
>
> Thanks!
> P
>


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-18 Thread Justine Olshan
Hey all, just wanted to quickly update and say I've modified the KIP to
explicitly mention that AddOffsetCommitsToTxnRequest will be replaced by a
coordinator-side (inter-broker) AddPartitionsToTxn implicit request. This
mirrors the user partitions and will implicitly add offset partitions to
transactions when we commit offsets on them. We will deprecate
AddOffsetCommitsToTxnRequest
for new clients.
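
From the application's point of view nothing changes; a consume-transform-produce
loop keeps calling sendOffsetsToTransaction, and only the server-side plumbing
behind it moves. A minimal usage sketch (topic and type parameters are placeholders):

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerGroupMetadata;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.TopicPartition;

class OffsetsInTransactionSketch {
    static void commitWithOffsets(KafkaProducer<String, String> producer,
                                  Map<TopicPartition, OffsetAndMetadata> offsets,
                                  ConsumerGroupMetadata groupMetadata) {
        producer.beginTransaction();
        // ... send transformed records here ...
        // The consumed offsets become part of the same transaction; with this change the offset
        // partition is added to the transaction on the coordinator side rather than via a
        // separate client-issued request.
        producer.sendOffsetsToTransaction(offsets, groupMetadata);
        producer.commitTransaction();
    }
}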

Also to address Artem's comments --
I'm a bit unsure if the changes here will change the previous behavior for
fencing producers. In the case you mention in the first paragraph, are you
saying we bump the epoch before we try to abort the transaction? I think I
need to understand the scenarios you mention a bit better.

As for the second part -- I think it makes sense to have some sort of
"sentinel" epoch to signal epoch is about to overflow (I think we sort of
have this value in place in some ways) so we can codify it in the KIP. I'll
look into that and try to update soon.

Thanks,
Justine.

On Fri, Jan 13, 2023 at 5:01 PM Artem Livshits
 wrote:

> It's good to know that KIP-588 addressed some of the issues.  Looking at
> the code, it still looks like there are some cases that would result in
> fatal error, e.g. PRODUCER_FENCED is issued by the transaction coordinator
> if epoch doesn't match, and the client treats it as a fatal error (code in
> TransactionManager request handling).  If we consider, for example,
> committing a transaction that returns a timeout, but actually succeeds,
> trying to abort it or re-commit may result in PRODUCER_FENCED error
> (because of epoch bump).
>
> For failed commits, specifically, we need to know the actual outcome,
> because if we return an error the application may think that the
> transaction is aborted and redo the work, leading to duplicates.
>
> Re: overflowing epoch.  We could either do it on the TC and return both
> producer id and epoch (e.g. change the protocol), or signal the client that
> it needs to get a new producer id.  Checking for max epoch could be a
> reasonable signal, the value to check should probably be present in the KIP
> as this is effectively a part of the contract.  Also, the TC should
> probably return an error if the client didn't change producer id after
> hitting max epoch.
>
> -Artem
>
>
> On Thu, Jan 12, 2023 at 10:31 AM Justine Olshan
>  wrote:
>
> > Thanks for the discussion Artem.
> >
> > With respect to the handling of fenced producers, we have some behavior
> > already in place. As of KIP-588:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> > ,
> > we handle timeouts more gracefully. The producer can recover.
> >
> > Produce requests can also recover from epoch fencing by aborting the
> > transaction and starting over.
> >
> > What other cases were you considering that would cause us to have a
> fenced
> > epoch but we'd want to recover?
> >
> > The first point about handling epoch overflows is fair. I think there is
> > some logic we'd need to consider. (ie, if we are one away from the max
> > epoch, we need to reset the producer ID.) I'm still wondering if there
> is a
> > way to direct this from the response, or if everything should be done on
> > the client side. Let me know if you have any thoughts here.
> >
> > Thanks,
> > Justine
> >
> > On Tue, Jan 10, 2023 at 4:06 PM Artem Livshits
> >  wrote:
> >
> > > There are some workflows in the client that are implied by protocol
> > > changes, e.g.:
> > >
> > > - for new clients, epoch changes with every transaction and can
> overflow,
> > > in old clients this condition was handled transparently, because epoch
> > was
> > > bumped in InitProducerId and it would return a new producer id if epoch
> > > overflows, the new clients would need to implement some workflow to
> > refresh
> > > producer id
> > > - how to handle fenced producers, for new clients epoch changes with
> > every
> > > transaction, so in presence of failures during commits / aborts, the
> > > producer could get easily fenced, old clients would pretty much would
> get
> > > fenced when a new incarnation of the producer was initialized with
> > > InitProducerId so it's ok to treat as a fatal error, the new clients
> > would
> > > need to implement some workflow to handle that error, otherwise they
> > could
> > > get fenced by themselves
> > > - in particular (as a subset of the previous issue), what would the
> > client
> > >

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-18 Thread Justine Olshan
Yeah -- looks like we already have code to handle bumping the epoch and
when the epoch is Short.MAX_VALUE, we get a new producer ID. Since this is
already the behavior, do we want to change it further?

Justine

On Wed, Jan 18, 2023 at 1:12 PM Justine Olshan  wrote:

> Hey all, just wanted to quickly update and say I've modified the KIP to
> explicitly mention that AddOffsetCommitsToTxnRequest will be replaced by
> a coordinator-side (inter-broker) AddPartitionsToTxn implicit request. This
> mirrors the user partitions and will implicitly add offset partitions to
> transactions when we commit offsets on them. We will deprecate 
> AddOffsetCommitsToTxnRequest
> for new clients.
>
> Also to address Artem's comments --
> I'm a bit unsure if the changes here will change the previous behavior for
> fencing producers. In the case you mention in the first paragraph, are you
> saying we bump the epoch before we try to abort the transaction? I think I
> need to understand the scenarios you mention a bit better.
>
> As for the second part -- I think it makes sense to have some sort of
> "sentinel" epoch to signal epoch is about to overflow (I think we sort of
> have this value in place in some ways) so we can codify it in the KIP. I'll
> look into that and try to update soon.
>
> Thanks,
> Justine.
>
> On Fri, Jan 13, 2023 at 5:01 PM Artem Livshits
>  wrote:
>
>> It's good to know that KIP-588 addressed some of the issues.  Looking at
>> the code, it still looks like there are some cases that would result in
>> fatal error, e.g. PRODUCER_FENCED is issued by the transaction coordinator
>> if epoch doesn't match, and the client treats it as a fatal error (code in
>> TransactionManager request handling).  If we consider, for example,
>> committing a transaction that returns a timeout, but actually succeeds,
>> trying to abort it or re-commit may result in PRODUCER_FENCED error
>> (because of epoch bump).
>>
>> For failed commits, specifically, we need to know the actual outcome,
>> because if we return an error the application may think that the
>> transaction is aborted and redo the work, leading to duplicates.
>>
>> Re: overflowing epoch.  We could either do it on the TC and return both
>> producer id and epoch (e.g. change the protocol), or signal the client
>> that
>> it needs to get a new producer id.  Checking for max epoch could be a
>> reasonable signal, the value to check should probably be present in the
>> KIP
>> as this is effectively a part of the contract.  Also, the TC should
>> probably return an error if the client didn't change producer id after
>> hitting max epoch.
>>
>> -Artem
>>
>>
>> On Thu, Jan 12, 2023 at 10:31 AM Justine Olshan
>>  wrote:
>>
>> > Thanks for the discussion Artem.
>> >
>> > With respect to the handling of fenced producers, we have some behavior
>> > already in place. As of KIP-588:
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
>> > ,
>> > we handle timeouts more gracefully. The producer can recover.
>> >
>> > Produce requests can also recover from epoch fencing by aborting the
>> > transaction and starting over.
>> >
>> > What other cases were you considering that would cause us to have a
>> fenced
>> > epoch but we'd want to recover?
>> >
>> > The first point about handling epoch overflows is fair. I think there is
>> > some logic we'd need to consider. (ie, if we are one away from the max
>> > epoch, we need to reset the producer ID.) I'm still wondering if there
>> is a
>> > way to direct this from the response, or if everything should be done on
>> > the client side. Let me know if you have any thoughts here.
>> >
>> > Thanks,
>> > Justine
>> >
>> > On Tue, Jan 10, 2023 at 4:06 PM Artem Livshits
>> >  wrote:
>> >
>> > > There are some workflows in the client that are implied by protocol
>> > > changes, e.g.:
>> > >
>> > > - for new clients, the epoch changes with every transaction and can
>> > > overflow; in old clients this condition was handled transparently,
>> > > because the epoch was bumped in InitProducerId and it would return a
>> > > new producer id if the epoch overflowed. The new clients would need to
>> > > implement some workflow to refresh the producer id
>> > > - how to ha

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Justine Olshan
> produce-request finally received by the partition leader, before this
> KIP e) step would be accepted causing a dangling txn; now it would be
> rejected in step e) which is good. But from the client's point of view
> now it becomes confusing since the `commitTransaction()` returns
> successfully, but the "future" throws an invalid-epoch error, and they
> are not sure if the transaction did succeed or not. In fact, it
> "partially succeeded" with some msgs being rejected but others
> committed successfully.
>
> Of course the easy way to avoid this is, always call
> "producer.flush()" before commitTxn and that's what we do ourselves,
> and what we recommend users do. But I suspect not everyone does it. In
> fact I just checked the javadoc in KafkaProducer and our code snippet
> does not include a `flush()` call. So I'm thinking maybe we can enforce
> flushing inside the `commitTxn` code before sending the end-txn request.
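
A minimal sketch of the pattern being suggested -- flush before
commitTransaction so that produce errors surface before the end-txn request is
sent; the topic, transactional.id, and serializer settings are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FlushBeforeCommit {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                // Block until all in-flight sends complete so any produce error
                // is seen before the EndTxn request goes out.
                producer.flush();
                producer.commitTransaction();
            }
        }
    }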
>
> 2. I'd like to clarify a bit details on "just add partitions to the
> transaction on the first produce request during a transaction". My
> understanding is that the partition leader's cache has the producer id
> / sequence / epoch for the latest txn, either on-going or is completed
> (upon receiving the marker request from coordinator). When a produce
> request is received, if
>
> * producer's epoch < cached epoch, or producer's epoch == cached epoch
> but the latest txn is completed, leader directly reject with
> invalid-epoch.
> * producer's epoch > cached epoch, park the request and send
> add-partitions request to coordinator.
>
> In order to do it, does the coordinator need to bump the sequence and
> reset epoch to 0 when the next epoch is going to overflow? If no need
> to do so, then how we handle the (admittedly rare, but still may
> happen) epoch overflow situation?
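
A rough sketch of the leader-side check described above (names are illustrative
and this is not actual broker code; the epoch-overflow question is separate
from this logic):

    // Illustrative only: mirrors the three cases described in the email.
    final class LeaderEpochCheckSketch {
        record CachedTxnState(short epoch, boolean txnCompleted) {}

        enum Decision { REJECT_INVALID_EPOCH, VERIFY_WITH_COORDINATOR, APPEND }

        static Decision onProduce(short producerEpoch, CachedTxnState cached) {
            if (producerEpoch < cached.epoch()
                    || (producerEpoch == cached.epoch() && cached.txnCompleted())) {
                // Stale epoch, or the latest transaction already completed.
                return Decision.REJECT_INVALID_EPOCH;
            }
            if (producerEpoch > cached.epoch()) {
                // Park the request and send an add-partitions request to the coordinator.
                return Decision.VERIFY_WITH_COORDINATOR;
            }
            // Same epoch, transaction still ongoing: append as usual.
            return Decision.APPEND;
        }
    }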
>
> 3. I'm a bit concerned about adding a generic "ABORTABLE_ERROR" given
> we already have a pretty messy error classification and error handling
> on the producer clients side --- I have a summary about the issues and
> a proposal to address this in
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> -- I understand we do not want to use "UNKNOWN_PRODUCER_ID" anymore
> and in fact we intend to deprecate it in KIP-360 and eventually remove
> it; but I'm wondering can we still use specific error codes. E.g. what
> about "InvalidProducerEpochException" since for new clients, the
> actual reason this would actually be rejected is indeed because the
> epoch on the coordinator caused the add-partitions-request from the
> brokers to be rejected anyways?
>
> 4. It seems we put the producer request into purgatory before we ever
> append the records, while other producer's records may still be
> appended during the time; and that potentially may result in some
> re-ordering compared with reception order. I'm not super concerned
> about it since Kafka does not guarantee reception ordering across
> producers anyways, but it may make the timestamps of records inside a
> partition to be more out-of-ordered. Are we aware of any scenarios
> such as future enhancements on log compactions that may be affected by
> this effect?
>
> Below are just minor comments:
>
> 5. In "AddPartitionsToTxnTransaction" field of
> "AddPartitionsToTxnRequest" RPC, the versions of those inner fields
> are "0-3" while I thought they should be "0+" still?
>
> 6. Regarding "we can place the request in a purgatory of sorts and
> check if there is any state for the transaction on the broker": I
> think at this time when we just do the checks against the cached
> state, we do not need to put the request to purgatory yet?
>
> 7. This is related to 3) above. I feel using "InvalidRecordException"
> for older clients may also be a bit confusing, and also it is not
> fatal -- for old clients, it would be better to be fatal since this indicates
> the client is doing something wrong and hence it should be closed.
> And in general I'd prefer to use slightly more specific meaning error
> codes for clients. That being said, I also feel
> "InvalidProducerEpochException" is not suitable for old versioned
> clients, and we'd have to pick one that old clients recognize. I'd
> prefer "InvalidTxnStateException" but that one is supposed to be
> returned from txn coordinators only today. I'd suggest we do a quick
> check in the current client's code path and see if that one would be
> handled if it's from a pro

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Justine Olshan
That's a fair point about other clients.

I think the abortable error case is interesting because I'm curious how
other clients would handle this. I assume they would need to implement
handling for the error code unless they did something like "any unknown
error codes/any codes that aren't x,y,z are retriable." I would hope that
unknown error codes were fatal, and if the code was implemented it would
abort the transaction. But I will think on this too.

As for InvalidRecord -- you mentioned it was not fatal, but I'm taking a
look through the code. We would see this on handling the produce response.
If I recall correctly, we check if errors are retriable. I think this error
would not be retriable. But I guess the concern here is that it is not
enough for just that batch to fail. I guess I hadn't considered fully
fencing the old producer but there are valid arguments here why we would
want to.

Thanks,
Justine

On Fri, Jan 20, 2023 at 2:35 PM Guozhang Wang 
wrote:

> Thanks Justine for the replies! I agree with most of your thoughts.
>
> Just for 3/7), though I agree for our own AK producer, since we do
> "nextRequest(boolean hasIncompleteBatches)", we guarantee the end-txn
> would not be sent until we've effectively flushed, but I was referring
> to any future bugs or other buggy clients that the same client may get
> into this situation, in which case we should give the client a clear
> msg that "you did something wrong, and hence now you should fatally
> close yourself". What I'm concerned about is that, by seeing an
> "abortable error" or in some rare cases an "invalid record", the
> client could not realize "something that's really bad happened". So
> it's not about adding a new error, it's mainly about those real buggy
> situations causing such "should never happen" cases, where the errors
> returned would not be informative enough.
>
> Thinking in other ways, if we believe that for most cases such error
> codes would not reach the original clients since they would be
> disconnected or even gone by that time, and only in some rare cases
> they would still be seen by the sending clients, then why not make
> them more fatal and more specific than generic.
>
> Guozhang
>
> On Fri, Jan 20, 2023 at 1:59 PM Justine Olshan
>  wrote:
> >
> > Hey Guozhang. Thanks for taking a look and for the detailed comments!
> I'll
> > do my best to address below.
> >
> > 1. I see what you are saying here, but I think I need to look through the
> > sequence of events you mention. Typically we've seen this issue in a few
> > cases.
> >
> >  One is when we have a producer disconnect when trying to produce.
> > Typically in these cases, we abort the transaction. We've seen that after
> > the markers are written, the disconnection can sometimes cause the
> request
> > to get flushed to the broker. In this case, we don't need client handling
> > because the producer we are responding to is gone. We just needed to make
> > sure we didn't write to the log on the broker side. I'm trying to think
> of
> > a case where we do have the client to return to. I'd think the same
> client
> > couldn't progress to committing the transaction unless the produce
> request
> > returned right? Of course, there is the incorrectly written clients case.
> > I'll think on this a bit more and let you know if I come up with another
> > scenario when we would return to an active client when the transaction is
> > no longer ongoing.
> >
> > I was not aware that we checked the result of a send after we commit
> > though. I'll need to look into that a bit more.
> >
> > 2. There were some questions about this in the discussion. The plan is to
> > handle overflow with the mechanism we currently have in the producer. If
> we
> > try to bump and the epoch will overflow, we actually allocate a new
> > producer ID. I need to confirm the fencing logic on the last epoch (ie,
> we
> > probably shouldn't allow any records to be produced with the final epoch
> > since we can never properly fence that one).
> >
> > 3. I can agree with you that the current error handling is messy. I
> recall
> > taking a look at your KIP a while back, but I think I mostly saw the
> > section about how the errors were wrapped. Maybe I need to take another
> > look. As for abortable error, the idea was that the handling would be
> > simple -- if this error is seen, the transaction should be aborted -- no
> > other logic about previous state or requests necessary. Is your concern
> > simply abou

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-24 Thread Justine Olshan
Hey Matthias,

I have actually never heard of KIP-280 so thanks for bringing it up. That
seems interesting. I wonder how it would work though -- would it build an
offset map with just the latest timestamp for a key? I wonder if ordering
assumptions are baked in there -- and if so, why not just use offset-based
compaction?

I was also not aware of this "guarantee" with regards to broker side time.
I think that we can do in-order handling for a given producer, but not
across all producers. However, we can't guarantee that anyway.

Let me know if you have any concerns here.

Thanks,
Justine

On Mon, Jan 23, 2023 at 6:33 PM Matthias J. Sax  wrote:

> Just a side note about Guozhang comments about timestamps.
>
> If the producer sets the timestamp, putting the record into purgatory
> seems not to be an issue (as already said: for this case we don't
> guarantee timestamp order between writes of different producers anyway).
> However, if the broker sets the timestamp, the expectation is that there
> is no out-of-order data in the partition ever; if we would introduce
> out-of-order data for this case (for interleaved writes of different
> producers), it seems we would violate the current contract? (To be fair:
> I don't know if that's an official contract, but I assume people rely on
> this behavior -- and it is "advertised" in many public talks...)
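
For reference, the two cases contrasted here correspond to the existing
topic-level timestamp setting; shown below purely for illustration:

    # Topic-level setting -- determines who assigns the record timestamp.
    # Producer supplies the timestamp (the default):
    message.timestamp.type=CreateTime
    # Broker assigns the timestamp at append time:
    message.timestamp.type=LogAppendTime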
>
> About compaction: there is actually KIP-280 that adds timestamp based
> compaction, which is a very useful feature for Kafka Streams with regard
> to out-of-order data handling. So the impact if we introduce
> out-of-order data could be larger in scope.
>
>
> -Matthias
>
>
> On 1/20/23 4:48 PM, Justine Olshan wrote:
> > Hey Artem,
> >
> > I see there is a check for transactional producers. I'm wondering if we
> > don't handle the epoch overflow case. I'm also not sure it will be a huge
> > issue to extend to transactional producers, but maybe I'm missing
> something.
> >
> > As for the recovery path -- I think Guozhang's point was if we have a bad
> > client that repeatedly tries to produce without adding to the transaction
> > we would do the following:
> > a) if not fatal, we just fail the produce request over and over
> > b) if fatal, we fence the producer
> >
> > Here with B, the issue with the client would be made clear more quickly.
> I
> > suppose there are some intermediate cases where the issue only occurs
> > sometimes, but I wonder if we should consider how to recover with clients
> > who don't behave as expected anyway.
> >
> > I think there is a place for the abortable error that we are adding --
> just
> > abort and try again. But I think there are also some cases where trying
> to
> > recover overcomplicates some logic. Especially if we are considering
> older
> > clients -- there I'm not sure if there's a ton we can do besides fail the
> > batch or fence the producer. With newer clients, we can consider more
> > options for what can just be recovered after aborting. But epochs might
> be
> > a hard one unless we also want to reset producer ID.
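
For reference, the "just abort and try again" handling matches the standard
transactional-producer loop documented for KafkaProducer; the sketch below
assumes the new abortable error would surface to the application as a
non-fatal KafkaException:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.AuthorizationException;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;
    import org.apache.kafka.common.errors.ProducerFencedException;

    final class TransactionalLoopSketch {
        static void runOnce(KafkaProducer<String, String> producer) {
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                producer.commitTransaction();
            } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
                // Fatal: the producer cannot continue; close it and create a new instance.
                producer.close();
            } catch (KafkaException e) {
                // Abortable: abort this transaction and retry from the top.
                producer.abortTransaction();
            }
        }
    }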
> >
> > Thanks,
> > Justine
> >
> >
> >
> > On Fri, Jan 20, 2023 at 3:59 PM Artem Livshits
> >  wrote:
> >
> >>>   besides the poorly written client case
> >>
> >> A poorly written client could create a lot of grief to people who run
> Kafka
> >> brokers :-), so when deciding to make an error fatal I would see if
> there
> >> is a reasonable recovery path rather than how often it could happen.
> If we
> >> have solid implementation of transactions (which I hope we'll do as a
> >> result of this KIP), it would help to recover from a large class of
> errors
> >> by just aborting a transaction, even if the cause of error is a race
> >> condition or etc.
> >>
> >> -Artem
> >>
> >> On Fri, Jan 20, 2023 at 3:26 PM Justine Olshan
> >> 
> >> wrote:
> >>
> >>> Artem --
> >>> I guess the discussion path we were going down is when we expect to see
> >>> this error. I mentioned that it was hard to come up with cases for when
> >> the
> >>> producer would still be around to receive the error besides the poorly
> >>> written client case.
> >>> If we don't expect to have a producer to receive the response, it sort
> of
> >>> makes sense for it to be fatal.
> >>>
> >>> I had some discussion with Jason offline about the epoch being off
> cases
> >>> and I'm not sure we could find a ton 

Re: [VOTE] KIP-890: Transactions Server Side Defense

2023-02-03 Thread Justine Olshan
Thanks everyone! I'm going to close the vote.
The KIP is accepted with five binding votes from Jason, Guozhang, Matthias,
David (and me), and two non-binding votes from Colt and Artem.

Thanks again,
Justine

On Thu, Feb 2, 2023 at 11:41 PM David Jacot 
wrote:

> Thanks for the KIP, Justine. +1 (binding)
>
> On Fri, Feb 3, 2023 at 1:36 AM Matthias J. Sax  wrote:
>
> > Thanks for the KIP!
> >
> > +1 (binding)
> >
> >
> > On 2/2/23 4:18 PM, Artem Livshits wrote:
> > > (non-binding) +1.  Looking forward to the implementation and fixing the
> > > issues that we've got.
> > >
> > > -Artem
> > >
> > > On Mon, Jan 23, 2023 at 2:25 PM Guozhang Wang <
> > guozhang.wang...@gmail.com>
> > > wrote:
> > >
> > >> Thanks Justine, I have no further comments on the KIP. +1.
> > >>
> > >> On Tue, Jan 17, 2023 at 10:34 AM Jason Gustafson
> > >>  wrote:
> > >>>
> > >>> +1. Thanks Justine!
> > >>>
> > >>> -Jason
> > >>>
> > >>> On Tue, Jan 10, 2023 at 3:46 PM Colt McNealy 
> > >> wrote:
> > >>>
> > >>>> (non-binding) +1. Thank you for the KIP, Justine! I've read it; it
> > >> makes
> > >>>> sense to me and I am excited for the implementation.
> > >>>>
> > >>>> Colt McNealy
> > >>>> *Founder, LittleHorse.io*
> > >>>>
> > >>>>
> > >>>> On Tue, Jan 10, 2023 at 10:46 AM Justine Olshan
> > >>>>  wrote:
> > >>>>
> > >>>>> Hi everyone,
> > >>>>>
> > >>>>> I would like to start a vote on KIP-890 which aims to prevent some
> > >> of the
> > >>>>> common causes of hanging transactions and make other general
> > >> improvements
> > >>>>> to transactions in Kafka.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> > >>>>>
> > >>>>> Please take a look if you haven't already and vote!
> > >>>>>
> > >>>>> Justine
> > >>>>>
> > >>>>
> > >>
> > >
> >
>


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-30 Thread Justine Olshan
Hi Artem,
Thanks for the update to include motivation. Makes sense to me.
Justine

On Wed, Jun 29, 2022 at 6:51 PM Luke Chen  wrote:

> Hi Artem,
>
> Thanks for the update.
> LGTM.
>
> Luke
>
> On Thu, Jun 30, 2022 at 6:51 AM Artem Livshits
>  wrote:
>
> > Thank you for your feedback. I've updated the KIP to elaborate on the
> > motivation and provide some background on producer ids and how we measure
> > them.
> >
> > Also, after some thinking and discussing it offline with some folks, I
> > think that we don't really need partition level metrics.  We can use
> > existing tools to do granular debugging.  I've moved partition level
> > metrics to the rejected alternatives section.
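
As a usage note, a broker gauge like this would typically be read over JMX; the
sketch below assumes the metric is registered under the MBean name shown, which
may differ in the final implementation:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ProducerIdCountReader {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on localhost:9999.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                Object value = conn.getAttribute(
                        new ObjectName("kafka.server:type=ReplicaManager,name=ProducerIdCount"),
                        "Value");
                System.out.println("ProducerIdCount = " + value);
            }
        }
    }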
> >
> > -Artem
> >
> > On Wed, Jun 29, 2022 at 1:57 AM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Could you elaborate more in the motivation section?
> > > I'm interested to know what kind of scenarios this metric can benefit
> > for.
> > > What could it bring to us when a topic partition has 100
> ProducerIdCount
> > VS
> > > another topic partition has 10 ProducerIdCount?
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Wed, Jun 29, 2022 at 6:30 AM Jun Rao 
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the KIP.
> > > >
> > > > Could you explain the partition level ProducerIdCount a bit more?
> Does
> > > that
> > > > reflect the number of PIDs ever produced to a partition since the
> > broker
> > > is
> > > > started? Do we reduce the count after a PID expires?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Wed, Jun 22, 2022 at 1:08 AM David Jacot
> >  > > >
> > > > wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > The KIP LGTM.
> > > > >
> > > > > Thanks,
> > > > > David
> > > > >
> > > > > On Tue, Jun 21, 2022 at 9:32 PM Artem Livshits
> > > > >  wrote:
> > > > > >
> > > > > > If there is no other feedback I'm going to start voting in a
> couple
> > > > days.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Fri, Jun 17, 2022 at 3:50 PM Artem Livshits <
> > > alivsh...@confluent.io
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thank you for your feedback.  Updated the KIP and added the
> > > Rejected
> > > > > > > Alternatives section.
> > > > > > >
> > > > > > > -Artem
> > > > > > >
> > > > > > > On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma  >
> > > > wrote:
> > > > > > >
> > > > > > >> If we don't track them separately, then it makes sense to keep
> > it
> > > as
> > > > > one
> > > > > > >> metric. I'd probably name it ProducerIdCount in that case.
> > > > > > >>
> > > > > > >> Ismael
> > > > > > >>
> > > > > > >> On Fri, Jun 17, 2022 at 12:04 PM Artem Livshits
> > > > > > >>  wrote:
> > > > > > >>
> > > > > > >> > Do you propose to have 2 metrics instead of one?  Right now
> we
> > > > don't
> > > > > > >> track
> > > > > > >> > if the producer id was transactional or idempotent and for
> > > metric
> > > > > > >> > collection we'd either have to pay the cost of iterating
> over
> > > > > producer
> > > > > > >> ids
> > > > > > >> > (which could be a lot) or split the producer map into 2 or
> > cache
> > > > the
> > > > > > >> > counts, which complicates the code.
> > > > > > >> >
> > > > > > >> > From the monitoring perspective, I think one metric should
> be
> > > > good,
> > > > > but
> > > > > > >> > maybe I'm missing some scenarios.
> > > > > > >> >
> > > > > > >> > -Artem
> > > > > > >> >
> > > > > > >> > On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > > wrote:
> > > > > > >> >
> > > > > > >> > > I like the suggestion to have IdempotentProducerCount and
> > > > > > >> > > TransactionalProducerCount metrics.
> > > > > > >> > >
> > > > > > >> > > Ismael
> > > > > > >> > >
> > > > > > >> > > On Thu, Jun 16, 2022 at 2:27 PM Artem Livshits
> > > > > > >> > >  wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Ismael,
> > > > > > >> > > >
> > > > > > >> > > > Thank you for your feedback.  Yes, this is counting the
> > > number
> > > > > of
> > > > > > >> > > producer
> > > > > > >> > > > ids tracked by the partition and broker.  Another
> options
> > I
> > > > was
> > > > > > >> > thinking
> > > > > > >> > > of
> > > > > > >> > > > are the following:
> > > > > > >> > > >
> > > > > > >> > > > - IdempotentProducerCount
> > > > > > >> > > > - TransactionalProducerCount
> > > > > > >> > > > - ProducerIdCount
> > > > > > >> > > >
> > > > > > >> > > > Let me know if one of these seems better, or I'm open to
> > > other
> > > > > name
> > > > > > >> > > > suggestions as well.
> > > > > > >> > > >
> > > > > > >> > > > -Artem
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma <
> > > > ism...@juma.me.uk
> > > > > >
> > > > > > >> > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Thanks for the KIP.
> > > > > > >> > > > >
> > > > > > >> > > > > ProducerCount seems like a misleading name since
> > producers
> > > > > > >> without a
> > > > > > >> > > > > producer id are not counted. Is this meant 

Re: [DISCUSS] KIP-848: The Next Generation of the Consumer Rebalance Protocol

2022-07-08 Thread Justine Olshan
Hi David,
Thanks for sharing this KIP! Really exciting to hear how we are changing
the protocol! The motivation section really made me realize how useful this
change will be.

I've done a first pass of the KIP, and may have more questions, but thought
I'd start with a few I thought of already.

   - I saw some usages of topic IDs in the new
   protocols/records/interfaces, but wasn't sure if they were used everywhere.
   Are you planning on relying on topic IDs for the new protocol?
   - I saw the section about using a feature flag first before integrating
   the feature with ibp/metadata version. I understand the logic for testing
   with the flag, but it also seems like a bit of work to deprecate and switch
   to the ibp/metadata version approach. What was the reasoning behind
   switching the enablement mechanism?
   - Generally, are there implications for KRaft here? (IBP/metadata
   version is something that I think of) And if so, will both cluster types be
   supported?

Thanks again to everyone who worked on this KIP!
Justine

On Wed, Jul 6, 2022 at 1:45 AM David Jacot 
wrote:

> Hi all,
>
> I would like to start a discussion thread on KIP-848: The Next
> Generation of the Consumer Rebalance Protocol. With this KIP, we aim
> to make the rebalance protocol (for consumers) more reliable, more
> scalable, easier to implement for clients, and easier to debug for
> operators.
>
> The KIP is here: https://cwiki.apache.org/confluence/x/HhD1D.
>
> Please take a look and let me know what you think.
>
> Best,
> David
>
> PS: I will be away from July 18th to August 8th. That gives you a bit
> of time to read and digest this long KIP.
>


Discard KIP-185

2022-07-13 Thread Justine Olshan
Hey Apache Kafka community

I was doing some research on idempotent producers and found KIP-185.

It seems like this KIP was replaced by KIP-679.
There are minor differences between the KIPs, but I'd say KIP-679 replaces
KIP-185.

Do we think it is valid to discard KIP-185? Or at least move it from "under
discussion" on the KIP page?

Thanks,
Justine


Re: [DISCUSS] KIP-853: KRaft Voters Change

2022-07-21 Thread Justine Olshan
Hey Jose -- this seems like an important KIP! And I enjoy seeing more Uuid
usage :)
I was curious a bit more about the motivation here. That section seems to
be missing.

Thanks for sharing the KIP!
Justine

On Thu, Jul 21, 2022 at 10:30 AM Niket Goel 
wrote:

> Hey Jose,
>
> Thanks for the KIP. This is a good improvement and will make the KRaft
> implementation much more robust in the face of failures and generally make
> it more flexible for users.
>
> I did a first pass through the KIP and here are some comments (some of
> these might just be a little uninformed, so feel free to direct me to
> supplemental reading):
> Overall protocol safety wise, the reconfiguration operations look safe.
>
> > This UUID will be generated once and persisted as part of the quorum
> state for the topic partition
> Do we mean that it will be generated every time the disk on the replica is
> primed (so every disk incarnation will have a UUID)? I think you describe it
> in a section further below. Best to pull it up to here — “the replica uuid
> is automatically generated once by the replica when persisting the quorum
> state for the first time.”
>
> > If there are any pending voter change operations the leader will wait
> for them to finish.
> Will new requests be rejected or queued up behind the pending operation?
> (I am assuming rejected, but want to confirm)
>
> > When this option is used the leader of the KRaft topic partition will
> not allow the AddVoter RPC to add replica IDs that are not described in the
> configuration and it would not allow the RemoveVoter RPC to remove replica
> IDs that are described in the configuration.
> Bootstrapping is a little tricky I think. Is it safer/simpler to say that
> any add/remove RPC operations are blocked until all nodes in the config
> are processed? The way it is worded above makes it seem like the leader
> will accept adds of the same node from outside. Is that the case?
>
> > The KRaft leader will not perform writes from the state machine (active
> controller) until it has written to the log an AddVoterRecord for every
> replica id in the controller.quorum.voters  configuration.
> Just thinking through - One of the safety requirements for the protocol is
> for a leader to commit at least one write in an epoch before doing config
> changes, right? In this special case we should be ok because the quorum has
> no one but the leader in the beginning. Is that the thought process?
>
> > controller.quorum.bootstrap.servers vs controller.quorum.voters
> I understand the use of quorum.voters, but the bootstrap.servers is not
> entirely clear to me. So in the example of starting the cluster with one
> voter, will that one voter be listed here? And when using this
> configuration, is the expectation that quorum.voters is empty, or will it
> eventually get populated with the new quorum members? e.g. further in the
> kip we say — “Replica 3 will discover the partition leader using
> controller.quorum.voters”; so I guess it will be populated?
>
> > This check will be removed and replicas will reply to votes request when
> the candidate is not in the voter set or the voting replica is not in the
> voter set.
> This is a major change IMO and I think it would be good if we could
> somehow highlight it in the KIP to aid a future reader.
>
> > This also means that the KRaft implementation needs to handle this
> uncommitted state getting truncated and reverted.
> Do we need to talk about the specific behavior a little more here? I mean
> how does this affect any inflight messages with the quorum moving between
> different values? (Just a brief explanation of why it works.)
>
> > This state can be discovered by a client by using the DescribeQuorum
> RPC, the Admin client or the kafka-quorum.sh CLI.
> The describeQuorum RPC does not respond with a list of observers today. We
> would need a section to fix that.
>
> > The client can now decide to add replica (3, UUID3') to the set of
> voters using the AddVoter RPC, the Admin client or the kafka-quorum.sh CLI.
> Trying to understand the general thought process: the addition of this
> node back into the quorum will be a manually triggered operation and not
> something the node will attempt by itself?
>
> This is a general wonderment, and feel free to ignore it:
> Might be good to list some scenarios demonstrating the safety, e.g. how
> do we ensure that there is no loss of availability during an addVoter
> operation when the leader goes down. Or how a failed operation is safely
> removed from the log and reverted.
>
> > On Jul 21, 2022, at 9:49 AM, José Armando García Sancio
>  wrote:
> >
> > Hi all,
> >
> > I would like to start the discussion on my design to support
> > dynamically changing the set of voters in the KRaft cluster metadata
> > topic partition.
> >
> > KIP URL:
> https://cwiki.apache.org/confluence/x/nyH1D
> >
> > Thanks!
> > -José
>
>


[DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-07-21 Thread Justine Olshan
Hey all!

I'd like to start a discussion on my proposal to separate time-based
producer ID expiration from transactional ID expiration by introducing a
new configuration.

The KIP is pretty small and simple, but will be helpful in controlling
memory usage in brokers -- especially now that by default producers are
idempotent and create producer ID state.
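
For illustration, the new setting would sit alongside the existing
transactional one in the broker configuration; the values below are the
default proposed in the KIP (one day) and the existing transactional default
(seven days), not recommendations:

    # Broker configuration (server.properties) -- illustrative values
    # New (KIP-854): how long idempotent producer ID state is retained
    producer.id.expiration.ms=86400000
    # Existing: how long transactional IDs are retained
    transactional.id.expiration.ms=604800000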

Please take a look and leave any comments you may have!

KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
JIRA: https://issues.apache.org/jira/browse/KAFKA-14097

Thanks!
Justine


Re: [DISCUSS] KIP-854 Separate configuration for producer ID expiry

2022-07-25 Thread Justine Olshan
Hey Bill,
Thanks! I was just going to say that hopefully
transactional.id.expiration.ms would also be over the delivery timeout. :)
Thanks for the +1!

Justine

On Mon, Jul 25, 2022 at 9:17 AM Bill Bejeck  wrote:

> Hi Justine,
>
> I just took another look at the KIP, and I realize my question/suggestion
> about default values has already been addressed in the `Compatibility`
> section.
>
> I'm +1 on the KIP.
>
> -Bill
>
> On Thu, Jul 21, 2022 at 6:20 PM Bill Bejeck  wrote:
>
> > Hi Justine,
> >
> > Thanks for the well written KIP, this looks like it will be a useful
> > addition.
> >
> > Overall the KIP looks good to me, I have one question/comment.
> >
> > You mentioned that setting the `producer.id.expiration.ms` less than the
> > delivery timeout could lead to duplicates, which makes sense.  To help
> > avoid this situation, do we want to consider a default value that is the
> > same as the delivery timeout?
> >
> > Thanks again for the KIP.
> >
> > Bill
> >
> > On Thu, Jul 21, 2022 at 4:54 PM Justine Olshan
> >  wrote:
> >
> >> Hey all!
> >>
> >> I'd like to start a discussion on my proposal to separate time-based
> >> producer ID expiration from transactional ID expiration by introducing a
> >> new configuration.
> >>
> >> The KIP is pretty small and simple, but will be helpful in controlling
> >> memory usage in brokers -- especially now that by default producers are
> >> idempotent and create producer ID state.
> >>
> >> Please take a look and leave any comments you may have!
> >>
> >> KIP:
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
> >> JIRA: https://issues.apache.org/jira/browse/KAFKA-14097
> >>
> >> Thanks!
> >> Justine
> >>
> >
>

