Re: [VOTE] 3.7.0 RC4

2024-03-01 Thread Stanislav Kozlovski
Makes sense, thanks Ismael. I've updated it.

The release is now out! Thank you to all who contributed!

There are two final things to do in relation to the release, as per the
wiki: introducing 3.7 into the respective compatibility system tests for
Core, Clients, and Streams.

Here are the PRs for that:
- https://github.com/apache/kafka/pull/15452
- https://github.com/apache/kafka/pull/15453
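
For anyone who wants to exercise those suites locally once the PRs land, here
is a rough sketch using the ducktape Docker runner from the apache/kafka repo
(the TC_PATHS values are illustrative assumptions; check tests/kafkatest/tests/
for the exact file names):

```bash
# Sketch only: run a couple of the version-compatibility system test suites in Docker.
# The TC_PATHS values below are assumptions; adjust them to the actual test files.
cd kafka
TC_PATHS="tests/kafkatest/tests/client/client_compatibility_features_test.py" \
  bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/tests/streams/streams_upgrade_test.py" \
  bash tests/docker/run_tests.sh
```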

On Wed, Feb 28, 2024 at 4:17 PM Ismael Juma  wrote:

> Hi,
>
> mvnrepository doesn't matter, we should update the guidelines to remove it.
> All that matters is that it has reached the maven central repository.
>
> Ismael
>
> On Tue, Feb 27, 2024 at 9:49 AM Stanislav Kozlovski
>  wrote:
>
> > The thinking is that it is available for use - and it is in maven
> central -
> > https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13 - but
> > the
> > mvnrepository seems to be an unofficial website that offers a nice UI for
> > accessing maven central. The release is in the latter -
> > https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13
> >
> > On Tue, Feb 27, 2024 at 6:35 PM Divij Vaidya 
> > wrote:
> >
> > > We wait before making the announcement. The rationale is that there is
> > not
> > > much point announcing a release if folks cannot start using that
> version
> > > artifacts immediately.
> > >
> > > See "Wait for about a day for the artifacts to show up in apache mirror
> > > (releases, public group) and maven central (mvnrepository.com or
> > maven.org
> > > )."
> > > in the release process wiki.
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Tue, Feb 27, 2024 at 4:43 PM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Hey all,
> > > >
> > > > Everything site-related is merged.
> > > >
> > > > I have been following the final steps of the release process.
> > > > - Docker contains the release -
> > > https://hub.docker.com/r/apache/kafka/tags
> > > > - Maven central contains the release -
> > > >
> > > >
> > >
> >
> https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13/3.7.0/versions
> > > > .
> > > > Note it says Feb 9 publish date, but it was just published. The RC4
> > files
> > > > were created on Feb 9 though, so I assume that's why it says that
> > > > - mvnrepository is NOT yet up to date -
> > > > https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
> and
> > > > https://mvnrepository.com/artifact/org.apache.kafka/kafka
> > > >
> > > > Am I free to announce the release, or should I wait more for
> > > MVNRepository
> > > > to get up to date? For what it's worth, I "Released" the files 24
> hours
> > > ago
> > > >
> > > > On Mon, Feb 26, 2024 at 10:42 AM Stanislav Kozlovski <
> > > > stanis...@confluent.io>
> > > > wrote:
> > > >
> > > > >
> > > > > This vote passes with *10 +1 votes* (3 bindings) and no 0 or -1
> > votes.
> > > > >
> > > > > +1 votes
> > > > >
> > > > > PMC Members (binding):
> > > > > * Mickael Maison
> > > > > * Justine Olshan
> > > > > * Divij Vaidya
> > > > >
> > > > > Community (non-binding):
> > > > > * Proven Provenzano
> > > > > * Federico Valeri
> > > > > * Vedarth Sharma
> > > > > * Andrew Schofield
> > > > > * Paolo Patierno
> > > > > * Jakub Scholz
> > > > > * Josep Prat
> > > > >
> > > > > 
> > > > >
> > > > > 0 votes
> > > > >
> > > > > * No votes
> > > > >
> > > > > 
> > > > >
> > > > > -1 votes
> > > > >
> > > > > * No votes
> > > > >
> > > > > 
> > > > >
> > > > > Vote thread:
> > > > > https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9
> > > > >
> > > > > I'll continue with the release process and the release announcement
> > > will
> > > > > follow ASAP.
> > > > >
> > > > > Best,
> > > > > Stanislav
> > > > >

[ANNOUNCE] Apache Kafka 3.7.0

2024-02-27 Thread Stanislav Kozlovski
The Apache Kafka community is pleased to announce the release of
Apache Kafka 3.7.0.

This is a minor release that includes new features, fixes, and
improvements from 296 JIRAs.

An overview of the release and its notable changes can be found in the
release blog post:
https://kafka.apache.org/blog#apache_kafka_370_release_announcement

All of the changes in this release can be found in the release notes:
https://www.apache.org/dist/kafka/3.7.0/RELEASE_NOTES.html

You can download the source and binary release (Scala 2.12, 2.13) from:
https://kafka.apache.org/downloads#3.7.0
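
As a quick way to try the release, the standard single-node KRaft quickstart
works against the binary tarball (a minimal sketch, assuming the Scala 2.13
download and a local setup):

```bash
# Minimal single-node KRaft quickstart against the 3.7.0 binary (sketch).
tar -xzf kafka_2.13-3.7.0.tgz && cd kafka_2.13-3.7.0
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties &

# Create a topic, then produce and consume a few records with the console tools.
bin/kafka-topics.sh --create --topic quickstart --bootstrap-server localhost:9092
bin/kafka-console-producer.sh --topic quickstart --bootstrap-server localhost:9092
bin/kafka-console-consumer.sh --topic quickstart --from-beginning --bootstrap-server localhost:9092
```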

---


Apache Kafka is a distributed streaming platform with four core APIs:


** The Producer API allows an application to publish a stream of records to
one or more Kafka topics.

** The Consumer API allows an application to subscribe to one or more
topics and process the stream of records produced to them.

** The Streams API allows an application to act as a stream processor,
consuming an input stream from one or more topics and producing an
output stream to one or more output topics, effectively transforming the
input streams to output streams.

** The Connector API allows building and running reusable producers or
consumers that connect Kafka topics to existing applications or data
systems. For example, a connector to a relational database might
capture every change to a table.
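
As a concrete example, the FileStream connectors bundled with the release can
be run in standalone mode (a sketch; note that in recent Kafka releases the
file connectors are not on the default classpath, so their jar may need to be
added to plugin.path first, and the exact jar name depends on the release):

```bash
# Sketch: run the bundled file source and sink connectors in standalone mode.
echo "plugin.path=libs/connect-file-3.7.0.jar" >> config/connect-standalone.properties
bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties config/connect-file-sink.properties
```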


With these APIs, Kafka can be used for two broad classes of application:

** Building real-time streaming data pipelines that reliably get data
between systems or applications.

** Building real-time streaming applications that transform or react
to the streams of data.


Apache Kafka is in use at large and small companies worldwide, including
Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
Target, The New York Times, Uber, Yelp, and Zalando, among others.

A big thank you to the following 146 contributors to this release!
(Please report an unintended omission)

Abhijeet Kumar, Akhilesh Chaganti, Alieh, Alieh Saeedi, Almog Gavra,
Alok Thatikunta, Alyssa Huang, Aman Singh, Andras Katona, Andrew
Schofield, Anna Sophie Blee-Goldman, Anton Agestam, Apoorv Mittal,
Arnout Engelen, Arpit Goyal, Artem Livshits, Ashwin Pankaj,
ashwinpankaj, atu-sharm, bachmanity1, Bob Barrett, Bruno Cadonna,
Calvin Liu, Cerchie, chern, Chris Egerton, Christo Lolov, Colin
Patrick McCabe, Colt McNealy, Crispin Bernier, David Arthur, David
Jacot, David Mao, Deqi Hu, Dimitar Dimitrov, Divij Vaidya, Dongnuo
Lyu, Eaugene Thomas, Eduwer Camacaro, Eike Thaden, Federico Valeri,
Florin Akermann, Gantigmaa Selenge, Gaurav Narula, gongzhongqiang,
Greg Harris, Guozhang Wang, Gyeongwon, Do, Hailey Ni, Hanyu Zheng, Hao
Li, Hector Geraldino, hudeqi, Ian McDonald, Iblis Lin, Igor Soarez,
iit2009060, Ismael Juma, Jakub Scholz, James Cheng, Jason Gustafson,
Jay Wang, Jeff Kim, Jim Galasyn, John Roesler, Jorge Esteban Quilcate
Otoya, Josep Prat, José Armando García Sancio, Jotaniya Jeel, Jouni
Tenhunen, Jun Rao, Justine Olshan, Kamal Chandraprakash, Kirk True,
kpatelatwork, kumarpritam863, Laglangyue, Levani Kokhreidze, Lianet
Magrans, Liu Zeyu, Lucas Brutschy, Lucia Cerchie, Luke Chen, maniekes,
Manikumar Reddy, mannoopj, Maros Orsak, Matthew de Detrich, Matthias
J. Sax, Max Riedel, Mayank Shekhar Narula, Mehari Beyene, Michael
Westerby, Mickael Maison, Nick Telford, Nikhil Ramakrishnan, Nikolay,
Okada Haruki, olalamichelle, Omnia G.H Ibrahim, Owen Leung, Paolo
Patierno, Philip Nee, Phuc-Hong-Tran, Proven Provenzano, Purshotam
Chauhan, Qichao Chu, Matthias J. Sax, Rajini Sivaram, Renaldo Baur
Filho, Ritika Reddy, Robert Wagner, Rohan, Ron Dagostino, Roon, runom,
Ruslan Krivoshein, rykovsi, Sagar Rao, Said Boudjelda, Satish Duggana,
shuoer86, Stanislav Kozlovski, Taher Ghaleb, Tang Yunzi, TapDang,
Taras Ledkov, tkuramoto33, Tyler Bertrand, vamossagar12, Vedarth
Sharma, Viktor Somogyi-Vass, Vincent Jiang, Walker Carlson,
Wuzhengyu97, Xavier Léauté, Xiaobing Fang, yangy, Ritika Reddy,
Yanming Zhou, Yash Mayya, yuyli, zhaohaidao, Zihao Lin, Ziming Deng

We welcome your help and feedback. For more information on how to
report problems, and to get involved, visit the project website at
https://kafka.apache.org/

Thank you!


Regards,

Stanislav Kozlovski
Release Manager for Apache Kafka 3.7.0


Re: [VOTE] 3.7.0 RC4

2024-02-27 Thread Stanislav Kozlovski
The thinking is that it is available for use - and it is in Maven Central -
https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13 - whereas
mvnrepository seems to be an unofficial website that offers a nice UI for
accessing Maven Central. The release is in the former -
https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13
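
For what it's worth, availability can also be confirmed directly against Maven
Central's canonical host, independent of any UI (a sketch; repo1.maven.org
serves Central's content):

```bash
# Sketch: confirm the 3.7.0 artifacts are served by Maven Central itself.
curl -s https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.13/maven-metadata.xml | grep 3.7.0
curl -sI https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.7.0/kafka-clients-3.7.0.jar | head -n 1
```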

On Tue, Feb 27, 2024 at 6:35 PM Divij Vaidya 
wrote:

> We wait before making the announcement. The rationale is that there is not
> much point announcing a release if folks cannot start using that version
> artifacts immediately.
>
> See "Wait for about a day for the artifacts to show up in apache mirror
> (releases, public group) and maven central (mvnrepository.com or maven.org
> )."
> in the release process wiki.
>
> --
> Divij Vaidya
>
>
>
> On Tue, Feb 27, 2024 at 4:43 PM Stanislav Kozlovski
>  wrote:
>
> > Hey all,
> >
> > Everything site-related is merged.
> >
> > I have been following the final steps of the release process.
> > - Docker contains the release -
> https://hub.docker.com/r/apache/kafka/tags
> > - Maven central contains the release -
> >
> >
> https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13/3.7.0/versions
> > .
> > Note it says Feb 9 publish date, but it was just published. The RC4 files
> > were created on Feb 9 though, so I assume that's why it says that
> > - mvnrepository is NOT yet up to date -
> > https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients and
> > https://mvnrepository.com/artifact/org.apache.kafka/kafka
> >
> > Am I free to announce the release, or should I wait more for
> MVNRepository
> > to get up to date? For what it's worth, I "Released" the files 24 hours
> ago
> >
> > On Mon, Feb 26, 2024 at 10:42 AM Stanislav Kozlovski <
> > stanis...@confluent.io>
> > wrote:
> >
> > >
> > > This vote passes with *10 +1 votes* (3 bindings) and no 0 or -1 votes.
> > >
> > > +1 votes
> > >
> > > PMC Members (binding):
> > > * Mickael Maison
> > > * Justine Olshan
> > > * Divij Vaidya
> > >
> > > Community (non-binding):
> > > * Proven Provenzano
> > > * Federico Valeri
> > > * Vedarth Sharma
> > > * Andrew Schofield
> > > * Paolo Patierno
> > > * Jakub Scholz
> > > * Josep Prat
> > >
> > > 
> > >
> > > 0 votes
> > >
> > > * No votes
> > >
> > > 
> > >
> > > -1 votes
> > >
> > > * No votes
> > >
> > > ----
> > >
> > > Vote thread:
> > > https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9
> > >
> > > I'll continue with the release process and the release announcement
> will
> > > follow ASAP.
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Sun, Feb 25, 2024 at 7:08 PM Mickael Maison <
> mickael.mai...@gmail.com
> > >
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Thanks for sorting out the docs issues.
> > >> +1 (binding)
> > >>
> > >> Mickael
> > >>
> > >> On Fri, Feb 23, 2024 at 11:50 AM Stanislav Kozlovski
> > >>  wrote:
> > >> >
> > >> > Some quick updates:
> > >> >
> > >> > There were some inconsistencies between the documentation in the
> > >> > apache/kafka repo and the one in kafka-site. The process is such
> that
> > >> the
> > >> > apache/kafka docs are the source of truth, but we had a few
> > divergences
> > >> in
> > >> > the other repo. I have worked on correcting those with:
> > >> > - MINOR: Reconcile upgrade.html with kafka-site/36's version
> > >> > <https://github.com/apache/kafka/pull/15406> and cherry-picked it
> > into
> > >> the
> > >> > 3.6 and 3.7 branches too
> > >> >
> > >> > Additionally, the 3.7 upgrade notes have been merged in
> apache/kafka -
> > >> MINOR:
> > >> > Add 3.7 upgrade notes <
> > https://github.com/apache/kafka/pull/15407/files
> > >> >.
> > >> >
> > >> > With that, I have opened a PR to move them to the kafka-site
> > repository
> > >> -
> > >> > https://github.com/apache/kafka-site/pull/587. That is awaiting review.

Re: [VOTE] 3.7.0 RC4

2024-02-27 Thread Stanislav Kozlovski
Hey all,

Everything site-related is merged.

I have been following the final steps of the release process.
- Docker contains the release - https://hub.docker.com/r/apache/kafka/tags
- Maven central contains the release -
https://central.sonatype.com/artifact/org.apache.kafka/kafka_2.13/3.7.0/versions.
Note it shows a Feb 9 publish date even though it was just published; the
RC4 files were created on Feb 9, so I assume that's why it says that.
- mvnrepository is NOT yet up to date -
https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients and
https://mvnrepository.com/artifact/org.apache.kafka/kafka

Am I free to announce the release, or should I wait longer for MVNRepository
to get up to date? For what it's worth, I "Released" the files 24 hours ago.
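
For reference, the Docker artifact listed above can be smoke-tested with the
image's documented defaults (a sketch; the single docker run relies on the
image's built-in single-node KRaft configuration):

```bash
# Sketch: pull the published image and start a single-node broker with default config.
docker pull apache/kafka:3.7.0
docker run -d --name kafka-370 -p 9092:9092 apache/kafka:3.7.0

# Exercise it from the host with the CLI tools from a local Kafka installation.
bin/kafka-topics.sh --create --topic smoke-test --bootstrap-server localhost:9092
```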

On Mon, Feb 26, 2024 at 10:42 AM Stanislav Kozlovski 
wrote:

>
> This vote passes with *10 +1 votes* (3 bindings) and no 0 or -1 votes.
>
> +1 votes
>
> PMC Members (binding):
> * Mickael Maison
> * Justine Olshan
> * Divij Vaidya
>
> Community (non-binding):
> * Proven Provenzano
> * Federico Valeri
> * Vedarth Sharma
> * Andrew Schofield
> * Paolo Patierno
> * Jakub Scholz
> * Josep Prat
>
> 
>
> 0 votes
>
> * No votes
>
> 
>
> -1 votes
>
> * No votes
>
> 
>
> Vote thread:
> https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9
>
> I'll continue with the release process and the release announcement will
> follow ASAP.
>
> Best,
> Stanislav
>
> On Sun, Feb 25, 2024 at 7:08 PM Mickael Maison 
> wrote:
>
>> Hi,
>>
>> Thanks for sorting out the docs issues.
>> +1 (binding)
>>
>> Mickael
>>
>> On Fri, Feb 23, 2024 at 11:50 AM Stanislav Kozlovski
>>  wrote:
>> >
>> > Some quick updates:
>> >
>> > There were some inconsistencies between the documentation in the
>> > apache/kafka repo and the one in kafka-site. The process is such that
>> the
>> > apache/kafka docs are the source of truth, but we had a few divergences
>> in
>> > the other repo. I have worked on correcting those with:
>> > - MINOR: Reconcile upgrade.html with kafka-site/36's version
>> > <https://github.com/apache/kafka/pull/15406> and cherry-picked it into
>> the
>> > 3.6 and 3.7 branches too
>> >
>> > Additionally, the 3.7 upgrade notes have been merged in apache/kafka -
>> MINOR:
>> > Add 3.7 upgrade notes <https://github.com/apache/kafka/pull/15407/files
>> >.
>> >
>> > With that, I have opened a PR to move them to the kafka-site repository
>> -
>> > https://github.com/apache/kafka-site/pull/587. That is awaiting review.
>> >
>> > Similarly, the 3.7 blog post is ready for review again
>> > <https://github.com/apache/kafka-site/pull/578> and awaiting a review
>> on 37:
>> > Update default docs to point to the 3.7.0 release docs
>> > <https://github.com/apache/kafka-site/pull/582>
>> >
>> > I also have a WIP for fixing the 3.6 docs in the kafka-site repo
>> > <https://github.com/apache/kafka-site/pull/586>. This isn't really
>> related
>> > to the release, but it's good to do.
>> >
>> > On Wed, Feb 21, 2024 at 4:55 AM Luke Chen  wrote:
>> >
>> > > Hi all,
>> > >
>> > > I found there is a bug (KAFKA-16283
>> > > <https://issues.apache.org/jira/browse/KAFKA-16283>) in the built-in
>> > > `RoundRobinPartitioner`, and it will cause only half of the partitions
>> > > receiving the records. (I'm surprised our tests didn't catch it!?)
>> > > After further testing, I found this issue already existed in v3.0.0.
>> (I
>> > > didn't test 2.x versions)
>> > > I think this should not be a blocker to v3.7.0 since it's been there
>> for
>> > > years.
>> > > But I think we should, at least, document it to notify users about
>> this
>> > > known issue.
>> > > I've created 2 PRs to document it:
>> > > https://github.com/apache/kafka/pull/15400
>> > > https://github.com/apache/kafka-site/pull/585
>> > >
>> > > Let me know what you think.
>> > >
>> > > Thanks.
>> > > Luke
>> > >
>> > > On Wed, Feb 21, 2024 at 10:52 AM Proven Provenzano
>> > >  wrote:
>> > >
>> > > > HI,
>> > > >
>> > > > I've downloaded, built from source and then validated JBOD with
>> KRaft
>> > > works
> > > along with migrating a cluster with JBOD from ZK to KRaft works.

Re: [VOTE] 3.7.0 RC4

2024-02-26 Thread Stanislav Kozlovski
*(double-sending this to kafka-clients & users, as I forgot initially)*

This vote passes with *10 +1 votes* (3 binding) and no 0 or -1 votes.

+1 votes

PMC Members (binding):
* Mickael Maison
* Justine Olshan
* Divij Vaidya

Community (non-binding):
* Proven Provenzano
* Federico Valeri
* Vedarth Sharma
* Andrew Schofield
* Paolo Patierno
* Jakub Scholz
* Josep Prat



0 votes

* No votes



-1 votes

* No votes



Vote thread:
https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9

I'll continue with the release process and the release announcement will
follow ASAP.

On Mon, Feb 26, 2024 at 10:42 AM Stanislav Kozlovski 
wrote:

>
> This vote passes with *10 +1 votes* (3 bindings) and no 0 or -1 votes.
>
> +1 votes
>
> PMC Members (binding):
> * Mickael Maison
> * Justine Olshan
> * Divij Vaidya
>
> Community (non-binding):
> * Proven Provenzano
> * Federico Valeri
> * Vedarth Sharma
> * Andrew Schofield
> * Paolo Patierno
> * Jakub Scholz
> * Josep Prat
>
> 
>
> 0 votes
>
> * No votes
>
> 
>
> -1 votes
>
> * No votes
>
> 
>
> Vote thread:
> https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9
>
> I'll continue with the release process and the release announcement will
> follow ASAP.
>
> Best,
> Stanislav
>
> On Sun, Feb 25, 2024 at 7:08 PM Mickael Maison 
> wrote:
>
>> Hi,
>>
>> Thanks for sorting out the docs issues.
>> +1 (binding)
>>
>> Mickael
>>
>> On Fri, Feb 23, 2024 at 11:50 AM Stanislav Kozlovski
>>  wrote:
>> >
>> > Some quick updates:
>> >
>> > There were some inconsistencies between the documentation in the
>> > apache/kafka repo and the one in kafka-site. The process is such that
>> the
>> > apache/kafka docs are the source of truth, but we had a few divergences
>> in
>> > the other repo. I have worked on correcting those with:
>> > - MINOR: Reconcile upgrade.html with kafka-site/36's version
>> > <https://github.com/apache/kafka/pull/15406> and cherry-picked it into
>> the
>> > 3.6 and 3.7 branches too
>> >
>> > Additionally, the 3.7 upgrade notes have been merged in apache/kafka -
>> MINOR:
>> > Add 3.7 upgrade notes <https://github.com/apache/kafka/pull/15407/files
>> >.
>> >
>> > With that, I have opened a PR to move them to the kafka-site repository
>> -
>> > https://github.com/apache/kafka-site/pull/587. That is awaiting review.
>> >
>> > Similarly, the 3.7 blog post is ready for review again
>> > <https://github.com/apache/kafka-site/pull/578> and awaiting a review
>> on 37:
>> > Update default docs to point to the 3.7.0 release docs
>> > <https://github.com/apache/kafka-site/pull/582>
>> >
>> > I also have a WIP for fixing the 3.6 docs in the kafka-site repo
>> > <https://github.com/apache/kafka-site/pull/586>. This isn't really
>> related
>> > to the release, but it's good to do.
>> >
>> > On Wed, Feb 21, 2024 at 4:55 AM Luke Chen  wrote:
>> >
>> > > Hi all,
>> > >
>> > > I found there is a bug (KAFKA-16283
>> > > <https://issues.apache.org/jira/browse/KAFKA-16283>) in the built-in
>> > > `RoundRobinPartitioner`, and it will cause only half of the partitions
>> > > receiving the records. (I'm surprised our tests didn't catch it!?)
>> > > After further testing, I found this issue already existed in v3.0.0.
>> (I
>> > > didn't test 2.x versions)
>> > > I think this should not be a blocker to v3.7.0 since it's been there
>> for
>> > > years.
>> > > But I think we should, at least, document it to notify users about
>> this
>> > > known issue.
>> > > I've created 2 PRs to document it:
>> > > https://github.com/apache/kafka/pull/15400
>> > > https://github.com/apache/kafka-site/pull/585
>> > >
>> > > Let me know what you think.
>> > >
>> > > Thanks.
>> > > Luke
>> > >
>> > > On Wed, Feb 21, 2024 at 10:52 AM Proven Provenzano
>> > >  wrote:
>> > >
>> > > > HI,
>> > > >
>> > > > I've downloaded, built from source and then validated JBOD with
>> KRaft
>> > > works
>> > > > along with migrating a cluster with JBOD from ZK to KRaft works.
>> > > >
>> > > > +1 (nonbinding) from me.
>> > >

Re: [VOTE] 3.7.0 RC4

2024-02-26 Thread Stanislav Kozlovski
This vote passes with *10 +1 votes* (3 binding) and no 0 or -1 votes.

+1 votes

PMC Members (binding):
* Mickael Maison
* Justine Olshan
* Divij Vaidya

Community (non-binding):
* Proven Provenzano
* Federico Valeri
* Vedarth Sharma
* Andrew Schofield
* Paolo Patierno
* Jakub Scholz
* Josep Prat



0 votes

* No votes



-1 votes

* No votes



Vote thread:
https://lists.apache.org/thread/71djwz292y2lzgwzm7n6n8o7x56zbgh9

I'll continue with the release process and the release announcement will
follow ASAP.

Best,
Stanislav

On Sun, Feb 25, 2024 at 7:08 PM Mickael Maison 
wrote:

> Hi,
>
> Thanks for sorting out the docs issues.
> +1 (binding)
>
> Mickael
>
> On Fri, Feb 23, 2024 at 11:50 AM Stanislav Kozlovski
>  wrote:
> >
> > Some quick updates:
> >
> > There were some inconsistencies between the documentation in the
> > apache/kafka repo and the one in kafka-site. The process is such that the
> > apache/kafka docs are the source of truth, but we had a few divergences
> in
> > the other repo. I have worked on correcting those with:
> > - MINOR: Reconcile upgrade.html with kafka-site/36's version
> > <https://github.com/apache/kafka/pull/15406> and cherry-picked it into
> the
> > 3.6 and 3.7 branches too
> >
> > Additionally, the 3.7 upgrade notes have been merged in apache/kafka -
> MINOR:
> > Add 3.7 upgrade notes <https://github.com/apache/kafka/pull/15407/files
> >.
> >
> > With that, I have opened a PR to move them to the kafka-site repository -
> > https://github.com/apache/kafka-site/pull/587. That is awaiting review.
> >
> > Similarly, the 3.7 blog post is ready for review again
> > <https://github.com/apache/kafka-site/pull/578> and awaiting a review
> on 37:
> > Update default docs to point to the 3.7.0 release docs
> > <https://github.com/apache/kafka-site/pull/582>
> >
> > I also have a WIP for fixing the 3.6 docs in the kafka-site repo
> > <https://github.com/apache/kafka-site/pull/586>. This isn't really
> related
> > to the release, but it's good to do.
> >
> > On Wed, Feb 21, 2024 at 4:55 AM Luke Chen  wrote:
> >
> > > Hi all,
> > >
> > > I found there is a bug (KAFKA-16283
> > > <https://issues.apache.org/jira/browse/KAFKA-16283>) in the built-in
> > > `RoundRobinPartitioner`, and it will cause only half of the partitions
> > > receiving the records. (I'm surprised our tests didn't catch it!?)
> > > After further testing, I found this issue already existed in v3.0.0. (I
> > > didn't test 2.x versions)
> > > I think this should not be a blocker to v3.7.0 since it's been there
> for
> > > years.
> > > But I think we should, at least, document it to notify users about this
> > > known issue.
> > > I've created 2 PRs to document it:
> > > https://github.com/apache/kafka/pull/15400
> > > https://github.com/apache/kafka-site/pull/585
> > >
> > > Let me know what you think.
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Wed, Feb 21, 2024 at 10:52 AM Proven Provenzano
> > >  wrote:
> > >
> > > > HI,
> > > >
> > > > I've downloaded, built from source and then validated JBOD with KRaft
> > > works
> > > > along with migrating a cluster with JBOD from ZK to KRaft works.
> > > >
> > > > +1 (nonbinding) from me.
> > > >
> > > > --Proven
> > > >
> > > > On Tue, Feb 20, 2024 at 2:13 PM Justine Olshan
> > > > 
> > > > wrote:
> > > >
> > > > > Hey folks,
> > > > >
> > > > > I've done the following to validate the release:
> > > > >
> > > > > -- validated the keys for all the artifacts
> > > > > -- built from source and started a ZK cluster -- ran a few
> workloads on
> > > > it.
> > > > > -- ran 2.12 Kraft cluster and ran a few workloads on it
> > > > >
> > > > > I see there is a lot of ongoing discussion about the upgrade
> notes. +1
> > > > > (binding) from me given Mickael is voting +1 as well.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Tue, Feb 20, 2024 at 6:18 AM Divij Vaidya <
> divijvaidy...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > > I am a bit unclear on the precise process regarding what parts
> of
> > > > this
> > > > > > get merged at what time, and whether the release first needs to be done or not.

Re: [VOTE] 3.7.0 RC4

2024-02-23 Thread Stanislav Kozlovski
Some quick updates:

There were some inconsistencies between the documentation in the
apache/kafka repo and the one in kafka-site. The process is such that the
apache/kafka docs are the source of truth, but we had a few divergences in
the other repo. I have worked on correcting those with:
- MINOR: Reconcile upgrade.html with kafka-site/36's version
<https://github.com/apache/kafka/pull/15406> and cherry-picked it into the
3.6 and 3.7 branches too

Additionally, the 3.7 upgrade notes have been merged in apache/kafka - MINOR:
Add 3.7 upgrade notes <https://github.com/apache/kafka/pull/15407/files>.

With that, I have opened a PR to move them to the kafka-site repository -
https://github.com/apache/kafka-site/pull/587. That is awaiting review.

Similarly, the 3.7 blog post is ready for review again
<https://github.com/apache/kafka-site/pull/578>, and a review is also awaited on "37:
Update default docs to point to the 3.7.0 release docs"
<https://github.com/apache/kafka-site/pull/582>.

I also have a WIP for fixing the 3.6 docs in the kafka-site repo
<https://github.com/apache/kafka-site/pull/586>. This isn't really related
to the release, but it's good to do.
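
For reference, carrying such a docs fix across release branches is just a pair
of cherry-picks (a sketch; the commit hash is a placeholder for the merged
trunk commit):

```bash
# Sketch: cherry-pick the upgrade.html fix from trunk into the release branches.
git checkout 3.7 && git cherry-pick <sha-of-upgrade-html-fix> && git push origin 3.7
git checkout 3.6 && git cherry-pick <sha-of-upgrade-html-fix> && git push origin 3.6
```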

On Wed, Feb 21, 2024 at 4:55 AM Luke Chen  wrote:

> Hi all,
>
> I found there is a bug (KAFKA-16283
> <https://issues.apache.org/jira/browse/KAFKA-16283>) in the built-in
> `RoundRobinPartitioner`, and it will cause only half of the partitions
> receiving the records. (I'm surprised our tests didn't catch it!?)
> After further testing, I found this issue already existed in v3.0.0. (I
> didn't test 2.x versions)
> I think this should not be a blocker to v3.7.0 since it's been there for
> years.
> But I think we should, at least, document it to notify users about this
> known issue.
> I've created 2 PRs to document it:
> https://github.com/apache/kafka/pull/15400
> https://github.com/apache/kafka-site/pull/585
>
> Let me know what you think.
>
> Thanks.
> Luke
>
> On Wed, Feb 21, 2024 at 10:52 AM Proven Provenzano
>  wrote:
>
> > HI,
> >
> > I've downloaded, built from source and then validated JBOD with KRaft
> works
> > along with migrating a cluster with JBOD from ZK to KRaft works.
> >
> > +1 (nonbinding) from me.
> >
> > --Proven
> >
> > On Tue, Feb 20, 2024 at 2:13 PM Justine Olshan
> > 
> > wrote:
> >
> > > Hey folks,
> > >
> > > I've done the following to validate the release:
> > >
> > > -- validated the keys for all the artifacts
> > > -- built from source and started a ZK cluster -- ran a few workloads on
> > it.
> > > -- ran 2.12 Kraft cluster and ran a few workloads on it
> > >
> > > I see there is a lot of ongoing discussion about the upgrade notes. +1
> > > (binding) from me given Mickael is voting +1 as well.
> > >
> > > Justine
> > >
> > > On Tue, Feb 20, 2024 at 6:18 AM Divij Vaidya 
> > > wrote:
> > >
> > > > > I am a bit unclear on the precise process regarding what parts of
> > this
> > > > get merged at what time, and whether the release first needs to be
> done
> > > or
> > > > not.
> > > >
> > > > The order is as follows:
> > > >
> > > > 1. Release approved as part of this vote. After this we follow the
> > > > steps from here:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Release+Process#ReleaseProcess-Afterthevotepasses
> > > >
> > > > 2. Upload artifacts to maven etc. These artifacts do not have RC
> suffix
> > > in
> > > > them. You need a PMC member to mark these artifacts as "production"
> in
> > > > apache svn.
> > > > 3. Update website changes (docs, blog etc.). This is where your PRs
> > > > on kafka-site repo get merged.
> > > > 4. Send a release announcement by email.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Tue, Feb 20, 2024 at 3:02 PM Stanislav Kozlovski
> > > >  wrote:
> > > >
> > > > > Thanks for testing the release! And thanks for the review on the
> > > > > documentation. Good catch on the license too.
> > > > >
> > > > > I have addressed the comments in the blog PR, and opened a few
> other
> > > PRs
> > > > to
> > > > > the website in relation to the release.
> > > > >
> > > > > - 37: Add download section for the latest 3.7

Re: [VOTE] 3.7.0 RC4

2024-02-20 Thread Stanislav Kozlovski
Thanks for testing the release! And thanks for the review on the
documentation. Good catch on the license too.

I have addressed the comments in the blog PR, and opened a few other PRs to
the website in relation to the release.

- 37: Add download section for the latest 3.7 release
<https://github.com/apache/kafka-site/pull/583/files>
- 37: Update default docs to point to the 3.7.0 release docs
<https://github.com/apache/kafka-site/pull/582>
- 3.7: Add blog post for Kafka 3.7
<https://github.com/apache/kafka-site/pull/578>
- MINOR: Update stale upgrade_3_6_0 header links in documentation
<https://github.com/apache/kafka-site/pull/580>
- 37: Add upgrade notes for the 3.7.0 release
<https://github.com/apache/kafka-site/pull/581>

I am a bit unclear on the precise process regarding what parts of this get
merged at what time, and whether the release first needs to be done or not.

Best,
Stanislav

On Mon, Feb 19, 2024 at 8:34 PM Divij Vaidya 
wrote:

> Great. In that case we can fix the license issue retrospectively. I have
> created a JIRA for it https://issues.apache.org/jira/browse/KAFKA-16278
> and
> also updated the release process (which redirects to
> https://issues.apache.org/jira/browse/KAFKA-12622) to check for the
> correct
> license in both the kafka binaries.
>
> I am +1 (binding) assuming Mickael's concerns about update notes to 3.7 are
> addressed before release.
>
> --
> Divij Vaidya
>
>
>
> On Mon, Feb 19, 2024 at 6:08 PM Mickael Maison 
> wrote:
>
> > Hi,
> >
> > I agree with Josep, I don't think it's worth making a new RC just for
> this.
> >
> > Thanks Stanislav for sharing the test results. The last thing holding
> > me from casting my vote is the missing upgrade notes for 3.7.0.
> >
> > Thanks,
> > Mickael
> >
> >
> >
> > On Mon, Feb 19, 2024 at 4:28 PM Josep Prat 
> > wrote:
> > >
> > > I think I remember finding a similar problem (NOTICE_binary) and it
> > didn't
> > > qualify for an extra RC
> > >
> > > Best,
> > >
> > > On Mon, Feb 19, 2024 at 3:44 PM Divij Vaidya 
> > > wrote:
> > >
> > > > I have performed the following checks. The only thing I would like to
> > call
> > > > out is the missing licenses before providing a vote. How do we want
> > > > to proceed on this? What have we done in the past? (Creating a new RC
> > is
> > > > overkill IMO for this license issue).
> > > >
> > > > ## License check
> > > >
> > > > Test: Validate license of dependencies for both 2.12 & 2.13 binary.
> > > > Result: Missing license for some scala* libraries specifically for
> > 2.12.
> > > > Seems like we have been missing these licenses for quite some version
> > now.
> > > >
> > > > ```
> > > > for f in $(ls libs | grep -v "^kafka\|connect\|trogdor"); do if !
> grep
> > -q
> > > > ${f%.*} LICENSE; then echo "${f%.*} is missing in license file"; fi;
> > done
> > > > scala-collection-compat_2.12-2.10.0 is missing in license file
> > > > scala-java8-compat_2.12-1.0.2 is missing in license file
> > > > scala-library-2.12.18 is missing in license file
> > > > scala-logging_2.12-3.9.4 is missing in license file
> > > > scala-reflect-2.12.18 is missing in license file
> > > > ```
> > > >
> > > > ## Long running tests for memory leak (on ARM machine with zstd)
> > > >
> > > > Test: Run produce/consume for a few hours and verify no gradual
> > increase in
> > > > heap.
> > > > Result: No heap increase observed. The overall CPU utilization is
> lower
> > > > compared to 3.5.1.
> > > >
> > > > ## Verify system test results
> > > >
> > > > Test: Spot check the results of system tests.
> > > > Result: I have verified that the system tests are passing across
> > different
> > > > runs.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Sun, Feb 18, 2024 at 2:50 PM Stanislav Kozlovski
> > > >  wrote:
> > > >
> > > > > The latest system test build completed successfully -
> > > > >
> > > > >
> > > >
> >
> https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708250728--apache--3.7--02197edaaa/2024-02-18--001./2024

Re: [VOTE] 3.7.0 RC4

2024-02-18 Thread Stanislav Kozlovski
The latest system test build completed successfully -
https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708250728--apache--3.7--02197edaaa/2024-02-18--001./2024-02-18--001./report.html

*System tests are therefore all good*. We just have some flakes

On Sun, Feb 18, 2024 at 10:45 AM Stanislav Kozlovski 
wrote:

> The upgrade test passed ->
> https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708103771--apache--3.7--bb6990114b/2024-02-16--001./2024-02-16--001./report.html
>
> The replica verification test succeeded in ZK mode, but failed in
> ISOLATED_KRAFT. It just seems to be very flaky.
> https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708100119--apache--3.7--bb6990114b/2024-02-16--001./2024-02-16--001./report.html
>
> Scheduling another run in
> https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6062/
>
> On Fri, Feb 16, 2024 at 6:39 PM Stanislav Kozlovski <
> stanis...@confluent.io> wrote:
>
>> Thanks all for the help in verifying.
>>
>> I have updated
>> https://gist.github.com/stanislavkozlovski/820976fc7bfb5f4dcdf9742fd96a9982
>> with the system tests.
>> There were two builds ran, and across those - the following tests failed
>> two times in a row:
>>
>>
>> *kafkatest.tests.tools.replica_verification_test.ReplicaVerificationToolTest#test_replica_lagsArguments:{
>> "metadata_quorum": "ZK"}*Fails with the same error of
>> *`TimeoutError('Timed out waiting to reach non-zero number of replica
>> lags.')`*
>> I have scheduled a re-run of this specific test here ->
>> https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6057
>>
>> *kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgradeArguments:{
>> "compression_types": [ "zstd" ], "from_kafka_version": "2.4.1",
>> "to_message_format_version": null}*
>> Fails with the same error of
>> *`TimeoutError('Producer failed to produce messages for 20s.')`*
>> *kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgradeArguments:{
>> "compression_types": [ "lz4" ], "from_kafka_version": "3.0.2",
>> "to_message_format_version": null}*
>> Fails with the same error of *`TimeoutError('Producer failed to produce
>> messages for 20s.')`*
>>
>> I have scheduled a re-run of this test here ->
>> https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6058/
>>
>> On Fri, Feb 16, 2024 at 12:15 PM Vedarth Sharma 
>> wrote:
>>
>>> Hey Stanislav,
>>>
>>> Thanks for the release candidate.
>>>
>>> +1 (non-binding)
>>>
>>> I tested and verified the docker image artifact apache/kafka:3.7.0-rc4:-
>>> - verified create topic, produce messages and consume messages flow when
>>> running the docker image with
>>> - default configs
>>> - configs provided via env variables
>>> - configs provided via file input
>>> - verified the html documentation for docker image.
>>> - ran the example docker compose files successfully.
>>>
>>> All looks good for the docker image artifact!
>>>
>>> Thanks and regards,
>>> Vedarth
>>>
>>>
>>> On Thu, Feb 15, 2024 at 10:58 PM Mickael Maison <
>>> mickael.mai...@gmail.com>
>>> wrote:
>>>
>>> > Hi Stanislav,
>>> >
>>> > Thanks for running the release.
>>> >
>>> > I did the following testing:
>>> > - verified the check sums and signatures
>>> > - ran ZooKeeper and KRaft quickstarts with Scala 2.13 binaries
>>> > - ran a successful migration from ZooKeeper to KRaft
>>> >
>>> > We seem to be missing the upgrade notes for 3.7.0 in the docs. See
>>> > https://kafka.apache.org/37/documentation.html#upgrade that still
>>> > points to 3.6.0
>>> > Before voting I'd like to see results from the system tests too.
>>> >
>>> > Thanks,
>>> > Mickael
>>> >
>>> > On Thu, Feb 15, 2024 at 6:06 PM Andrew Schofield
>>> >  wrote:
>>> > >
>>> > > +1 (non-binding). I used the staged binaries with Scala 2.13. I tried
>>> > the new group coordinator
>>> > > and consumer group protocol which is included with the Early Access
>>> >

Re: [VOTE] 3.7.0 RC4

2024-02-18 Thread Stanislav Kozlovski
The upgrade test passed ->
https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708103771--apache--3.7--bb6990114b/2024-02-16--001./2024-02-16--001./report.html

The replica verification test succeeded in ZK mode, but failed in
ISOLATED_KRAFT. It just seems to be very flaky.
https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1708100119--apache--3.7--bb6990114b/2024-02-16--001./2024-02-16--001./report.html

Scheduling another run in
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6062/
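
For anyone who wants to reproduce the flaky case locally rather than on
Jenkins, the ducktape Docker runner can target just that test (a sketch; the
class/method TC_PATHS syntax is assumed from the system-test README, and the
ISOLATED_KRAFT variant can be narrowed further with ducktape's --parameters
option):

```bash
# Sketch: re-run only the replica verification system test via the Docker runner.
TC_PATHS="tests/kafkatest/tests/tools/replica_verification_test.py::ReplicaVerificationToolTest.test_replica_lags" \
  bash tests/docker/run_tests.sh
```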

On Fri, Feb 16, 2024 at 6:39 PM Stanislav Kozlovski 
wrote:

> Thanks all for the help in verifying.
>
> I have updated
> https://gist.github.com/stanislavkozlovski/820976fc7bfb5f4dcdf9742fd96a9982
> with the system tests.
> There were two builds ran, and across those - the following tests failed
> two times in a row:
>
>
> *kafkatest.tests.tools.replica_verification_test.ReplicaVerificationToolTest#test_replica_lagsArguments:{
> "metadata_quorum": "ZK"}*Fails with the same error of
> *`TimeoutError('Timed out waiting to reach non-zero number of replica
> lags.')`*
> I have scheduled a re-run of this specific test here ->
> https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6057
>
> *kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgradeArguments:{
> "compression_types": [ "zstd" ], "from_kafka_version": "2.4.1",
> "to_message_format_version": null}*
> Fails with the same error of
> *`TimeoutError('Producer failed to produce messages for 20s.')`*
> *kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgradeArguments:{
> "compression_types": [ "lz4" ], "from_kafka_version": "3.0.2",
> "to_message_format_version": null}*
> Fails with the same error of *`TimeoutError('Producer failed to produce
> messages for 20s.')`*
>
> I have scheduled a re-run of this test here ->
> https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6058/
>
> On Fri, Feb 16, 2024 at 12:15 PM Vedarth Sharma 
> wrote:
>
>> Hey Stanislav,
>>
>> Thanks for the release candidate.
>>
>> +1 (non-binding)
>>
>> I tested and verified the docker image artifact apache/kafka:3.7.0-rc4:-
>> - verified create topic, produce messages and consume messages flow when
>> running the docker image with
>> - default configs
>> - configs provided via env variables
>> - configs provided via file input
>> - verified the html documentation for docker image.
>> - ran the example docker compose files successfully.
>>
>> All looks good for the docker image artifact!
>>
>> Thanks and regards,
>> Vedarth
>>
>>
>> On Thu, Feb 15, 2024 at 10:58 PM Mickael Maison > >
>> wrote:
>>
>> > Hi Stanislav,
>> >
>> > Thanks for running the release.
>> >
>> > I did the following testing:
>> > - verified the check sums and signatures
>> > - ran ZooKeeper and KRaft quickstarts with Scala 2.13 binaries
>> > - ran a successful migration from ZooKeeper to KRaft
>> >
>> > We seem to be missing the upgrade notes for 3.7.0 in the docs. See
>> > https://kafka.apache.org/37/documentation.html#upgrade that still
>> > points to 3.6.0
>> > Before voting I'd like to see results from the system tests too.
>> >
>> > Thanks,
>> > Mickael
>> >
>> > On Thu, Feb 15, 2024 at 6:06 PM Andrew Schofield
>> >  wrote:
>> > >
>> > > +1 (non-binding). I used the staged binaries with Scala 2.13. I tried
>> > the new group coordinator
>> > > and consumer group protocol which is included with the Early Access
>> > release of KIP-848.
>> > > Also verified the availability of the new APIs. All working as
>> expected.
>> > >
>> > > Thanks,
>> > > Andrew
>> > >
>> > > > On 15 Feb 2024, at 05:07, Paolo Patierno 
>> > wrote:
>> > > >
>> > > > +1 (non-binding). I used the staged binaries with Scala 2.13 and
>> mostly
>> > > > focused on the ZooKeeper to KRaft migration with multiple tests.
>> > Everything
>> > > > works fine.
>> > > >
>> > > > Thanks
>> > > > Paolo
>> > > >
>> > > > On Mon, 12 Feb 2024, 22:06 Jakub Scholz,  wrote:
>> > > >
>> > > >> +1 (non-binding).

Re: [VOTE] 3.7.0 RC4

2024-02-16 Thread Stanislav Kozlovski
Thanks all for the help in verifying.

I have updated
https://gist.github.com/stanislavkozlovski/820976fc7bfb5f4dcdf9742fd96a9982
with the system tests.
Two builds were run, and across those the following tests failed twice in a
row:


*kafkatest.tests.tools.replica_verification_test.ReplicaVerificationToolTest#test_replica_lags*
Arguments: { "metadata_quorum": "ZK" }
Fails with the same error of `TimeoutError('Timed out waiting to reach non-zero number of replica lags.')`
I have scheduled a re-run of this specific test here ->
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6057

*kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgrade*
Arguments: { "compression_types": [ "zstd" ], "from_kafka_version": "2.4.1", "to_message_format_version": null }
Fails with the same error of `TimeoutError('Producer failed to produce messages for 20s.')`

*kafkatest.tests.core.upgrade_test.TestUpgrade#test_upgrade*
Arguments: { "compression_types": [ "lz4" ], "from_kafka_version": "3.0.2", "to_message_format_version": null }
Fails with the same error of `TimeoutError('Producer failed to produce messages for 20s.')`

I have scheduled a re-run of this test here ->
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/6058/

On Fri, Feb 16, 2024 at 12:15 PM Vedarth Sharma 
wrote:

> Hey Stanislav,
>
> Thanks for the release candidate.
>
> +1 (non-binding)
>
> I tested and verified the docker image artifact apache/kafka:3.7.0-rc4:-
> - verified create topic, produce messages and consume messages flow when
> running the docker image with
> - default configs
> - configs provided via env variables
> - configs provided via file input
> - verified the html documentation for docker image.
> - ran the example docker compose files successfully.
>
> All looks good for the docker image artifact!
>
> Thanks and regards,
> Vedarth
>
>
> On Thu, Feb 15, 2024 at 10:58 PM Mickael Maison 
> wrote:
>
> > Hi Stanislav,
> >
> > Thanks for running the release.
> >
> > I did the following testing:
> > - verified the check sums and signatures
> > - ran ZooKeeper and KRaft quickstarts with Scala 2.13 binaries
> > - ran a successful migration from ZooKeeper to KRaft
> >
> > We seem to be missing the upgrade notes for 3.7.0 in the docs. See
> > https://kafka.apache.org/37/documentation.html#upgrade that still
> > points to 3.6.0
> > Before voting I'd like to see results from the system tests too.
> >
> > Thanks,
> > Mickael
> >
> > On Thu, Feb 15, 2024 at 6:06 PM Andrew Schofield
> >  wrote:
> > >
> > > +1 (non-binding). I used the staged binaries with Scala 2.13. I tried
> > the new group coordinator
> > > and consumer group protocol which is included with the Early Access
> > release of KIP-848.
> > > Also verified the availability of the new APIs. All working as
> expected.
> > >
> > > Thanks,
> > > Andrew
> > >
> > > > On 15 Feb 2024, at 05:07, Paolo Patierno 
> > wrote:
> > > >
> > > > +1 (non-binding). I used the staged binaries with Scala 2.13 and
> mostly
> > > > focused on the ZooKeeper to KRaft migration with multiple tests.
> > Everything
> > > > works fine.
> > > >
> > > > Thanks
> > > > Paolo
> > > >
> > > > On Mon, 12 Feb 2024, 22:06 Jakub Scholz,  wrote:
> > > >
> > > >> +1 (non-binding). I used the staged binaries with Scala 2.13 and the
> > staged
> > > >> Maven artifacts to run my tests. All seems to work fine. Thanks.
> > > >>
> > > >> Jakub
> > > >>
> > > >> On Fri, Feb 9, 2024 at 4:20 PM Stanislav Kozlovski
> > > >>  wrote:
> > > >>
> > > >>> Hello Kafka users, developers and client-developers,
> > > >>>
> > > >>> This is the second candidate we are considering for release of
> Apache
> > > >> Kafka
> > > >>> 3.7.0.
> > > >>>
> > > >>> Major changes include:
> > > >>> - Early Access to KIP-848 - the next generation of the consumer
> > rebalance
> > > >>> protocol
> > > >>> - Early Access to KIP-858: Adding JBOD support to KRaft
> > > >>> - KIP-714: Observability into Client metrics via a standardized
> > interface
> > > >>>
> > > >>> Release notes for the 3.7.0 release:
> 

Re: [VOTE] 3.7.0 RC4

2024-02-14 Thread Stanislav Kozlovski
Hey Justine,

Good question. I have been running the system tests and I have been
preparing a doc to share. I got one run completed and was waiting for a
second one to cross-check, but unfortunately I had a build take 35+ hours
and canceled it. I'm currently 6h into the second one, expecting it to
complete in 11h or so.

There were quite a few failures, so I was looking forward to getting another
run in to cross-check the flakes:
-
https://gist.githubusercontent.com/stanislavkozlovski/3a97fc7602f3fee40cb374f1343d46b6/raw/b9de6e0eb975e8234d43bc725982096e47fd0457/rc4_test_failures_system_test_run_1.md

I also checked the integration tests and saw
- kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll
- kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll
failed three times in a row.

Here is the gist with the results:
https://gist.github.com/stanislavkozlovski/820976fc7bfb5f4dcdf9742fd96a9982
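
For reference, the failing integration test class can be re-run in isolation
with the standard Gradle invocation (a sketch; --rerun-tasks forces
re-execution when checking for flakiness):

```bash
# Sketch: run the failing integration test class on its own, several times.
./gradlew core:test --tests kafka.server.LogDirFailureTest
./gradlew core:test --tests kafka.server.LogDirFailureTest --rerun-tasks
```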



On Wed, Feb 14, 2024 at 6:31 PM Justine Olshan 
wrote:

> Hey Stan,
> Did we ever get system tests results? I can also start a run
>
> On Mon, Feb 12, 2024 at 1:06 PM Jakub Scholz  wrote:
>
> > +1 (non-binding). I used the staged binaries with Scala 2.13 and the
> staged
> > Maven artifacts to run my tests. All seems to work fine. Thanks.
> >
> > Jakub
> >
> > On Fri, Feb 9, 2024 at 4:20 PM Stanislav Kozlovski
> >  wrote:
> >
> > > Hello Kafka users, developers and client-developers,
> > >
> > > This is the second candidate we are considering for release of Apache
> > Kafka
> > > 3.7.0.
> > >
> > > Major changes include:
> > > - Early Access to KIP-848 - the next generation of the consumer
> rebalance
> > > protocol
> > > - Early Access to KIP-858: Adding JBOD support to KRaft
> > > - KIP-714: Observability into Client metrics via a standardized
> interface
> > >
> > > Release notes for the 3.7.0 release:
> > >
> > >
> >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/RELEASE_NOTES.html
> > >
> > > *** Please download, test and vote by Thursday, February 15th, 9AM PST
> > ***
> > >
> > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > https://kafka.apache.org/KEYS
> > >
> > > * Release artifacts to be voted upon (source and binary):
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/
> > >
> > > * Docker release artifact to be voted upon:
> > > apache/kafka:3.7.0-rc4
> > >
> > > * Maven artifacts to be voted upon:
> > > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > >
> > > * Javadoc:
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/javadoc/
> > >
> > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > > https://github.com/apache/kafka/releases/tag/3.7.0-rc4
> > >
> > > * Documentation:
> > > https://kafka.apache.org/37/documentation.html
> > >
> > > * Protocol:
> > > https://kafka.apache.org/37/protocol.html
> > >
> > > * Successful Jenkins builds for the 3.7 branch:
> > >
> > > Unit/integration tests: I am in the process of running and analyzing
> > these.
> > > System tests: I am in the process of running these.
> > >
> > > Expect a follow-up over the weekend
> > >
> > > * Successful Docker Image Github Actions Pipeline for 3.7 branch:
> > > Docker Build Test Pipeline:
> > > https://github.com/apache/kafka/actions/runs/7845614846
> > >
> > > /**
> > >
> > > Best,
> > > Stanislav
> > >
> >
>


-- 
Best,
Stanislav


Re: Apache Kafka 3.7.0 Release

2024-02-12 Thread Stanislav Kozlovski
Hey Divij, that is a good point regarding KIP-848.

David, as the author of the KIP, would you be able to drive this?

Similarly, would anybody be willing to drive such an EA Release Note for
the JBOD feature?

Mayank, this *doesn't* seem like a blocker to me, given the complexity, the
public API changes, and the fact that it's for high partition counts (a
definition here would help, though).

Reminder - RC4 is out for vote right now.

Best,
Stanislav

On Tue, Feb 6, 2024 at 5:25 PM Mayank Shekhar Narula <
mayanks.nar...@gmail.com> wrote:

> Hi Folks
>
> KIP-951 was delivered fully in AK 3.7. Its 1st optimisation was delivered
> in 3.6.1, to skip backoff period for a produce batch being retried to new
> leader i.e. KAFKA-15415.
>
> KAFKA-15415 current implementation introduced a performance regression, by
> increasing synchronization on the produce path, especially for high
> partition counts. The description section of
> https://issues.apache.org/jira/browse/KAFKA-16226 goes more into details
> of
> the regression.
>
> I have put up a fix https://github.com/apache/kafka/pull/15323, which
> removes this synchronization. The fix adds a new public method to
> Cluster.java, and a public constructor to PartitionInfo.java.
>
> Is this a blocker for v3.7.0?
>
> PS - Posted in KIP-951's voting thread as well
> <https://lists.apache.org/thread/otxt5wr7cj4qx4v3zg05gclry0vrdvh8>.
>
>
> On Fri, Feb 2, 2024 at 3:58 PM Divij Vaidya 
> wrote:
>
> > Hey folks
> >
> > The release plan for 3.7.0 [1] calls out KIP 848 as "Targeting a Preview
> in
> > 3.7".
> >
> > Is that still true? If yes, then we should perhaps add that in the blog,
> > call it out in the release notes and prepare a preview document similar
> to
> > what we did for Tiered Storage Early Access release[2]
> >
> > If not true, then we should update the release notes to reflect the
> current
> > state of the KIP.
> >
> > (I think the same is true for other KIPs like KIP-963)
> >
> > [1] https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Tiered+Storage+Early+Access+Release+Notes
> >
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Thu, Jan 11, 2024 at 1:03 PM Luke Chen  wrote:
> >
> > > Hi all,
> > >
> > > There is a bug KAFKA-16101
> > > <https://issues.apache.org/jira/browse/KAFKA-16101> reporting that
> > "Kafka
> > > cluster will be unavailable during KRaft migration rollback".
> > > The impact for this issue is that if brokers try to rollback to ZK mode
> > > during KRaft migration process, there will be a period of time the
> > cluster
> > > is unavailable.
> > > Since ZK migrating to KRaft feature is a production ready feature, I
> > think
> > > this should be addressed soon.
> > > Do you think this is a blocker for v3.7.0?
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Thu, Jan 11, 2024 at 6:11 AM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Thanks Colin,
> > > >
> > > > With that, I believe we are out of blockers. I was traveling today
> and
> > > > couldn't build an RC - expect one to be published tomorrow (barring
> any
> > > > problems).
> > > >
> > > > In the meanwhile - here is a PR for the 3.7 blog post -
> > > > https://github.com/apache/kafka-site/pull/578
> > > >
> > > > Best,
> > > > Stan
> > > >
> > > > On Wed, Jan 10, 2024 at 12:06 AM Colin McCabe 
> > > wrote:
> > > >
> > > > > KAFKA-16094 has been fixed and backported to 3.7.
> > > > >
> > > > > Colin
> > > > >
> > > > >
> > > > > On Mon, Jan 8, 2024, at 14:52, Colin McCabe wrote:
> > > > > > On an unrelated note, I found a blocker bug related to upgrades
> > from
> > > > > > 3.6 (and earlier) to 3.7.
> > > > > >
> > > > > > The JIRA is here:
> > > > > >   https://issues.apache.org/jira/browse/KAFKA-16094
> > > > > >
> > > > > > Fix here:
> > > > > >   https://github.com/apache/kafka/pull/15153
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > >
> > > > > > On Mon, Jan 8, 2024, at 14:47, Colin McCabe wrote:

[VOTE] 3.7.0 RC4

2024-02-09 Thread Stanislav Kozlovski
Hello Kafka users, developers and client-developers,

This is the second candidate we are considering for release of Apache Kafka
3.7.0.

Major changes include:
- Early Access to KIP-848 - the next generation of the consumer rebalance
protocol
- Early Access to KIP-858: Adding JBOD support to KRaft
- KIP-714: Observability into Client metrics via a standardized interface

Release notes for the 3.7.0 release:
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/RELEASE_NOTES.html

*** Please download, test and vote by Thursday, February 15th, 9AM PST ***

Kafka's KEYS file containing PGP keys we use to sign the release:
https://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/
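
A minimal verification pass over these artifacts might look like the following
(a sketch; the tarball name is taken from the RC directory listing, and the
digest from the .sha512 file is compared manually):

```bash
# Sketch: verify the signature and checksum of the staged Scala 2.13 tarball.
BASE=https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4
curl -O "$BASE/kafka_2.13-3.7.0.tgz" -O "$BASE/kafka_2.13-3.7.0.tgz.asc" -O "$BASE/kafka_2.13-3.7.0.tgz.sha512"
curl -s https://kafka.apache.org/KEYS | gpg --import
gpg --verify kafka_2.13-3.7.0.tgz.asc kafka_2.13-3.7.0.tgz
sha512sum kafka_2.13-3.7.0.tgz   # compare against the contents of the .sha512 file
```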

* Docker release artifact to be voted upon:
apache/kafka:3.7.0-rc4

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/org/apache/kafka/
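
To pull the staged Maven artifacts into a test build, one option is resolving
them straight from the staging group (a sketch using the Maven dependency
plugin; a Gradle repository entry pointing at the same URL works just as well):

```bash
# Sketch: resolve the staged kafka-clients artifact from the Apache staging repository.
mvn dependency:get \
  -DremoteRepositories=https://repository.apache.org/content/groups/staging/ \
  -Dartifact=org.apache.kafka:kafka-clients:3.7.0
```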

* Javadoc:
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc4/javadoc/

* Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
https://github.com/apache/kafka/releases/tag/3.7.0-rc4

* Documentation:
https://kafka.apache.org/37/documentation.html

* Protocol:
https://kafka.apache.org/37/protocol.html

* Successful Jenkins builds for the 3.7 branch:

Unit/integration tests: I am in the process of running and analyzing these.
System tests: I am in the process of running these.

Expect a follow-up over the weekend

* Successful Docker Image Github Actions Pipeline for 3.7 branch:
Docker Build Test Pipeline:
https://github.com/apache/kafka/actions/runs/7845614846

/**

Best,
Stanislav


Re: [VOTE] 3.7.0 RC2

2024-02-05 Thread Stanislav Kozlovski
Thanks Mickael, sounds good.

KAFKA-16195 and KAFKA-16157 were both merged!

I was made aware of one final blocker, this time for streams - KAFKA-16221.
Matthias was prompt with a short hotfix PR:
https://github.com/apache/kafka/pull/15315

After that goes into 3.7, I think I will be free to build the next RC.
Great work!

On Fri, Feb 2, 2024 at 6:43 PM Mickael Maison 
wrote:

> Hi Stanislav,
>
> I merged https://github.com/apache/kafka/pull/15308 in trunk. I let
> you cherry-pick it to 3.7.
>
> I think fixing the absolute show stoppers and calling JBOD support in
> KRaft early access in 3.7.0 is probably the right call. Even without
> the bugs we found, there's still quite a few JBOD follow up work to do
> (KAFKA-16061) + system tests and documentation updates.
>
> Thanks,
> Mickael
>
> On Fri, Feb 2, 2024 at 4:49 PM Stanislav Kozlovski
>  wrote:
> >
> > Thanks for the work everybody. Providing a status update at the end of
> the
> > week:
> >
> > - docs change explaining migration
> > <https://github.com/apache/kafka/pull/15193> was merged
> > - the blocker KAFKA-16162 <https://github.com/apache/kafka/pull/15270>
> was
> > merged
> > - the blocker KAFKA-14616 <https://github.com/apache/kafka/pull/15230>
> was
> > merged
> > - a small blocker problem with the shadow jar plugin
> > <https://github.com/apache/kafka/pull/15308>
> > - the blockers KAFKALESS-16157 & KAFKALESS-16195 aren't merged
> > - the good-to-have KAFKA-16082 isn't merged
> >
> > I think we should prioritize merging KAFKALESS-16195 and *call JBOD EA*.
> I
> > question whether we may find more blocker bugs in the next RC.
> > The release is late by approximately a month so far, so I do want to
> scope
> > down aggressively to meet the time-based goal.
> >
> > Best,
> > Stanislav
> >
> > On Mon, Jan 29, 2024 at 5:46 PM Omnia Ibrahim 
> > wrote:
> >
> > > Hi Stan and Gaurav,
> > > Just to clarify some points mentioned here before
> > >  KAFKA-14616: I raised a year ago so it's not related to JBOD work. It
> is
> > > rather a blocker bug for KRAFT in general. The PR from Colin should fix
> > > this. Am not sure if it is a blocker for 3.7 per-say as it was a major
> bug
> > > since 3.3 and got missed from all other releases.
> > >
> > > Regarding the JBOD work:
> > > KAFKA-16082: This is not a blocker for 3.7; it is a nice fix instead. The PR
> > > https://github.com/apache/kafka/pull/15136 is quite a small one and was
> > > approved by Proven and me, but it is waiting for a committer's approval.
> > > KAFKA-16162: This is a blocker for 3.7. Again, it's a small PR,
> > > https://github.com/apache/kafka/pull/15270; it is approved by Proven and
> > > me, and the PR is waiting for a committer's approval.
> > > KAFKA-16157: This is a blocker for 3.7. There is one small suggestion on
> > > the PR https://github.com/apache/kafka/pull/15263, but I don't think any
> > > of the current feedback is blocking the PR from getting approved, assuming
> > > we get a committer's approval on it.
> > > KAFKA-16195: Again, it's a blocker, but it has approval from Proven and me,
> > > and we are waiting for a committer's approval on the PR
> > > https://github.com/apache/kafka/pull/15262.
> > >
> > > If we can't get a committer's approval for KAFKA-16162, KAFKA-16157 and
> > > KAFKA-16195 in time for 3.7, then we can mark JBOD as early access,
> > > assuming we merge at least KAFKA-16195.
> > >
> > > Regards,
> > > Omnia
> > >
> > > > On 26 Jan 2024, at 15:39, ka...@gnarula.com wrote:
> > > >
> > > > Apologies, I mentioned KAFKA-16157 twice in my previous message. I
> > > > intended to mention KAFKA-16195 with the PR at
> > > > https://github.com/apache/kafka/pull/15262 as the second JIRA.
> > > >
> > > > Thanks,
> > > > Gaurav
> > > >
> > > >> On 26 Jan 2024, at 15:34, ka...@gnarula.com wrote:
> > > >>
> > > >> Hi Stan,
> > > >>
> > > >> I wanted to share some updates about the bugs you shared earlier.
> > > >>
> > > >> - KAFKA-14616: I've reviewed and tested the PR from Colin and have
> > > observed
> > > >> the fix works as intended.
> > > >> - KAFKA-16162: I reviewed Proven's PR and found some gaps in the
> > > proposed fix. I've
> > > >> therefore raised http

Re: [VOTE] 3.7.0 RC2

2024-02-02 Thread Stanislav Kozlovski
Thanks for the work everybody. Providing a status update at the end of the
week:

- docs change explaining migration
<https://github.com/apache/kafka/pull/15193> was merged
- the blocker KAFKA-16162 <https://github.com/apache/kafka/pull/15270> was
merged
- the blocker KAFKA-14616 <https://github.com/apache/kafka/pull/15230> was
merged
- a small blocker problem with the shadow jar plugin
<https://github.com/apache/kafka/pull/15308>
- the blockers KAFKA-16157 & KAFKA-16195 aren't merged
- the good-to-have KAFKA-16082 isn't merged

I think we should prioritize merging KAFKA-16195 and *call JBOD EA*. I
question whether we may find more blocker bugs in the next RC.
The release is late by approximately a month so far, so I do want to scope
down aggressively to meet the time-based goal.

Best,
Stanislav

On Mon, Jan 29, 2024 at 5:46 PM Omnia Ibrahim 
wrote:

> Hi Stan and Gaurav,
> Just to clarify some points mentioned here before
>  KAFKA-14616: I raised this a year ago, so it's not related to the JBOD work.
> It is rather a blocker bug for KRaft in general. The PR from Colin should fix
> this. I am not sure if it is a blocker for 3.7 per se, as it was a major bug
> since 3.3 and got missed in all other releases.
>
> Regarding the JBOD work:
> KAFKA-16082: This is not a blocker for 3.7; it is a nice fix instead. The PR
> https://github.com/apache/kafka/pull/15136 is quite a small one and was
> approved by Proven and me, but it is waiting for a committer's approval.
> KAFKA-16162: This is a blocker for 3.7. Again, it's a small PR,
> https://github.com/apache/kafka/pull/15270; it is approved by Proven and
> me, and the PR is waiting for a committer's approval.
> KAFKA-16157: This is a blocker for 3.7. There is one small suggestion on
> the PR https://github.com/apache/kafka/pull/15263, but I don't think any
> of the current feedback is blocking the PR from getting approved, assuming
> we get a committer's approval on it.
> KAFKA-16195: Again, it's a blocker, but it has approval from Proven and me,
> and we are waiting for a committer's approval on the PR
> https://github.com/apache/kafka/pull/15262.
>
> If we can't get a committer's approval for KAFKA-16162, KAFKA-16157 and
> KAFKA-16195 in time for 3.7, then we can mark JBOD as early access,
> assuming we merge at least KAFKA-16195.
>
> Regards,
> Omnia
>
> > On 26 Jan 2024, at 15:39, ka...@gnarula.com wrote:
> >
> > Apologies, I mentioned KAFKA-16157 twice in my previous message. I
> > intended to mention KAFKA-16195 with the PR at
> > https://github.com/apache/kafka/pull/15262 as the second JIRA.
> >
> > Thanks,
> > Gaurav
> >
> >> On 26 Jan 2024, at 15:34, ka...@gnarula.com wrote:
> >>
> >> Hi Stan,
> >>
> >> I wanted to share some updates about the bugs you shared earlier.
> >>
> >> - KAFKA-14616: I've reviewed and tested the PR from Colin and have
> observed
> >> the fix works as intended.
> >> - KAFKA-16162: I reviewed Proven's PR and found some gaps in the
> proposed fix. I've
> >> therefore raised https://github.com/apache/kafka/pull/15270 following
> a discussion with Luke in JIRA.
> >> - KAFKA-16082: I don't think this is marked as a blocker anymore. I'm
> awaiting
> >> feedback/reviews at https://github.com/apache/kafka/pull/15136
> >>
> >> In addition to the above, there are 2 JIRAs I'd like to bring
> everyone's attention to:
> >>
> >> - KAFKA-16157: This is similar to KAFKA-14616 and is marked as a
> blocker. I've raised
> >> https://github.com/apache/kafka/pull/15263 and am awaiting reviews on
> it.
> >> - KAFKA-16157: I raised this yesterday and have addressed feedback from
> Luke. This should
> >> hopefully get merged soon.
> >>
> >> Regards,
> >> Gaurav
> >>
> >>
> >>> On 24 Jan 2024, at 11:51, ka...@gnarula.com wrote:
> >>>
> >>> Hi Stanislav,
> >>>
> >>> Thanks for bringing these JIRAs/PRs up.
> >>>
> >>> I'll be testing the open PRs for KAFKA-14616 and KAFKA-16162 this week
> and I hope to have some feedback
> >>> by Friday. I gather the latter JIRA is marked as a WIP by Proven and
> he's away. I'll try to build on his work in the meantime.
> >>>
> >>> As for KAFKA-16082, we haven't been able to deduce a data loss
> scenario. There's a PR open
> >>> by me for promoting an abandoned future replica with approvals from
> Omnia and Proven,
> >>> so I'd appreciate a committer reviewing it.
> >>>
> >>> Regards,
> >>> Gaurav
> >>>
> >>> On 23 Jan 2024

Re: [VOTE] 3.7.0 RC2

2024-01-23 Thread Stanislav Kozlovski
mpact of the error messages?*
> > >>>>>
> > >>>>> I did not observe any obvious impact. I was able to send and
> receive
> > >>>>> messages as normal. But to be honest, I have no idea what else
> > >>>>> this might impact, so I did not try anything special.
> > >>>>>
> > >>>>> I think everyone upgrading an existing KRaft cluster will go
> through
> > >>> this
> > >>>>> stage (running Kafka 3.7 with an older metadata version for at
> least
> > a
> > >>>>> while). So even if it is just a logged exception without any other
> > >>>> impact I
> > >>>>> wonder if it might scare users from upgrading. But I leave it to
> > >>> others
> > >>>> to
> > >>>>> decide if this is a blocker or not.
> > >>>>>
> > >>>>
> > >>>> Hi Jakub,
> > >>>>
> > >>>> Thanks for trying the RC. I think what you found is a blocker bug
> > >>> because
> > >>>> it will generate a huge amount of logspam. I guess we didn't find it
> in
> > >>> junit
> > >>>> tests since logspam doesn't fail the automated tests. But certainly
> > it's
> > >>>> not suitable for production. Did you file a JIRA yet?
> > >>>>
> > >>>>> On Sun, Jan 14, 2024 at 10:17 PM Stanislav Kozlovski
> > >>>>>  wrote:
> > >>>>>
> > >>>>>> Hey Luke,
> > >>>>>>
> > >>>>>> This is an interesting problem. Given the fact that the KIP for
> > >>> having a
> > >>>>>> 3.8 release passed, I think it weights the scale towards not
> calling
> > >>>> this a
> > >>>>>> blocker and expecting it to be solved in 3.7.1.
> > >>>>>>
> > >>>>>> It is unfortunate that it would not seem safe to migrate to KRaft
> in
> > >>>> 3.7.0
> > >>>>>> (given the inability to rollback safely), but if that's true - the
> > >>> same
> > >>>>>> case would apply for 3.6.0. So in any case users would be
> expected
> > >>> to
> > >>>> use a
> > >>>>>> patch release for this.
> > >>>>
> > >>>> Hi Luke,
> > >>>>
> > >>>> Thanks for testing rollback. I think this is a case where the
> > >>>> documentation is wrong. The intention was to for the steps to
> > basically
> > >>> be:
> > >>>>
> > >>>> 1. roll all the brokers into zk mode, but with migration enabled
> > >>>> 2. take down the kraft quorum
> > >>>> 3. rmr /controller, allowing a hybrid broker to take over.
> > >>>> 4. roll all the brokers into zk mode without migration enabled (if
> > >>> desired)
> > >>>>
> > >>>> With these steps, there isn't really unavailability since a ZK
> > >>> controller
> > >>>> can be elected quickly after the kraft quorum is gone.
> > >>>>
> > >>>>>> Further, since we will have a 3.8 release - it is
> > >>>>>> likely we will ultimately recommend users upgrade from that
> version
> > >>>> given
> > >>>>>> its aim is to have strategic KRaft feature parity with ZK.
> > >>>>>> That being said, I am not 100% on this. Let me know whether you
> > think
> > >>>> this
> > >>>>>> should block the release, Luke. I am also tagging Colin and David
> to
> > >>>> weigh
> > >>>>>> in with their opinions, as they worked on the migration logic.
> > >>>>
> > >>>> The rollback docs are new in 3.7 so the fact that they're wrong is a
> > >>> clear
> > >>>> blocker, I think. But easy to fix, I believe. I will create a PR.
> > >>>>
> > >>>> best,
> > >>>> Colin
> > >>>>
> > >>>>>>
> > >>>>>> Hey Kirk and Chris,
> > >>>>>>
> > >>>>>> Unless I'm missing something - KAFKA-16029 is simply a bad log
> > >>> due
&g

Re: [VOTE] 3.7.0 RC2

2024-01-16 Thread Stanislav Kozlovski
Hi Kirk,

Given we are going to have to roll a new RC anyway, and the change is so
simple - might as well get it in!

On Mon, Jan 15, 2024 at 8:26 PM Kirk True  wrote:

> Hi Stanislav,
>
> On Sun, Jan 14, 2024, at 1:17 PM, Stanislav Kozlovski wrote:
> > Hey Kirk and Chris,
> >
> > Unless I'm missing something - KAFKA-16029 is simply a bad log due to
> > improper closing. And the PR description implies this has been present
> > since 3.5. While annoying, I don't see a strong reason for this to block
> > the release.
>
> I would imagine that it would result in concerned users reporting the
> issue.
>
> I took another look, and the code that causes the issue was indeed changed
> in 3.7. It is easily reproducible.
>
> The PR is ready for review: https://github.com/apache/kafka/pull/15186
>
> Thanks,
> Kirk



-- 
Best,
Stanislav


Re: [VOTE] 3.7.0 RC2

2024-01-15 Thread Stanislav Kozlovski
I wanted to circle back and confirm the integration tests + system tests,
plus give an overall update regarding status.

The integration tests have a fair amount of flakes. I ran and inspected 3
consecutive builds (57
<https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/57/>, 58
<https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/>, 59
<https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/59/>), then
cross-checked each run's failures via a script of mine to see any
consistent failures.
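
For illustration, here is a minimal sketch of that kind of cross-check - not
the actual script; it assumes each build's failing test names were saved one
per line into hypothetical files such as build_57.txt, build_58.txt and
build_59.txt:

from functools import reduce

def load_failures(path):
    # Return the set of failing test names recorded in the given file.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

failure_sets = [load_failures(p)
                for p in ("build_57.txt", "build_58.txt", "build_59.txt")]

# Tests failing in every build are the consistent failures worth triaging;
# tests failing only in some builds are treated as flakes.
consistent = reduce(set.intersection, failure_sets)
flaky = reduce(set.union, failure_sets) - consistent

print("Consistently failing:", sorted(consistent))
print("Flaky:", sorted(flaky))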

Three tests proved very flaky. Two are related to KIP-848 running under
KRaft. The third one is a Trogdor test. All 3 tests pass locally, hence I
deem them not blockers for the release. Especially since KIP-848 is in
early access, I am not particularly concerned with a flaky test. I opened
three JIRAs to track them:
- https://issues.apache.org/jira/browse/KAFKA-16134
- https://issues.apache.org/jira/browse/KAFKA-16135
- https://issues.apache.org/jira/browse/KAFKA-16136

As for the system tests, I again ran 2 consecutive builds (1
<https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1705045120--apache--3.7--d071cceffc/2024-01-11--001./2024-01-11--001./report.html>,
2
<https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1705113592--apache--3.7--d071cceffc/2024-01-12--001./2024-01-12--001./report.html>)
and I found 4 tests that exhibit consecutive failures.
- The whole analysis: https://hackmd.io/@hOneAGCrSmKSpL8VF-1HWQ/HyRgRJmta

The failing tests:
StreamsStandbyTask - https://issues.apache.org/jira/browse/KAFKA-16141
StreamsUpgradeTest - https://issues.apache.org/jira/browse/KAFKA-16139
QuotaTest - https://issues.apache.org/jira/browse/KAFKA-16138
ZookeeperMigrationTest - https://issues.apache.org/jira/browse/KAFKA-16140

I am reaching out to subject matter experts regarding the failures.

Thanks to everyone who contributed to testing the release. Here is a
general update regarding known blockers that were recently found:

We are treating https://issues.apache.org/jira/browse/KAFKA-16131 and
https://issues.apache.org/jira/browse/KAFKA-16101 as blockers.

https://issues.apache.org/jira/browse/KAFKA-16132 is another potential
issue that will likely be treated as a blocker.

Best,
Stanislav

On Mon, Jan 15, 2024 at 12:04 PM Jakub Scholz  wrote:

> *> Hi Jakub,> > Thanks for trying the RC. I think what you found is a
> blocker bug because it *
> *> will generate a huge amount of logspam. I guess we didn't find it in junit
> tests *
> *> since logspam doesn't fail the automated tests. But certainly it's not
> suitable *
> *> for production. Did you file a JIRA yet?*
>
> Hi Colin,
>
> I opened https://issues.apache.org/jira/browse/KAFKA-16131.
>
> Thanks & Regards
> Jakub
>
> On Mon, Jan 15, 2024 at 8:57 AM Colin McCabe  wrote:
>
> > Hi Stanislav,
> >
> > Thanks for making the first RC. The fact that it's titled RC2 is messing
> > with my mind a bit. I hope this doesn't make people think that we're
> > farther along than we are, heh.
> >
> > On Sun, Jan 14, 2024, at 13:54, Jakub Scholz wrote:
> > > *> Nice catch! It does seem like we should have gated this behind the
> > > metadata> version as KIP-858 implies. Is the cluster configured with
> > > multiple log> dirs? What is the impact of the error messages?*
> > >
> > > I did not observe any obvious impact. I was able to send and receive
> > > messages as normally. But to be honest, I have no idea what else
> > > this might impact, so I did not try anything special.
> > >
> > > I think everyone upgrading an existing KRaft cluster will go through
> this
> > > stage (running Kafka 3.7 with an older metadata version for at least a
> > > while). So even if it is just a logged exception without any other
> > impact I
> > > wonder if it might scare users from upgrading. But I leave it to others
> > to
> > > decide if this is a blocker or not.
> > >
> >
> > Hi Jakub,
> >
> > Thanks for trying the RC. I think what you found is a blocker bug because
> > it will generate a huge amount of logspam. I guess we didn't find it in
> junit
> > tests since logspam doesn't fail the automated tests. But certainly it's
> > not suitable for production. Did you file a JIRA yet?
> >
> > > On Sun, Jan 14, 2024 at 10:17 PM Stanislav Kozlovski
> > >  wrote:
> > >
> > >> Hey Luke,
> > >>
> > >> This is an interesting problem. Given the fact that the KIP for
> having a
> > >> 3.8 release passed, I think it weights the scale towards not call

[jira] [Created] (KAFKA-16141) StreamsStandbyTask##test_standby_tasks_rebalanceArguments:{ “metadata_quorum”: “ISOLATED_KRAFT”, “use_new_coordinator”: false} fails consistently in 3.7

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16141:
---

 Summary: 
StreamsStandbyTask##test_standby_tasks_rebalanceArguments:{ “metadata_quorum”: 
“ISOLATED_KRAFT”, “use_new_coordinator”: false} fails consistently in 3.7
 Key: KAFKA-16141
 URL: https://issues.apache.org/jira/browse/KAFKA-16141
 Project: Kafka
  Issue Type: Test
Affects Versions: 3.7.0
Reporter: Stanislav Kozlovski


{code:java}
kafkatest.tests.streams.streams_standby_replica_test.StreamsStandbyTask#test_standby_tasks_rebalanceArguments:{
 “metadata_quorum”: “ISOLATED_KRAFT”, “use_new_coordinator”: false}

TimeoutError("Did expect to read 'ACTIVE_TASKS:2 STANDBY_TASKS:[1-3]' from 
ubuntu@worker26")
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 184, in _do_run
data = self.run_test()
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 262, in run_test
return self.test_context.function(self.test)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/mark/_mark.py",
 line 433, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_standby_replica_test.py",
 line 79, in test_standby_tasks_rebalance
self.wait_for_verification(processor_1, "ACTIVE_TASKS:2 
STANDBY_TASKS:[1-3]", processor_1.STDOUT_FILE)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/base_streams_test.py",
 line 96, in wait_for_verification
err_msg="Did expect to read '%s' from %s" % (message, 
processor.node.account))
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/utils/util.py",
 line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from 
last_exception
ducktape.errors.TimeoutError: Did expect to read 'ACTIVE_TASKS:2 
STANDBY_TASKS:[1-3]' from ubuntu@worker26
 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16140) zookeeper_migration_test#TestMigration#test_reconcile_kraft_to_zk system test fails consistently on 3.7

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16140:
---

 Summary: 
zookeeper_migration_test#TestMigration#test_reconcile_kraft_to_zk system test 
fails consistently on 3.7
 Key: KAFKA-16140
 URL: https://issues.apache.org/jira/browse/KAFKA-16140
 Project: Kafka
  Issue Type: Test
Affects Versions: 3.7.0
Reporter: Stanislav Kozlovski


{code:java}
kafkatest.tests.core.zookeeper_migration_test.TestMigration#test_reconcile_kraft_to_zk

AssertionError('Did not see expected INFO log after migration')
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 184, in _do_run
data = self.run_test()
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 262, in run_test
return self.test_context.function(self.test)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/core/zookeeper_migration_test.py",
 line 367, in test_reconcile_kraft_to_zk
assert saw_expected_log, "Did not see expected INFO log after migration"
AssertionError: Did not see expected INFO log after migration{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16139) StreamsUpgradeTest fails consistently in 3.7.0

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16139:
---

 Summary: StreamsUpgradeTest fails consistently in 3.7.0
 Key: KAFKA-16139
 URL: https://issues.apache.org/jira/browse/KAFKA-16139
 Project: Kafka
  Issue Type: Test
Affects Versions: 3.7.0
Reporter: Stanislav Kozlovski


h1. 
kafkatest.tests.streams.streams_upgrade_test.StreamsUpgradeTest#test_rolling_upgrade_with_2_bouncesArguments:\{
 “from_version”: “3.5.1”, “to_version”: “3.7.0-SNAPSHOT”}
 
{{TimeoutError('Could not detect Kafka Streams version 3.7.0-SNAPSHOT on 
ubuntu@worker2')}}

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16138) QuotaTest system test fails consistently in 3.7

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16138:
---

 Summary: QuotaTest system test fails consistently in 3.7
 Key: KAFKA-16138
 URL: https://issues.apache.org/jira/browse/KAFKA-16138
 Project: Kafka
  Issue Type: Test
Affects Versions: 3.7.0
Reporter: Stanislav Kozlovski


as mentioned in 
[https://hackmd.io/@hOneAGCrSmKSpL8VF-1HWQ/HyRgRJmta#kafkatesttestsclientquota_testQuotaTesttest_quotaArguments-%E2%80%9Coverride_quota%E2%80%9D-true-%E2%80%9Cquota_type%E2%80%9D-%E2%80%9Cuser%E2%80%9D,]
 the test fails consistently:
{code:java}
ValueError('max() arg is an empty sequence')
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 184, in _do_run
data = self.run_test()
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/tests/runner_client.py",
 line 262, in run_test
return self.test_context.function(self.test)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.7/site-packages/ducktape/mark/_mark.py",
 line 433, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py",
 line 169, in test_quota
success, msg = self.validate(self.kafka, producer, consumer)
  File 
"/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py",
 line 197, in validate
metric.value for k, metrics in producer.metrics(group='producer-metrics', 
name='outgoing-byte-rate', client_id=producer.client_id) for metric in metrics
ValueError: max() arg is an empty sequence {code}
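
For reference, this is Python's standard behaviour when max() receives an empty
iterable - here the producer reported no 'outgoing-byte-rate' samples at all. A
minimal standalone illustration (hypothetical data, not the ducktape code):
{code}
samples = []  # e.g. the metrics query returned no samples
try:
    peak = max(samples)
except ValueError as e:
    print("validation failed:", e)  # max() arg is an empty sequence
{code}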
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16136) CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated() is very flaky

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16136:
---

 Summary: 
CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated() is very flaky
 Key: KAFKA-16136
 URL: https://issues.apache.org/jira/browse/KAFKA-16136
 Project: Kafka
  Issue Type: Test
Reporter: Stanislav Kozlovski


The test failed 3 builds in a row (with different JDK versions) in the 3.7 
release branch as part of verifying the release

Locally it passed

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16135) kafka.api.PlaintextConsumerTest.testPerPartitionLeadMetricsCleanUpWithSubscribe(String, String).quorum=kraft+kip848.groupProtocol=consumer is flaky

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16135:
---

 Summary: 
kafka.api.PlaintextConsumerTest.testPerPartitionLeadMetricsCleanUpWithSubscribe(String,
 String).quorum=kraft+kip848.groupProtocol=consumer is flaky
 Key: KAFKA-16135
 URL: https://issues.apache.org/jira/browse/KAFKA-16135
 Project: Kafka
  Issue Type: Test
Reporter: Stanislav Kozlovski


The test
kafka.api.PlaintextConsumerTest.testPerPartitionLeadMetricsCleanUpWithSubscribe(String,
 String).quorum=kraft+kip848.groupProtocol=consumer
is incredibly flaky - it failed 3 builds in a row for the 3.7 release 
candidate, but with different JDK versions. Locally it also fails often and 
requires a few retries to pass

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16134) kafka.api.PlaintextConsumerTest.testPerPartitionLagMetricsCleanUpWithSubscribe(String, String).quorum=kraft+kip848.groupProtocol=consumer is flaky

2024-01-15 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-16134:
---

 Summary: 
kafka.api.PlaintextConsumerTest.testPerPartitionLagMetricsCleanUpWithSubscribe(String,
 String).quorum=kraft+kip848.groupProtocol=consumer is flaky
 Key: KAFKA-16134
 URL: https://issues.apache.org/jira/browse/KAFKA-16134
 Project: Kafka
  Issue Type: Test
Reporter: Stanislav Kozlovski


The following test is very flaky. It failed 3 times consecutively in Jenkins 
runs for the 3.7 release candidate.
kafka.api.PlaintextConsumerTest.testPerPartitionLagMetricsCleanUpWithSubscribe(String,
 String).quorum=kraft+kip848.groupProtocol=consumer
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.7.0 RC2

2024-01-14 Thread Stanislav Kozlovski
quest not be sent at all by the brokers? (I did not opened a JIRA for it,
> but I can open one if you agree this is not expected)
>
> Thanks & Regards
> Jakub
>
> On Sat, Jan 13, 2024 at 8:03 AM Luke Chen  wrote:
>
> > Hi Stanislav,
> >
> > I commented in the "Apache Kafka 3.7.0 Release" thread, but maybe you
> > missed it.
> > cross-posting here:
> >
> > There is a bug KAFKA-16101
> > <https://issues.apache.org/jira/browse/KAFKA-16101> reporting that
> "Kafka
> > cluster will be unavailable during KRaft migration rollback".
> > The impact for this issue is that if brokers try to rollback to ZK mode
> > during KRaft migration process, there will be a period of time the
> cluster
> > is unavailable.
> > Since ZK migrating to KRaft feature is a production ready feature, I
> think
> > this should be addressed soon.
> > Do you think this is a blocker for v3.7.0?
> >
> > Thanks.
> > Luke
> >
> > On Sat, Jan 13, 2024 at 8:36 AM Chris Egerton 
> > wrote:
> >
> > > Thanks, Kirk!
> > >
> > > @Stanislav--do you believe that this warrants a new RC?
> > >
> > > On Fri, Jan 12, 2024, 19:08 Kirk True  wrote:
> > >
> > > > Hi Chris/Stanislav,
> > > >
> > > > I'm working on the 'Unable to find FetchSessionHandler' log problem
> > > > (KAFKA-16029) and have put out a draft PR (
> > > > https://github.com/apache/kafka/pull/15186). I will use the
> quickstart
> > > > approach as a second means to reproduce/verify while I wait for the
> > PR's
> > > > Jenkins job to finish.
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> > > > > Hi Stanislav,
> > > > >
> > > > >
> > > > > Thanks for running this release!
> > > > >
> > > > > To verify, I:
> > > > > - Built from source using Java 11 with both:
> > > > > - - the 3.7.0-rc2 tag on GitHub
> > > > > - - the kafka-3.7.0-src.tgz artifact from
> > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > > > > - Checked signatures and checksums
> > > > > - Ran the quickstart using both:
> > > > > - - The kafka_2.13-3.7.0.tgz artifact from
> > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with
> > Java
> > > > 11
> > > > > and Scala 13 in KRaft mode
> > > > > - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> > > > > - Ran all unit tests
> > > > > - Ran all integration tests for Connect and MM2
> > > > >
> > > > >
> > > > > I found two minor areas for concern:
> > > > >
> > > > > 1. (Possibly a blocker)
> > > > > When running the quickstart, I noticed this ERROR-level log message
> > > being
> > > > > emitted frequently (not not every time) when I killed my console
> > > consumer
> > > > > via ctrl-C:
> > > > >
> > > > > > [2024-01-12 11:00:31,088] ERROR [Consumer
> > clientId=console-consumer,
> > > > > groupId=console-consumer-74388] Unable to find FetchSessionHandler
> > for
> > > > node
> > > > > 1. Ignoring fetch response
> > > > > (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> > > > >
> > > > > I see that this error message is already reported in
> > > > > https://issues.apache.org/jira/browse/KAFKA-16029. I think we
> should
> > > > > prioritize fixing it for this release. I know it's probably benign
> > but
> > > > it's
> > > > > really not a good look for us when basic operations log error
> > messages,
> > > > and
> > > > > it may give new users some headaches.
> > > > >
> > > > >
> > > > > 2. (Probably not a blocker)
> > > > > The following unit tests failed the first time around, and all of
> > them
> > > > > passed the second time I ran them:
> > > > >
> > > > > - (clients)
> > > > ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> > > > > - (clients) SelectorTest.testConnectionsByClientMetric()
> > > > > - (clients) Tls13SelectorTest.testConnect

[VOTE] 3.7.0 RC2

2024-01-11 Thread Stanislav Kozlovski
Hello Kafka users, developers, and client-developers,

This is the first candidate for release of Apache Kafka 3.7.0.

Note it's named "RC2" because I had a few "failed" RCs that I had
cut/uploaded but ultimately had to scrap prior to announcing due to new
blockers arriving before I could even announce them.

Further - I haven't yet been able to set up the system tests successfully.
And the integration/unit tests do have a few failures that I have to spend
time triaging. I would appreciate help from anyone who notices failures in
tests they have subject-matter expertise in. Expect me to follow up in
a day or two with more detailed analysis.

Major changes include:
- Early Access to KIP-848 - the next generation of the consumer rebalance
protocol
- KIP-858: Adding JBOD support to KRaft
- KIP-714: Observability into Client metrics via a standardized interface

Check more information in the WIP blog post:
https://github.com/apache/kafka-site/pull/578

Release notes for the 3.7.0 release:
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html

*** Please download, test and vote by Thursday, January 18, 9am PT ***

Usually these deadlines tend to be 2-3 days, but due to this being the
first RC and the tests not having run yet, I am giving it a bit more time.

Kafka's KEYS file containing PGP keys we use to sign the release:
https://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/

* Docker release artifact to be voted upon:
apache/kafka:3.7.0-rc2

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/org/apache/kafka/

* Javadoc:
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/

* Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
https://github.com/apache/kafka/releases/tag/3.7.0-rc2

* Documentation:
https://kafka.apache.org/37/documentation.html

* Protocol:
https://kafka.apache.org/37/protocol.html

* Successful Jenkins builds for the 3.7 branch:
Unit/integration tests:
https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/
There are failing tests here. I have to follow up with triaging some of the
failures and figuring out if they're actual problems or simply flakes.

System tests: https://jenkins.confluent.io/job/system-test-kafka/job/3.7/

No successful system test runs yet. I am working on getting the job to run.

* Successful Docker Image Github Actions Pipeline for 3.7 branch:
Attached are the scan_report and report_jvm output files from the Docker
Build run:
https://github.com/apache/kafka/actions/runs/7486094960/job/20375761673

And the final docker image build job - Docker Build Test Pipeline:
https://github.com/apache/kafka/actions/runs/7486178277

The image is apache/kafka:3.7.0-rc2 -
https://hub.docker.com/layers/apache/kafka/3.7.0-rc2/images/sha256-5b4707c08170d39549fbb6e2a3dbb83936a50f987c0c097f23cb26b4c210c226?context=explore

/**

Thanks,
Stanislav Kozlovski

kafka/test:test (alpine 3.18.5)
===
Total: 0 (HIGH: 0, CRITICAL: 0)



Re: Apache Kafka 3.7.0 Release

2024-01-10 Thread Stanislav Kozlovski
Thanks Colin,

With that, I believe we are out of blockers. I was traveling today and
couldn't build an RC - expect one to be published tomorrow (barring any
problems).

In the meanwhile - here is a PR for the 3.7 blog post -
https://github.com/apache/kafka-site/pull/578

Best,
Stan

On Wed, Jan 10, 2024 at 12:06 AM Colin McCabe  wrote:

> KAFKA-16094 has been fixed and backported to 3.7.
>
> Colin
>
>
> On Mon, Jan 8, 2024, at 14:52, Colin McCabe wrote:
> > On an unrelated note, I found a blocker bug related to upgrades from
> > 3.6 (and earlier) to 3.7.
> >
> > The JIRA is here:
> >   https://issues.apache.org/jira/browse/KAFKA-16094
> >
> > Fix here:
> >   https://github.com/apache/kafka/pull/15153
> >
> > best,
> > Colin
> >
> >
> > On Mon, Jan 8, 2024, at 14:47, Colin McCabe wrote:
> >> Hi Ismael,
> >>
> >> I wasn't aware of that. If we are required to publish all modules, then
> >> this is working as intended.
> >>
> >> I am a bit curious if we've discussed why we need to publish the server
> >> modules to Sonatype. Is there a discussion about the pros and cons of
> >> this somewhere?
> >>
> >> regards,
> >> Colin
> >>
> >> On Mon, Jan 8, 2024, at 14:09, Ismael Juma wrote:
> >>> All modules are published to Sonatype - that's a requirement. You may
> be
> >>> missing the fact that `core` is published as `kafka_2.13` and
> `kafka_2.12`.
> >>>
> >>> Ismael
> >>>
> >>> On Tue, Jan 9, 2024 at 12:00 AM Colin McCabe 
> wrote:
> >>>
> >>>> Hi Ismael,
> >>>>
> >>>> It seems like both the metadata gradle module and the server-common
> module
> >>>> are getting published to Sonatype as separate artifacts, unless I'm
> >>>> misunderstanding something. Example:
> >>>>
> >>>> https://central.sonatype.com/search?q=kafka-server-common
> >>>>
> >>>> I don't see kafka-core getting published, but maybe other private
> >>>> server-side gradle modules are getting published.
> >>>>
> >>>> This seems bad. Is there a reason to publish modules that are only
> used by
> >>>> the server on Sonatype?
> >>>>
> >>>> best,
> >>>> Colin
> >>>>
> >>>>
> >>>> On Mon, Jan 8, 2024, at 12:50, Ismael Juma wrote:
> >>>> > Hi Colin,
> >>>> >
> >>>> > I think you may have misunderstood what they mean by gradle
> metadata -
> >>>> it's
> >>>> > not the Kafka metadata module.
> >>>> >
> >>>> > Ismael
> >>>> >
> >>>> > On Mon, Jan 8, 2024 at 9:45 PM Colin McCabe 
> wrote:
> >>>> >
> >>>> >> Oops, hit send too soon. I see that #15127 was already merged. So
> we
> >>>> >> should no longer be publishing :metadata as part of the clients
> >>>> artifacts,
> >>>> >> right?
> >>>> >>
> >>>> >> thanks,
> >>>> >> Colin
> >>>> >>
> >>>> >>
> >>>> >> On Mon, Jan 8, 2024, at 11:42, Colin McCabe wrote:
> >>>> >> > Hi Apporv,
> >>>> >> >
> >>>> >> > Please remove the metadata module from any artifacts published
> for
> >>>> >> > clients. It is only used by the server.
> >>>> >> >
> >>>> >> > best,
> >>>> >> > Colin
> >>>> >> >
> >>>> >> >
> >>>> >> > On Sun, Jan 7, 2024, at 03:04, Apoorv Mittal wrote:
> >>>> >> >> Hi Colin,
> >>>> >> >> Thanks for the response. The only reason for asking the
> question of
> >>>> >> >> publishing the metadata is because that's present in previous
> client
> >>>> >> >> releases. For more context, the description of PR
> >>>> >> >> <https://github.com/apache/kafka/pull/15127> holds the details
> and
> >>>> >> waiting
> >>>> >> >> for the confirmation there prior to the merge.
> >>>> >> >>
> >>>> >> >> Regards,
> >>>> >> >> Apoorv Mitta

Re: Apache Kafka 3.7.0 Release

2024-01-08 Thread Stanislav Kozlovski
I've almost got an RC out to test. I just need a review on the
kafka-site repo to update the website with the appropriate 3.7 subpages -
https://github.com/apache/kafka-site/pull/576/

On Mon, Jan 8, 2024 at 10:05 AM Lucas Brutschy
 wrote:

> Hi,
>
> we have fixed one memory leak in Kafka Streams, but there is still at
> least one missing in the code. I created
> https://issues.apache.org/jira/browse/KAFKA-16089 which is a blocker.
>
> Cheers,
> Lucas
>
> On Sun, Jan 7, 2024 at 12:05 PM Apoorv Mittal 
> wrote:
> >
> > Hi Colin,
> > Thanks for the response. The only reason for asking the question of
> > publishing the metadata is because that's present in previous client
> > releases. For more context, the description of PR
> > <https://github.com/apache/kafka/pull/15127> holds the details and
> waiting
> > for the confirmation there prior to the merge.
> >
> > Regards,
> > Apoorv Mittal
> > +44 7721681581
> >
> >
> > On Fri, Jan 5, 2024 at 10:22 PM Colin McCabe  wrote:
> >
> > > metadata is an internal gradle module. It is not used by clients. So I
> > > don't see why you would want to publish it (unless I'm misunderstanding
> > > something).
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Fri, Jan 5, 2024, at 10:05, Stanislav Kozlovski wrote:
> > > > Thanks for reporting the blockers, folks. Good job finding.
> > > >
> > > > I have one ask - can anybody with Gradle expertise help review this
> small
> > > > PR? https://github.com/apache/kafka/pull/15127 (+1, -1)
> > > > In particular, we are wondering whether we need to publish module
> > > metadata
> > > > as part of the gradle publishing process.
> > > >
> > > >
> > > > On Fri, Jan 5, 2024 at 3:56 PM Proven Provenzano
> > > >  wrote:
> > > >
> > > >> We have potentially one more blocker
> > > >> https://issues.apache.org/jira/browse/KAFKA-16082 which might
> cause a
> > > data
> > > >> loss scenario with JBOD in KRaft.
> > > >> Initial analysis thought this is a problem and further review looks
> > > like it
> > > >> isn't but we are continuing to dig into the issue to ensure that it
> > > isn't.
> > > >> We would request feedback on the bug from anyone who is familiar
> with
> > > this
> > > >> code.
> > > >>
> > > >> --Proven
> > > >>
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Stanislav
> > >
>


-- 
Best,
Stanislav


Re: Apache Kafka 3.7.0 Release

2024-01-05 Thread Stanislav Kozlovski
Thanks for reporting the blockers, folks. Good job finding.

I have one ask - can anybody with Gradle expertise help review this small
PR? https://github.com/apache/kafka/pull/15127 (+1, -1)
In particular, we are wondering whether we need to publish module metadata
as part of the gradle publishing process.


On Fri, Jan 5, 2024 at 3:56 PM Proven Provenzano
 wrote:

> We have potentially one more blocker
> https://issues.apache.org/jira/browse/KAFKA-16082 which might cause a data
> loss scenario with JBOD in KRaft.
> Initial analysis thought this is a problem and further review looks like it
> isn't but we are continuing to dig into the issue to ensure that it isn't.
> We would request feedback on the bug from anyone who is familiar with this
> code.
>
> --Proven
>


-- 
Best,
Stanislav


Re: Apache Kafka 3.7.0 Release

2024-01-04 Thread Stanislav Kozlovski
Thanks Apoorv, I was going to update the mailing thread as well.

Major kudos to Apoorv for the thorough work debugging and getting to the
bottom of this tricky publishing issue, a subtle regression from the work
he did in making the kafka-clients jar shadowed.

On Thu, Jan 4, 2024 at 5:09 PM Apoorv Mittal 
wrote:

> Hi Stan,
> I have opened the minor PR: https://github.com/apache/kafka/pull/15127 to
> fix publishing the dependency. Once discussed and merged in trunk, I'll
> update the 3.7 branch as well.
>
> Regards,
> Apoorv Mittal
> +44 7721681581
>
>
> On Thu, Jan 4, 2024 at 12:49 PM Matthias J. Sax  wrote:
>
> > We found a blocker for 3.7:
> > https://issues.apache.org/jira/browse/KAFKA-16077
> >
> > Already having a PR under review to fix it.
> >
> >
> > -Matthias
> >
> > On 1/3/24 10:43 AM, Stanislav Kozlovski wrote:
> > > Hey all, happy new year.
> > >
> > > Thanks for the heads up Almog. Makes sense.
> > >
> > > To give an update - I haven't been able to resolve the gradlewAll
> publish
> > > failure, and as such haven't been able to release an RC.
> > > As a minor barrier, I have to also update the year in the NOTICE file,
> > > otherwise the release script won't let me continue -
> > > https://github.com/apache/kafka/pull/15111
> > >
> > > Apoorv and I synced offline and ran a few tests to debug the issue
> > > regarding the clients build. I successfully executed `publish` when
> > > pointing toward a custom jfrog repo with both JDK 8 and 17. Inspecting
> > the
> > > debug logs, the task that previously failed
> > > `:clients:publishMavenJavaPublicationToMavenRepository'` passed
> > > successfully. Here's a sample of the logs -
> > >
> >
> https://gist.github.com/stanislavkozlovski/841060cb467ec1d179cc9f293c8702e7
> > >
> > > Having read the release.py script a few times, I am not able to see
> what
> > is
> > > different in the setup there. It simply clones the repo anew, gets the
> > 3.7
> > > branch and runs the same command.
> > >
> > > At this point, I am contemplating pushing a commit to 3.7 that modifies
> > > the release.py file to enable debug logging on the command:
> > > diff --git a/release.py b/release.py
> > > index 43c5809861..e299e10e74 100755
> > > --- a/release.py
> > > +++ b/release.py
> > > @@ -675,7 +675,7 @@ with
> > > open(os.path.expanduser("~/.gradle/gradle.properties")) as f:
> > >   contents = f.read()
> > >   if not user_ok("Going to build and upload mvn artifacts based on
> these
> > > settings:\n" + contents + '\nOK (y/n)?: '):
> > >   fail("Retry again later")
> > > -cmd("Building and uploading archives", "./gradlewAll publish",
> > > cwd=kafka_dir, env=jdk8_env, shell=True)
> > > +cmd("Building and uploading archives", "./gradlewAll publish --debug",
> > > cwd=kafka_dir, env=jdk8_env, shell=True)
> > >   cmd("Building and uploading archives", "mvn deploy -Pgpg-signing",
> > > cwd=streams_quickstart_dir, env=jdk8_env, shell=True)
> > >
> > >   release_notification_props = { 'release_version': release_version,
> > > (END)
> > >
> > > and continuing to debug through that.
> > >
> > > Since the release.py script grabs a new copy of origin, we have to
> modify
> > > upstream. An alternative is for me to use my local github Kafka repo,
> but
> > > that may result in the script pushing a build of that into the remote
> > > servers.
> > >
> > > On Tue, Jan 2, 2024 at 8:17 PM Almog Gavra 
> > wrote:
> > >
> > >> Hello Stan,
> > >>
> > >> I wanted to give you a heads up that
> > >> https://github.com/apache/kafka/pull/15073 (
> > >> https://issues.apache.org/jira/browse/KAFKA-16046) was identified as
> a
> > >> blocker regression and should be merged to trunk by EOD.
> > >>
> > >> Cheers,
> > >> Almog
> > >>
> > >> On Tue, Jan 2, 2024 at 4:20 AM Stanislav Kozlovski
> > >>  wrote:
> > >>
> > >>> Hi Apoorv,
> > >>>
> > >>> Thanks for taking ownership and looking into this! One more caveat is
> > >> that
> > >>> I believe this first publish is run with JDK 8, as the release.py
> runs
> > >> with
> > >>> both JDK 

[jira] [Resolved] (KAFKA-16046) Stream Stream Joins fail after restoration with deserialization exceptions

2024-01-03 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-16046.
-
Resolution: Fixed

> Stream Stream Joins fail after restoration with deserialization exceptions
> --
>
> Key: KAFKA-16046
> URL: https://issues.apache.org/jira/browse/KAFKA-16046
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.7.0
>Reporter: Almog Gavra
>Assignee: Almog Gavra
>Priority: Blocker
>  Labels: streams
> Fix For: 3.7.0
>
>
> Before KIP-954, the `KStreamImplJoin` class would always create 
> non-timestamped persistent windowed stores. After that KIP, the default was 
> changed to create timestamped stores. This wasn't compatible because, during 
> restoration, timestamped stores have their changelog values transformed to 
> prepend the timestamp to the value. This caused serialization errors when 
> trying to read from the store because the deserializers did not expect the 
> timestamp to be prepended.
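>
> A minimal standalone sketch of that mismatch (hypothetical serialization
> format, not the actual Kafka Streams code; it only shows how a plain
> deserializer breaks once a timestamp is prepended to the value):
> {code}
> import struct
>
> def serialize_plain(value: str) -> bytes:
>     return value.encode("utf-8")
>
> def prepend_timestamp(ts_ms: int, payload: bytes) -> bytes:
>     # assumption for illustration: an 8-byte big-endian timestamp prefix
>     return struct.pack(">q", ts_ms) + payload
>
> def deserialize_plain(payload: bytes) -> str:
>     return payload.decode("utf-8")
>
> original = serialize_plain("join-value")
> restored = prepend_timestamp(1_700_000_000_000, original)
>
> print(deserialize_plain(original))   # works: 'join-value'
> try:
>     print(deserialize_plain(restored))
> except UnicodeDecodeError as e:
>     # the prefixed bytes are not valid UTF-8, so the plain deserializer fails
>     print("deserialization failed:", e)
> {code}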



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.7.0 Release

2024-01-03 Thread Stanislav Kozlovski
Hey all, happy new year.

Thanks for the heads up Almog. Makes sense.

To give an update - I haven't been able to resolve the gradlewAll publish
failure, and as such haven't been able to release an RC.
As a minor barrier, I have to also update the year in the NOTICE file,
otherwise the release script won't let me continue -
https://github.com/apache/kafka/pull/15111

Apoorv and I synced offline and ran a few tests to debug the issue
regarding the clients build. I successfully executed `publish` when
pointing toward a custom jfrog repo with both JDK 8 and 17. Inspecting the
debug logs, the task that previously failed
`:clients:publishMavenJavaPublicationToMavenRepository'` passed
successfully. Here's a sample of the logs -
https://gist.github.com/stanislavkozlovski/841060cb467ec1d179cc9f293c8702e7

Having read the release.py script a few times, I am not able to see what is
different in the setup there. It simply clones the repo anew, gets the 3.7
branch and runs the same command.

At this point, I am contemplating pushing a commit to 3.7 that modifies the
release.py file to enable debug logging on the command:
diff --git a/release.py b/release.py
index 43c5809861..e299e10e74 100755
--- a/release.py
+++ b/release.py
@@ -675,7 +675,7 @@ with
open(os.path.expanduser("~/.gradle/gradle.properties")) as f:
 contents = f.read()
 if not user_ok("Going to build and upload mvn artifacts based on these
settings:\n" + contents + '\nOK (y/n)?: '):
 fail("Retry again later")
-cmd("Building and uploading archives", "./gradlewAll publish",
cwd=kafka_dir, env=jdk8_env, shell=True)
+cmd("Building and uploading archives", "./gradlewAll publish --debug",
cwd=kafka_dir, env=jdk8_env, shell=True)
 cmd("Building and uploading archives", "mvn deploy -Pgpg-signing",
cwd=streams_quickstart_dir, env=jdk8_env, shell=True)

 release_notification_props = { 'release_version': release_version,
(END)

and continuing to debug through that.

Since the release.py script grabs a new copy of origin, we have to modify
upstream. An alternative is for me to use my local github Kafka repo, but
that may result in the script pushing a build of that into the remote
servers.

On Tue, Jan 2, 2024 at 8:17 PM Almog Gavra  wrote:

> Hello Stan,
>
> I wanted to give you a heads up that
> https://github.com/apache/kafka/pull/15073 (
> https://issues.apache.org/jira/browse/KAFKA-16046) was identified as a
> blocker regression and should be merged to trunk by EOD.
>
> Cheers,
> Almog
>
> On Tue, Jan 2, 2024 at 4:20 AM Stanislav Kozlovski
>  wrote:
>
> > Hi Apoorv,
> >
> > Thanks for taking ownership and looking into this! One more caveat is
> that
> > I believe this first publish is run with JDK 8, as the release.py runs
> with
> > both JDK 8 and (if I recall correctly) 17 versions. This seems to fail on
> > the first one - so JDK 8.
> > Not sure if that is related in any way. And I'm also not sure if it
> should
> > be kafka-clients or just clients.
> >
> > On Sat, Dec 30, 2023 at 10:48 AM Apoorv Mittal  >
> > wrote:
> >
> > > Hi Stan,
> > > Thanks for looking into the release. I worked with `./gradlewAll
> > > publishToMavenLocal`, which generates the respective `kafka-clients.jar`
> > > and deploys it to the local Maven repository. I believed that `./gradlewAll
> > > publish` should just publish the artifacts to the remote repository and
> > > hence should always work, as the jars successfully get deployed to local Maven.
> > >
> > > Though now I set up the remote private maven repository for myself (on
> > > jfrog) and tried `./gradlewAll publish` on the 3.7 branch and
> > > successfully completed the build with all artifacts uploaded to the
> > remote
> > > repository. What seems strange to me is the error you mentioned in the
> > > previous email regarding the reference of the clients jar. I suppose
> the
> > > reference should be to `kafka-clients.jar` rather than `clients.jar`; I
> > > might be missing something else that gets triggered in the release
> > > pipeline.
> > > Do you think I should set up the remote repository as per the
> > instructions
> > > in `release.py` and try running `./release.py` as that might do
> something
> > > different, though I suspect that it should?
> > >
> > > [image: Screenshot 2023-12-30 at 9.33.42 AM.png]
> > >
> > >
> > > Regards,
> > > Apoorv Mittal
> > >
> > >
> > > On Fri, Dec 29, 2023 at 2:13 AM Colin McCabe 
> wrote:
> > >
> > >> Just to update this thread, everything in KAFKA-14127 is done now. A
> few
> > >> tasks got moved to a 

Re: Apache Kafka 3.7.0 Release

2024-01-02 Thread Stanislav Kozlovski
Hi Apoorv,

Thanks for taking ownership and looking into this! One more caveat is that
I believe this first publish is run with JDK 8, as the release.py runs with
both JDK 8 and (if I recall correctly) 17 versions. This seems to fail on
the first one - so JDK 8.
Not sure if that is related in any way. And I'm also not sure if it should
be kafka-clients or just clients.

On Sat, Dec 30, 2023 at 10:48 AM Apoorv Mittal 
wrote:

> Hi Stan,
> Thanks for looking into the release. I worked with `./gradlewAll
> publishToMavenLocal`, which generates the respective `kafka-clients.jar`
> and deploys it to the local Maven repository. I believed that `./gradlewAll
> publish` should just publish the artifacts to the remote repository and hence
> should always work, as the jars successfully get deployed to local Maven.
>
> Though now I set up the remote private maven repository for myself (on
> jfrog) and tried `./gradlewAll publish` on the 3.7 branch and
> successfully completed the build with all artifacts uploaded to the remote
> repository. What seems strange to me is the error you mentioned in the
> previous email regarding the reference of the clients jar. I suppose the
> reference should be to `kafka-clients.jar` rather than `clients.jar`; I
> might be missing something else that gets triggered in the release pipeline.
> Do you think I should set up the remote repository as per the instructions
> in `release.py` and try running `./release.py` as that might do something
> different, though I suspect that it should?
>
> [image: Screenshot 2023-12-30 at 9.33.42 AM.png]
>
>
> Regards,
> Apoorv Mittal
>
>
> On Fri, Dec 29, 2023 at 2:13 AM Colin McCabe  wrote:
>
>> Just to update this thread, everything in KAFKA-14127 is done now. A few
>> tasks got moved to a separate umbrella JIRA.
>>
>> Some folks are going to do more testing, both manual and automated, in
>> the next week or two. I think this will give us a good indicator of
>> stability and what we need to fix.
>>
>> Right now I'm leaning towards just making it GA since that's how most
>> features work. It's kind of rare for us to do a multi-step rollout for new
>> features.
>>
>> best,
>> Colin
>>
>>
>> On Wed, Dec 20, 2023, at 03:43, Mickael Maison wrote:
>> > Hi,
>> >
>> > With the current timeline for 3.7, I tend to agree with Viktor that
>> > JBOD support in KRaft is unlikely to receive the extensive testing
>> > this feature needs before releasing. And that's not counting the
>> > testing tasks left to do in
>> > https://issues.apache.org/jira/browse/KAFKA-14127.
>> >
>> > I'm fine sticking to the current 3.7 timeline but I'd err on the safe
>> > side and mark JBOD as early access to avoid major issues. Kafka is
>> > known for its robustness and resiliency and we certainly don't want to
>> > lose the trust we gained over years.
>> >
>> > Thanks,
>> > Mickael
>> >
>> > On Wed, Dec 20, 2023 at 12:24 AM Ismael Juma  wrote:
>> >>
>> >> Hi Viktor,
>> >>
>> >> Extending the code freeze doesn't help stabilize things. If we have
>> >> important bugs for JBOD, we should mark those as blockers and we'll
>> wait
>> >> until they are fixed if the fixes won't take too long (as usual).
>> >>
>> >> Ismael
>> >>
>> >> On Tue, Dec 19, 2023 at 11:58 AM Viktor Somogyi-Vass
>> >>  wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I was wondering what people think about extending the code freeze
>> date to
>> >> > early January?
>> >> > The reason I'm asking is that there are still a couple of testing
>> gaps in
>> >> > JBOD (https://issues.apache.org/jira/browse/KAFKA-14127) which I
>> think is
>> >> > very important to finish to ensure a high quality release (after all
>> this
>> >> > supposed to be the last 3.x) and secondly the year end holidays for
>> many
>> >> > people are coming fast, which means we'll likely have less people
>> working
>> >> > on testing and validation. In my opinion it would strengthen the
>> release if
>> >> > we could spend a week in January to really finish off JBOD and do a
>> 2 week
>> >> > stabilization.
>> >> >
>> >> > What do you all think?
>> >> >
>> >> > Best,
>> >> > Viktor
>> >> >
>> >> > On Tue, Dec 12, 2023 at 2:59 PM Stanislav Kozlovski
>> >> >  wrot

Re: Apache Kafka 3.7.0 Release

2023-12-28 Thread Stanislav Kozlovski
Hey all,

Code Freeze has been over for some time, and I have been trying to build an
RC. I have hit a problem trying to build it, hence am reaching out for some
help.

The error I'm getting while running release.py is:

./gradlewAll publish
...
> * What went wrong:
> Execution failed for task 
> ':clients:publishMavenJavaPublicationToMavenRepository'.
> > Failed to publish publication 'mavenJava' to repository 'maven'
>> Invalid publication 'mavenJava': artifact file does not exist: 
> '/Users/stanislav/Documents/code/kafka-release/.release_work_dir/kafka/clients/build/libs/clients-3.7.0-all.jar'

I think it might be related to https://github.com/apache/kafka/pull/14618.
I inspected the Gradle file thoroughly and even tried debugging locally,
but without success. The release script itself also creates a clean copy of
the repo, so changing stuff locally doesn't really apply to it (and
probably isn't a good idea)

I spoke to Xavier Leaute a bit and he suggested that we probably need to
have an "afterEvaluate" block which assigns the project's name to the
archivesBaseName, saying that we need that because the shadow plugin
doesn't honor the "kafka-clients" archivesBaseName override.
I could raise a PR that does this if others agree it's the right solution.
I won't lie - this is above my head - so I'm not certain about it. (and a
quick search doesn't show much)

PS: I realize I'm building an RC to run tests against, but it occurred to
me that I haven't inspected any of the integration tests or system tests
for actual failures. I ack integration tests would usually block the CI
build and therefore be caught - but don't we look at the system tests prior
to creating an RC? Or is this something that's done with the RC?

Best,
Stan

On Wed, Dec 20, 2023 at 12:44 PM Mickael Maison 
wrote:

> Hi,
>
> With the current timeline for 3.7, I tend to agree with Viktor that
> JBOD support in KRaft is unlikely to receive the extensive testing
> this feature needs before releasing. And that's not counting the
> testing tasks left to do in
> https://issues.apache.org/jira/browse/KAFKA-14127.
>
> I'm fine sticking to the current 3.7 timeline but I'd err on the safe
> side and mark JBOD as early access to avoid major issues. Kafka is
> known for its robustness and resiliency and we certainly don't want to
> lose the trust we gained over years.
>
> Thanks,
> Mickael
>
> On Wed, Dec 20, 2023 at 12:24 AM Ismael Juma  wrote:
> >
> > Hi Viktor,
> >
> > Extending the code freeze doesn't help stabilize things. If we have
> > important bugs for JBOD, we should mark those as blockers and we'll wait
> > until they are fixed if the fixes won't take too long (as usual).
> >
> > Ismael
> >
> > On Tue, Dec 19, 2023 at 11:58 AM Viktor Somogyi-Vass
> >  wrote:
> >
> > > Hi all,
> > >
> > > I was wondering what people think about extending the code freeze date
> to
> > > early January?
> > > The reason I'm asking is that there are still a couple of testing gaps
> in
> > > JBOD (https://issues.apache.org/jira/browse/KAFKA-14127) which I
> think is
> > > very important to finish to ensure a high quality release (after all
> this
> > > supposed to be the last 3.x) and secondly the year end holidays for
> many
> > > people are coming fast, which means we'll likely have less people
> working
> > > on testing and validation. In my opinion it would strengthen the
> release if
> > > we could spend a week in January to really finish off JBOD and do a 2
> week
> > > stabilization.
> > >
> > > What do you all think?
> > >
> > > Best,
> > > Viktor
> > >
> > > On Tue, Dec 12, 2023 at 2:59 PM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Hey!
> > > >
> > > > Just notifying everybody on this thread that I have cut the 3.7
> branch
> > > and
> > > > sent a new email thread titled "New Release Branch 3.7" to the
> mailing
> > > list
> > > > <https://lists.apache.org/thread/4j87m12fm3bgq01fgphtkfb41s56w6hh>.
> > > >
> > > > Best,
> > > > Stanislav
> > > >
> > > > On Wed, Dec 6, 2023 at 11:10 AM Stanislav Kozlovski <
> > > > stanis...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hello again,
> > > > >
> > > > > Time is flying by! It is feature freeze day!
> > > > >
> > > > > By today, we expect to have major features merged and to begin
> working
> > > on
> > > > > their stabili

[jira] [Resolved] (KAFKA-12679) Rebalancing a restoring or running task may cause directory livelocking with newly created task

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-12679.
-
Resolution: Fixed

Marking this as done as per Lucas' comment that this is solved

 

> Rebalancing a restoring or running task may cause directory livelocking with 
> newly created task
> ---
>
> Key: KAFKA-12679
> URL: https://issues.apache.org/jira/browse/KAFKA-12679
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.6.1
> Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
>Reporter: Peter Nahas
>Assignee: Lucas Brutschy
>Priority: Major
> Fix For: 3.7.0
>
> Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running 
> state and the task gets rebalanced to a separate thread on the same instance, 
> the newly created task will attempt to lock the state store directory while 
> the first thread is continuing to use it. This is totally normal and expected 
> behavior when the first thread is not yet aware of the rebalance. However, 
> that newly created task is effectively running a while loop with no backoff 
> waiting to lock the directory:
>  # TaskManager tells the task to restore in `tryToCompleteRestoration`
>  # The task attempts to lock the directory
>  # The lock attempt fails and throws a 
> `org.apache.kafka.streams.errors.LockException`
>  # TaskManager catches the exception, stops further processing on the task 
> and reports that not all tasks have restored
>  # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicate that there is supposed to be a backoff 
> when this condition occurs, but there does not appear to be any in the code. 
> The result is that if this goes on for long enough, the lock-loop may 
> dominate CPU usage in the process and starve out the old stream thread task 
> processing.
>  
> When in this state, the DEBUG level logging for TaskManager will produce a 
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager   
>   : stream-thread [StreamThread-10] Could not initialize 0_34 due 
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread 
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for 
> task 0_34
> {noformat}
>  
>  
> I've attached a git formatted patch to resolve the issue. Simply detect the 
> scenario and sleep for the backoff time in the appropriate StreamThread.
>  
>  
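
As an illustration of the fix described above, here is a minimal, hedged sketch of backing off between directory-lock attempts; the class, method and parameter names are hypothetical and this is not the attached patch:

{code:java}
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Minimal sketch of the backoff idea from the description above (not the attached
// patch): sleep between directory-lock attempts instead of retrying in a tight loop.
public final class LockBackoffSketch {

    static boolean initializeWithBackoff(BooleanSupplier tryLockStateDirectory,
                                         Duration backoff,
                                         int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLockStateDirectory.getAsBoolean()) {
                return true;                  // lock acquired, restoration can proceed
            }
            Thread.sleep(backoff.toMillis()); // previously: immediate retry, which starves the old thread
        }
        return false;                         // still locked; caller retries on the next run loop pass
    }
}
{code}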



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15147) Measure pending and outstanding Remote Segment operations

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-15147.
-
Resolution: Fixed

> Measure pending and outstanding Remote Segment operations
> -
>
> Key: KAFKA-15147
> URL: https://issues.apache.org/jira/browse/KAFKA-15147
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Christo Lolov
>Priority: Major
>  Labels: tiered-storage
> Fix For: 3.7.0
>
>
>  
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage
>  
> KAFKA-15833: RemoteCopyLagBytes 
> KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, 
> RemoteDeleteLagSegments
> KAFKA-16013: ExpiresPerSec
> KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, 
> RemoteLogMetadataCount
> KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, 
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> 
> Remote Log Segment operations (copy/delete) are executed by the Remote 
> Storage Manager, and recorded by Remote Log Metadata Manager (e.g. default 
> TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
> segments).
> As executions run, fail, and retry, it will be important to know how many 
> operations are pending and outstanding over time to alert operators.
> Pending operations are not enough to alert, as values can oscillate closer to 
> zero. An additional condition needs to apply (running time > threshold) to 
> consider an operation outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps 
> (pendingSegmentCopies, pendingSegmentDeletes) `Map[Uuid, Long]` to record, per 
> segmentId, the time when the operation started, and based on this expose 2 metrics per 
> operation:
>  * pendingSegmentCopies: gauge of pendingSegmentCopies map
>  * outstandingSegmentCopies: loop over pending ops, and if now - startedTime 
> > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage? or better to solve on a 
> custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!
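
To make the proposal concrete, here is a rough sketch of one such map and the derived gauge values, assuming java.util.UUID in place of Kafka's Uuid; the class and method names are illustrative, not the eventual RemoteLogManager API:

{code:java}
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the proposal: track when each segment copy started and derive
// "pending" and "outstanding" gauge values from it. Delete operations would get a
// mirror-image pendingSegmentDeletes map.
public final class PendingSegmentOpsSketch {

    private final Map<UUID, Long> pendingSegmentCopies = new ConcurrentHashMap<>();

    void copyStarted(UUID segmentId, long nowMs) {
        pendingSegmentCopies.put(segmentId, nowMs);
    }

    void copyFinished(UUID segmentId) {
        pendingSegmentCopies.remove(segmentId);
    }

    // Gauge: how many copies are currently in flight.
    int pendingSegmentCopiesCount() {
        return pendingSegmentCopies.size();
    }

    // Gauge: copies that have been running longer than the threshold, i.e. "outstanding".
    long outstandingSegmentCopies(long nowMs, long thresholdMs) {
        return pendingSegmentCopies.values().stream()
                .filter(startedMs -> nowMs - startedMs > thresholdMs)
                .count();
    }
}
{code}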



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15327) Client consumer should commit offsets on close

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-15327.
-
Resolution: Fixed

Resolving since this was merged

 

> Client consumer should commit offsets on close
> --
>
> Key: KAFKA-15327
> URL: https://issues.apache.org/jira/browse/KAFKA-15327
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Philip Nee
>Priority: Major
>  Labels: kip-848, kip-848-client-support, kip-848-preview
> Fix For: 3.7.0
>
>
> In the current implementation of the KafkaConsumer, the ConsumerCoordinator 
> commits offsets before the consumer is closed, with a call to 
> maybeAutoCommitOffsetsSync(timer);
> The async consumer should provide the same behaviour to commit offsets on 
> close. 
> This fix should allow to successfully run the following integration tests 
> (defined in PlaintextConsumerTest)
>  * testAutoCommitOnClose
>  * testAutoCommitOnCloseAfterWakeup
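
As a hedged sketch of the intended close-time behaviour (the names below are stand-ins, not the async consumer's real internals), the idea is simply a bounded, best-effort synchronous commit before resources are released:

{code:java}
import java.util.Map;

// Hedged sketch only: mirror the ConsumerCoordinator's maybeAutoCommitOffsetsSync(timer)
// behaviour in the async consumer's close path. OffsetCommitter, autoCommitEnabled and
// consumedOffsets are illustrative stand-ins.
final class AutoCommitOnCloseSketch {

    interface OffsetCommitter {
        void commitSync(Map<String, Long> offsetsByPartition, long timeoutMs);
    }

    private final boolean autoCommitEnabled;
    private final OffsetCommitter committer;

    AutoCommitOnCloseSketch(boolean autoCommitEnabled, OffsetCommitter committer) {
        this.autoCommitEnabled = autoCommitEnabled;
        this.committer = committer;
    }

    void close(Map<String, Long> consumedOffsets, long closeTimeoutMs) {
        if (autoCommitEnabled && !consumedOffsets.isEmpty()) {
            // Best-effort commit bounded by the close timeout, before releasing resources.
            committer.commitSync(consumedOffsets, closeTimeoutMs);
        }
        // ... then close the network client, coordinator, etc. within the remaining time.
    }
}
{code}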



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15780) Wait for consistent kraft metadata when creating topics in tests

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-15780.
-
Resolution: Fixed

Resolving since this was merged. Nice work!

 

> Wait for consistent kraft metadata when creating topics in tests
> 
>
> Key: KAFKA-15780
> URL: https://issues.apache.org/jira/browse/KAFKA-15780
> Project: Kafka
>  Issue Type: Test
>Reporter: David Mao
>Assignee: David Mao
>Priority: Minor
> Fix For: 3.7.0
>
>
> Tests occasionally flake when not retrying stale metadata in KRaft mode.
> I suspect that the root cause is that TestUtils.createTopicWithAdmin waits 
> for partitions to be present in the metadata cache but does not wait for the 
> metadata to be fully published to the broker.
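
For illustration only (a sketch against the public Admin API, not the actual TestUtils change), the kind of wait the description suggests looks roughly like this: keep polling until every partition of the new topic reports a leader, treating "topic not found yet" as "not ready":

{code:java}
import java.util.List;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

// Sketch only: poll until every partition of the newly created topic has a leader,
// rather than returning as soon as the topic shows up in a metadata cache.
final class WaitForTopicMetadataSketch {

    static void waitForLeaders(Admin admin, String topic, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try {
                TopicDescription description =
                        admin.describeTopics(List.of(topic)).allTopicNames().get().get(topic);
                boolean allHaveLeaders = description.partitions().stream()
                        .allMatch(partition -> partition.leader() != null);
                if (allHaveLeaders) {
                    return;
                }
            } catch (ExecutionException notReadyYet) {
                // e.g. UnknownTopicOrPartitionException while metadata is still propagating
            }
            Thread.sleep(100); // back off before re-checking
        }
        throw new AssertionError("Timed out waiting for leaders of topic " + topic);
    }
}
{code}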



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15817) Avoid reconnecting to the same IP address if multiple addresses are available

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-15817.
-
Resolution: Fixed

Resolving since this was merged. Good job!

 

> Avoid reconnecting to the same IP address if multiple addresses are available
> -
>
> Key: KAFKA-15817
> URL: https://issues.apache.org/jira/browse/KAFKA-15817
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1
>Reporter: Bob Barrett
>Assignee: Bob Barrett
>Priority: Major
> Fix For: 3.7.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-12193, we changed the DNS 
> resolution behavior for clients to re-resolve DNS after disconnecting from a 
> broker, rather than wait until we iterated over all addresses from a given 
> resolution. This is useful when the IP addresses have changed between the 
> connection and disconnection.
> However, with the behavior change, this does mean that clients could 
> potentially reconnect immediately to the same IP they just disconnected from, 
> if the IPs have not changed. In cases where the disconnection happened 
> because that IP was unhealthy (such as a case where a load balancer has 
> instances in multiple availability zones and one zone is unhealthy, or a case 
> where an intermediate component in the network path is going through a 
> rolling restart), this will delay the client successfully reconnecting. To 
> address this, clients should remember the IP they just disconnected from and 
> skip that IP when reconnecting, as long as the hostname resolved to multiple 
> addresses.
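
A hedged sketch of the reordering idea (illustrative only, not the client's actual connection-state code): when the hostname resolved to more than one address, push the address we just disconnected from to the end of the list so it is tried last:

{code:java}
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;

// Sketch only: prefer the other resolved addresses over the one we just disconnected from.
final class ReconnectAddressSelectionSketch {

    static List<InetAddress> preferOtherAddresses(List<InetAddress> resolved,
                                                  InetAddress lastDisconnected) {
        if (resolved.size() <= 1 || lastDisconnected == null) {
            return resolved; // no alternative exists, or nothing to avoid
        }
        List<InetAddress> preferred = new ArrayList<>();
        List<InetAddress> avoided = new ArrayList<>();
        for (InetAddress address : resolved) {
            (address.equals(lastDisconnected) ? avoided : preferred).add(address);
        }
        preferred.addAll(avoided); // the just-disconnected address becomes the last resort
        return preferred;
    }
}
{code}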



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16007) ZK migrations can be slow for large clusters

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-16007.
-
Resolution: Fixed

Closing since it's merged

 

> ZK migrations can be slow for large clusters
> 
>
> Key: KAFKA-16007
> URL: https://issues.apache.org/jira/browse/KAFKA-16007
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, kraft
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Minor
> Fix For: 3.7.0, 3.6.2
>
>
> On a large cluster with many single-partition topics, the ZK to KRaft 
> migration took nearly half an hour:
> {code}
> [KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
> to KRaft. 157396 records were generated in 2245862 ms across 67132 batches. 
> The record types were {TOPIC_RECORD=66282, PARTITION_RECORD=72067, 
> CONFIG_RECORD=17116, PRODUCER_IDS_RECORD=1, 
> ACCESS_CONTROL_ENTRY_RECORD=1930}. The current metadata offset is now 332267 
> with an epoch of 19. Saw 36 brokers in the migrated metadata [0, 1, 2, 3, 4, 
> 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
> 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].
> {code}
> This is a result of how we generate batches of records when traversing the ZK 
> tree. Since we are now using metadata transactions for the migration, we can 
> re-batch these without any consistency problems.
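
The numbers in the log above (157396 records across 67132 batches, i.e. only a couple of records per batch) hint at the fix: accumulate records into fewer, larger batches while traversing the ZK tree. A generic sketch of that re-batching, with illustrative names and sizes, not the actual KRaftMigrationDriver change:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch only: accumulate records into batches of a target size instead of emitting one
// small batch per ZK node visited. R stands in for the migration's metadata records.
final class MigrationRebatcherSketch<R> {

    private final int targetBatchSize;
    private final List<R> currentBatch = new ArrayList<>();
    private final List<List<R>> completedBatches = new ArrayList<>();

    MigrationRebatcherSketch(int targetBatchSize) {
        this.targetBatchSize = targetBatchSize;
    }

    void add(R record) {
        currentBatch.add(record);
        if (currentBatch.size() >= targetBatchSize) {
            completedBatches.add(new ArrayList<>(currentBatch));
            currentBatch.clear();
        }
    }

    List<List<R>> finish() {
        if (!currentBatch.isEmpty()) {
            completedBatches.add(new ArrayList<>(currentBatch));
            currentBatch.clear();
        }
        return completedBatches;
    }
}
{code}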



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Kafka trunk test & build stability

2023-12-26 Thread Stanislav Kozlovski
> Execute(ExecutorPolicy.java:64)
>   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>   at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: java.lang.IllegalArgumentException
>   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>   ... 6 more
> org.gradle.internal.remote.internal.ConnectException: Could not connect to server
> [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> Caused by: java.net.ConnectException: Connection refused
>   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
>   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
>   at java.base/sun.nio.ch.SocketChannelImpl.blocki

[jira] [Resolved] (KAFKA-15818) Implement max poll interval

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-15818.
-
Resolution: Fixed

> Implement max poll interval
> ---
>
> Key: KAFKA-15818
> URL: https://issues.apache.org/jira/browse/KAFKA-15818
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Blocker
>  Labels: consumer-threading-refactor, kip-848-client-support, 
> kip-848-e2e, kip-848-preview
> Fix For: 3.7.0
>
>
> The consumer needs to be polled at a cadence shorter than 
> max.poll.interval.ms; otherwise the consumer should try to leave the group.  
> Currently, we send an acknowledgment event to the network thread per poll.  
> The event only triggers an update of the autocommit state; we still need to 
> update the poll timer so that the consumer can leave the group when the 
> timer expires. 
>  
> The current logic looks like this:
> {code:java}
> if (heartbeat.pollTimeoutExpired(now)) {
>     // the poll timeout has expired, which means that the foreground thread has stalled
>     // in between calls to poll().
>     log.warn("consumer poll timeout has expired. This means the time between subsequent calls to poll() " +
>         "was longer than the configured max.poll.interval.ms, which typically implies that " +
>         "the poll loop is spending too much time processing messages. You can address this " +
>         "either by increasing max.poll.interval.ms or by reducing the maximum size of batches " +
>         "returned in poll() with max.poll.records.");
>     maybeLeaveGroup("consumer poll timeout has expired.");
> }
> {code}
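
A hedged sketch of the missing piece described above (names are illustrative, not the consumer's actual internals): each poll() resets a timer sized by max.poll.interval.ms, and the background thread checks it so the member can proactively leave the group once it expires, mirroring the maybeLeaveGroup call in the snippet:

{code:java}
import java.time.Duration;

// Sketch only: a poll-interval timer reset by the application thread on every poll()
// (e.g. via a poll event sent to the background thread) and checked by the background
// thread's run loop.
final class PollIntervalTimerSketch {

    private final long maxPollIntervalMs;
    private volatile long lastPollMs;

    PollIntervalTimerSketch(Duration maxPollInterval, long nowMs) {
        this.maxPollIntervalMs = maxPollInterval.toMillis();
        this.lastPollMs = nowMs;
    }

    void onPoll(long nowMs) {
        lastPollMs = nowMs; // reset on every poll event
    }

    boolean expired(long nowMs) {
        return nowMs - lastPollMs > maxPollIntervalMs; // if true, the member should leave the group
    }
}
{code}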



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16026) AsyncConsumer does not send a poll event to the background thread

2023-12-26 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-16026.
-
Resolution: Fixed

https://github.com/apache/kafka/pull/15035

 

> AsyncConsumer does not send a poll event to the background thread
> -
>
> Key: KAFKA-16026
> URL: https://issues.apache.org/jira/browse/KAFKA-16026
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Blocker
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> consumer poll does not send a poll event to the background thread to:
>  # trigger autocommit
>  # reset max poll interval timer
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-975 Docker Image for Apache Kafka

2023-12-26 Thread Stanislav Kozlovski
Hey all,

As the release manager for 3.7.0, I am pretty interested to know if we
should consider this a blocker.

Do we have clarity as to whether users could practically rely on this Go
script? From a shallow look, it's only used in one line in the Dockerfile.
I guess the downside is that images extending ours would have to ship with
Golang. But in theory, once we remove it - it shouldn't be problematic
unless they extended our image, relied on the assumption that Golang was
present, and used some other things in their own Dockerfile that depended on
it?

It sounds a bit minor. In the interest of the release, I would prefer we
ship with this Go script in 3.7, and change it behind the scenes in the
next release.

Thoughts?


On Wed, Dec 20, 2023 at 11:30 PM Ismael Juma  wrote:

> We should be very clear on what users can rely on when it comes to the
> docker images (i.e. what are public interfaces) and what are implementation
> details (and can be changed whenever we want). That's the only way to have
> a maintainable system. Same way we make changes to internal classes even
> though users can (and some do) rely on them.
>
> Ismael
>
> On Wed, Dec 20, 2023 at 10:55 AM Mickael Maison 
> wrote:
>
> > Hi,
> >
> > Yes changes have to be merged by a committer but for this kind of
> > decisions it's best if it's seen by more than one.
> >
> > > Hmm, is this a blocker? I don't see why. It would be nice to include it
> > in 3.7 and we have time, so I'm fine with that.
> > Sure, it's not a blocker in the usual sense. But if we ship this Go
> > binary it's possible users extending our images will start depending
> > on it. Since we want to get rid of it, I'd prefer if we never shipped
> > it.
> >
> > Thanks,
> > Mickael
> >
> >
> > On Wed, Dec 20, 2023 at 4:28 PM Ismael Juma  wrote:
> > >
> > > Hi Mickael,
> > >
> > > A couple of comments inline.
> > >
> > > On Wed, Dec 20, 2023 at 3:34 AM Mickael Maison <
> mickael.mai...@gmail.com
> > >
> > > wrote:
> > >
> > > > When you say, "we have opted to take a different approach", who is
> > > > "we"? I think this decision should be made by the committers.
> > > >
> > >
> > > Changes can only be merged by committers, so I think it's implicit that
> > at
> > > least one committer would have to agree. :) I think Vedarth was simply
> > > saying that the group working on the KIP had a new proposal that
> > addressed
> > > all the goals in a better way than the original proposal.
> > >
> > > I marked the Jira (https://issues.apache.org/jira/browse/KAFKA-16016)
> > > > as a blocker for 3.7 as I think we need to make this decision before
> > > > releasing the docker images.
> > > >
> > >
> > > Hmm, is this a blocker? I don't see why. It would be nice to include it
> > in
> > > 3.7 and we have time, so I'm fine with that.
> > >
> > > Ismael
> >
>


-- 
Best,
Stanislav


Re: Kafka trunk test & build stability

2023-12-19 Thread Stanislav Kozlovski
Hey Николай,

Apologies about this - I wasn't aware of this behavior. I have made all the
gists public.



On Wed, Dec 20, 2023 at 12:09 AM Greg Harris 
wrote:

> Hey Stan,
>
> Thanks for opening the discussion. I haven't been looking at overall
> build duration recently, so it's good that you are calling it out.
>
> I worry about us over-indexing on this one build, which itself appears
> to be an outlier. I only see one other build [1] above 6h overall in
> the last 90 days in this view: [2]
> And I don't see any overlap of failed tests in these two builds, which
> makes it less likely that these particular failed tests are the causes
> of long build times.
>
> Separately, I've been investigating build environment slowness, and
> trying to connect it with test failures [3]. I observed that the CI
> build environment is 2-20 times slower than my developer machine (M1
> mac).
> When I simulate a similar slowdown locally, there are tests which
> become significantly more flakey, often due to hard-coded timeouts.
> I think that these particularly nasty builds could be explained by
> long-tail slowdowns causing arbitrary tests to take an excessive time
> to execute.
>
> Rather than trying to find signals in these rare test failures, I
> think we should find tests that have these sorts of failures more
> regularly.
> There are lots of builds in the 5-6h duration bracket, which is
> certainly unacceptably long. We should look into these builds to find
> improvements and optimizations.
>
> [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> [2]
> https://ge.apache.org/scans?list.sortColumn=buildDuration=P90D=kafka=trunk=America%2FNew_York
> [3] https://github.com/apache/kafka/pull/15008
>
> Thanks for looking into this!
> Greg
>
> On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков 
> wrote:
> >
> > Hello, Stanislav.
> >
> > Can you, please, make the gist public.
> > Private gists not available for some GitHub users even if link are known.
> >
> > > 19 дек. 2023 г., в 17:33, Stanislav Kozlovski 
> > > 
> написал(а):
> > >
> > > Hey everybody,
> > > I've heard various complaints that build times in trunk are taking too
> > > long, some taking as much as 8 hours (the timeout) - and this is
> slowing us
> > > down from being able to meet the code freeze deadline for 3.7.
> > >
> > > I took it upon myself to gather up some data in Gradle Enterprise to
> see if
> > > there are any outlier tests that are causing this slowness. Turns out
> there
> > > are a few, in this particular build -
> https://ge.apache.org/s/un2hv7n6j374k/
> > > - which took 10 hours and 29 minutes in total.
> > >
> > > I have compiled the tests that took a disproportionately large amount
> of
> > > time (20m+), alongside their time, error message and a link to their
> full
> > > log output here -
> > >
> https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > >
> > > It includes failures from core, streams, storage and clients.
> > > Interestingly, some other tests that don't fail also take a long time
> in
> > > what is apparently the test harness framework. See the gist for more
> > > information.
> > >
> > > I am starting this thread with the intention of getting the discussion
> > > started and brainstorming what we can do to get the build times back
> under
> > > control.
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> >
>


-- 
Best,
Stanislav


Kafka trunk test & build stability

2023-12-19 Thread Stanislav Kozlovski
Hey everybody,
I've heard various complaints that build times in trunk are taking too
long, some taking as much as 8 hours (the timeout) - and this is slowing us
down from being able to meet the code freeze deadline for 3.7.

I took it upon myself to gather up some data in Gradle Enterprise to see if
there are any outlier tests that are causing this slowness. Turns out there
are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
- which took 10 hours and 29 minutes in total.

I have compiled the tests that took a disproportionately large amount of
time (20m+), alongside their time, error message and a link to their full
log output here -
https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2

It includes failures from core, streams, storage and clients.
Interestingly, some other tests that don't fail also take a long time in
what is apparently the test harness framework. See the gist for more
information.

I am starting this thread with the intention of getting the discussion
started and brainstorming what we can do to get the build times back under
control.


--
Best,
Stanislav


Re: New Release Branch 3.7

2023-12-14 Thread Stanislav Kozlovski
Hey all,

(thanks to Josep for reviewing the 3.8 bump PR)

I have two more PRs to get reviewed regarding the release:
- targeting trunk: MINOR: Update documentation.html with the 3.7 release
#15010 <https://github.com/apache/kafka/pull/15010>
- targeting 3.7: MINOR: Update documentation.html with the 3.7 release
#15011 <https://github.com/apache/kafka/pull/15011>

Additionally, I have one ask:
- if you are reading this message, can you double-check the list of KIPs
being released in the Release Page
<https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0> and
if you recognize any KIP you are involved it - can you ensure that the data
in the page & the associated JIRA/KIP (target release, merge status, vote
status) is up to date?

I have myself gone over the list of KIPs and a bit of the commit history. I
think KIPs:
- KIP-998: Give ProducerConfig(props, doLog) constructor protected access
- KIP-938: Add more metrics for measuring KRaft performance
are slipping this release.

By far our biggest feature this release, KIP-848 The Next Generation of the
Consumer Rebalance Protocol, is a bit on the border. The main PR that KIP
is dependent on is this KAFKA-15456: Commit/Fetch error handling
improvements and V9 support #14557
<https://github.com/apache/kafka/pull/14557>.
While it's a gray area, I am weighing this change more on the stabilization
side and allowing it in after feature freeze for two
chief reasons: a) it fixes issues that were recently discovered, hence can
be marked as stabilization work and b) it only touches the code path used
for Early Access, and not the existing production code path (hence doesn't
substantially risk the release)

Additionally, KIP-858 has one minor but important change pending
<https://github.com/apache/kafka/pull/14984>.

With that, I remind you that there are only 6 days to code freeze! It's not
long until we will have our very first Apache Kafka 3.7 RC! Let's get this
shipped.

Best,
Stanislav

On Tue, Dec 12, 2023 at 2:56 PM Stanislav Kozlovski 
wrote:

> Hello Kafka developers and friends,
>
> As promised, we now have a release branch for 3.7 release.
> Trunk is being bumped to 3.8.0-SNAPSHOT (please help review the PR 
> <https://github.com/apache/kafka/pull/14993>).
>
> I'll be going over the JIRAs to move every non-blocker from this release to 
> the next release.
>
> From this point, most changes should go to trunk.
> *- Blockers (existing and new that we discover while testing the release) 
> will be double-committed.*
> *- Please discuss with your reviewer whether your PR should go to trunk or to 
> trunk+release so they can merge accordingly.*
> *- Please help us test the release!*
>
> Thanks!
>
> --
> Best,
> Stanislav
>


-- 
Best,
Stanislav


Re: Apache Kafka 3.7.0 Release

2023-12-12 Thread Stanislav Kozlovski
Hey!

Just notifying everybody on this thread that I have cut the 3.7 branch and
sent a new email thread titled "New Release Branch 3.7" to the mailing list
<https://lists.apache.org/thread/4j87m12fm3bgq01fgphtkfb41s56w6hh>.

Best,
Stanislav

On Wed, Dec 6, 2023 at 11:10 AM Stanislav Kozlovski 
wrote:

> Hello again,
>
> Time is flying by! It is feature freeze day!
>
> By today, we expect to have major features merged and to begin working on
> their stabilisation. Minor features should have PRs.
>
> I am planning to cut the release branch soon - on Monday EU daytime. When
> I do that, I will create a new e-mail thread titled "New release branch
> 3.7.0" to notify you, so be on the lookout for that. I will also notify
> this thread.
>
> Thank you for your contributions. Let's get this release shipped!
>
> Best,
> Stanislav
>
>
> On Fri, Nov 24, 2023 at 6:11 PM Stanislav Kozlovski <
> stanis...@confluent.io> wrote:
>
>> Hey all,
>>
>> The KIP Freeze has passed. I count 31 KIPs that will be going into the
>> 3.7 Release. Thank you all for your hard work!
>>
>> They are the following (some of these were accepted in previous releases
>> and have minor parts going out, some targeting a Preview release and the
>> rest being fully released as regular.):
>>  - KIP-1000: List Client Metrics Configuration Resources
>>  - KIP-1001: Add CurrentControllerId Metric
>>  - KIP-405: Kafka Tiered Storage
>>  - KIP-580: Exponential Backoff for Kafka Clients
>>  - KIP-714: Client metrics and observability
>>  - KIP-770: Replace "buffered.records.per.partition" &
>> "cache.max.bytes.buffering" with
>> "{statestore.cache}/{input.buffer}.max.bytes"
>>  - KIP-848: The Next Generation of the Consumer Rebalance Protocol
>>  - KIP-858: Handle JBOD broker disk failure in KRaft
>>  - KIP-890: Transactions Server-Side Defense
>>  - KIP-892: Transactional StateStores
>>  - KIP-896: Remove old client protocol API versions in Kafka 4.0 -
>> metrics/request log changes to identify deprecated apis
>>  - KIP-925: Rack aware task assignment in Kafka Streams
>>  - KIP-938: Add more metrics for measuring KRaft performance
>>  - KIP-951 - Leader discovery optimizations for the client
>>  - KIP-954: expand default DSL store configuration to custom types
>>  - KIP-959: Add BooleanConverter to Kafka Connect
>>  - KIP-960: Single-key single-timestamp IQv2 for state stores
>>  - KIP-963: Additional metrics in Tiered Storage
>>  - KIP-968: Support single-key_multi-timestamp Interactive Queries (IQv2)
>> for Versioned State Stores
>>  - KIP-970: Deprecate and remove Connect's redundant task configurations
>> endpoint
>>  - KIP-975: Docker Image for Apache Kafka
>>  - KIP-976: Cluster-wide dynamic log adjustment for Kafka Connect
>>  - KIP-978: Allow dynamic reloading of certificates with different DN /
>> SANs
>>  - KIP-979: Allow independently stop KRaft processes
>>  - KIP-980: Allow creating connectors in a stopped state
>>  - KIP-985: Add reverseRange and reverseAll query over kv-store in IQv2
>>  - KIP-988: Streams Standby Update Listener
>>  - KIP-992: Proposal to introduce IQv2 Query Types: TimestampedKeyQuery
>> and TimestampedRangeQuery
>>  - KIP-998: Give ProducerConfig(props, doLog) constructor protected access
>>
>> Notable KIPs that didn't make the Freeze were KIP-977 - it only got 2/3
>> votes.
>>
>> For the full list and latest source of truth, refer to the Release Plan
>> 3.7.0 Document
>> <https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0>.
>>
>> Thanks for your contributions once again!
>> Best,
>> Stan
>>
>>
>> On Thu, Nov 23, 2023 at 2:27 PM Nick Telford 
>> wrote:
>>
>>> Hi Stan,
>>>
>>> I'd like to propose including KIP-892 in the 3.7 release. The KIP has
>>> been
>>> accepted and I'm just working on rebasing the implementation against
>>> trunk
>>> before I open a PR.
>>>
>>> Regards,
>>> Nick
>>>
>>> On Tue, 21 Nov 2023 at 11:27, Mayank Shekhar Narula <
>>> mayanks.nar...@gmail.com> wrote:
>>>
>>> > Hi Stan
>>> >
>>> > Can you include KIP-951 to the 3.7 release plan? All PRs are merged in
>>> the
>>> > trunk.
>>> >
>>> > On Wed, Nov 15, 2023 at 4:05 PM Stanislav Kozlovski
>>> >  wrote:
>>> >
>>> > > Friendly reminder to everybody that the KIP Freeze is *ex

New Release Branch 3.7

2023-12-12 Thread Stanislav Kozlovski
Hello Kafka developers and friends,

As promised, we now have a release branch for 3.7 release.
Trunk is being bumped to 3.8.0-SNAPSHOT (please help review the PR
).

I'll be going over the JIRAs to move every non-blocker from this
release to the next release.

>From this point, most changes should go to trunk.
*- Blockers (existing and new that we discover while testing the
release) will be double-committed.*
*- Please discuss with your reviewer whether your PR should go to
trunk or to trunk+release so they can merge accordingly.*
*- Please help us test the release!*

Thanks!

-- 
Best,
Stanislav


Re: Apache Kafka 3.7.0 Release

2023-12-06 Thread Stanislav Kozlovski
Hello again,

Time is flying by! It is feature freeze day!

By today, we expect to have major features merged and to begin working on
their stabilisation. Minor features should have PRs.

I am planning to cut the release branch soon - on Monday EU daytime. When I
do that, I will create a new e-mail thread titled "New release branch
3.7.0" to notify you, so be on the lookout for that. I will also notify
this thread.

Thank you for your contributions. Let's get this release shipped!

Best,
Stanislav


On Fri, Nov 24, 2023 at 6:11 PM Stanislav Kozlovski 
wrote:

> Hey all,
>
> The KIP Freeze has passed. I count 31 KIPs that will be going into the 3.7
> Release. Thank you all for your hard work!
>
> They are the following (some of these were accepted in previous releases
> and have minor parts going out, some targeting a Preview release and the
> rest being fully released as regular.):
>  - KIP-1000: List Client Metrics Configuration Resources
>  - KIP-1001: Add CurrentControllerId Metric
>  - KIP-405: Kafka Tiered Storage
>  - KIP-580: Exponential Backoff for Kafka Clients
>  - KIP-714: Client metrics and observability
>  - KIP-770: Replace "buffered.records.per.partition" &
> "cache.max.bytes.buffering" with
> "{statestore.cache}/{input.buffer}.max.bytes"
>  - KIP-848: The Next Generation of the Consumer Rebalance Protocol
>  - KIP-858: Handle JBOD broker disk failure in KRaft
>  - KIP-890: Transactions Server-Side Defense
>  - KIP-892: Transactional StateStores
>  - KIP-896: Remove old client protocol API versions in Kafka 4.0 -
> metrics/request log changes to identify deprecated apis
>  - KIP-925: Rack aware task assignment in Kafka Streams
>  - KIP-938: Add more metrics for measuring KRaft performance
>  - KIP-951 - Leader discovery optimizations for the client
>  - KIP-954: expand default DSL store configuration to custom types
>  - KIP-959: Add BooleanConverter to Kafka Connect
>  - KIP-960: Single-key single-timestamp IQv2 for state stores
>  - KIP-963: Additional metrics in Tiered Storage
>  - KIP-968: Support single-key_multi-timestamp Interactive Queries (IQv2)
> for Versioned State Stores
>  - KIP-970: Deprecate and remove Connect's redundant task configurations
> endpoint
>  - KIP-975: Docker Image for Apache Kafka
>  - KIP-976: Cluster-wide dynamic log adjustment for Kafka Connect
>  - KIP-978: Allow dynamic reloading of certificates with different DN /
> SANs
>  - KIP-979: Allow independently stop KRaft processes
>  - KIP-980: Allow creating connectors in a stopped state
>  - KIP-985: Add reverseRange and reverseAll query over kv-store in IQv2
>  - KIP-988: Streams Standby Update Listener
>  - KIP-992: Proposal to introduce IQv2 Query Types: TimestampedKeyQuery
> and TimestampedRangeQuery
>  - KIP-998: Give ProducerConfig(props, doLog) constructor protected access
>
> Notable KIPs that didn't make the Freeze were KIP-977 - it only got 2/3
> votes.
>
> For the full list and latest source of truth, refer to the Release Plan
> 3.7.0 Document
> <https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0>.
>
> Thanks for your contributions once again!
> Best,
> Stan
>
>
> On Thu, Nov 23, 2023 at 2:27 PM Nick Telford 
> wrote:
>
>> Hi Stan,
>>
>> I'd like to propose including KIP-892 in the 3.7 release. The KIP has been
>> accepted and I'm just working on rebasing the implementation against trunk
>> before I open a PR.
>>
>> Regards,
>> Nick
>>
>> On Tue, 21 Nov 2023 at 11:27, Mayank Shekhar Narula <
>> mayanks.nar...@gmail.com> wrote:
>>
>> > Hi Stan
>> >
>> > Can you include KIP-951 to the 3.7 release plan? All PRs are merged in
>> the
>> > trunk.
>> >
>> > On Wed, Nov 15, 2023 at 4:05 PM Stanislav Kozlovski
>> >  wrote:
>> >
>> > > Friendly reminder to everybody that the KIP Freeze is *exactly 7 days
>> > away*
>> > > - November 22.
>> > >
>> > > A KIP must be accepted by this date in order to be considered for this
>> > > release. Note, any KIP that may not be implemented in time, or
>> otherwise
>> > > risks heavily destabilizing the release, should be deferred.
>> > >
>> > > Best,
>> > > Stan
>> > >
>> > > On Fri, Nov 3, 2023 at 6:03 AM Sophie Blee-Goldman <
>> > sop...@responsive.dev>
>> > > wrote:
>> > >
>> > > > Looks great, thank you! +1
>> > > >
>> > > > On Thu, Nov 2, 2023 at 10:21 AM David Jacot
>> > > > > >
>> > > > wrote:
>>

Re: Apache Kafka 3.7.0 Release

2023-11-24 Thread Stanislav Kozlovski
Hey all,

The KIP Freeze has passed. I count 31 KIPs that will be going into the 3.7
Release. Thank you all for your hard work!

They are the following (some of these were accepted in previous releases
and have minor parts going out, some targeting a Preview release and the
rest being fully released as regular.):
 - KIP-1000: List Client Metrics Configuration Resources
 - KIP-1001: Add CurrentControllerId Metric
 - KIP-405: Kafka Tiered Storage
 - KIP-580: Exponential Backoff for Kafka Clients
 - KIP-714: Client metrics and observability
 - KIP-770: Replace "buffered.records.per.partition" &
"cache.max.bytes.buffering" with
"{statestore.cache}/{input.buffer}.max.bytes"
 - KIP-848: The Next Generation of the Consumer Rebalance Protocol
 - KIP-858: Handle JBOD broker disk failure in KRaft
 - KIP-890: Transactions Server-Side Defense
 - KIP-892: Transactional StateStores
 - KIP-896: Remove old client protocol API versions in Kafka 4.0 -
metrics/request log changes to identify deprecated apis
 - KIP-925: Rack aware task assignment in Kafka Streams
 - KIP-938: Add more metrics for measuring KRaft performance
 - KIP-951 - Leader discovery optimizations for the client
 - KIP-954: expand default DSL store configuration to custom types
 - KIP-959: Add BooleanConverter to Kafka Connect
 - KIP-960: Single-key single-timestamp IQv2 for state stores
 - KIP-963: Additional metrics in Tiered Storage
 - KIP-968: Support single-key_multi-timestamp Interactive Queries (IQv2)
for Versioned State Stores
 - KIP-970: Deprecate and remove Connect's redundant task configurations
endpoint
 - KIP-975: Docker Image for Apache Kafka
 - KIP-976: Cluster-wide dynamic log adjustment for Kafka Connect
 - KIP-978: Allow dynamic reloading of certificates with different DN / SANs
 - KIP-979: Allow independently stop KRaft processes
 - KIP-980: Allow creating connectors in a stopped state
 - KIP-985: Add reverseRange and reverseAll query over kv-store in IQv2
 - KIP-988: Streams Standby Update Listener
 - KIP-992: Proposal to introduce IQv2 Query Types: TimestampedKeyQuery and
TimestampedRangeQuery
 - KIP-998: Give ProducerConfig(props, doLog) constructor protected access

Notable KIPs that didn't make the Freeze were KIP-977 - it only got 2/3
votes.

For the full list and latest source of truth, refer to the Release Plan
3.7.0 Document
<https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0>.

Thanks for your contributions once again!
Best,
Stan


On Thu, Nov 23, 2023 at 2:27 PM Nick Telford  wrote:

> Hi Stan,
>
> I'd like to propose including KIP-892 in the 3.7 release. The KIP has been
> accepted and I'm just working on rebasing the implementation against trunk
> before I open a PR.
>
> Regards,
> Nick
>
> On Tue, 21 Nov 2023 at 11:27, Mayank Shekhar Narula <
> mayanks.nar...@gmail.com> wrote:
>
> > Hi Stan
> >
> > Can you include KIP-951 to the 3.7 release plan? All PRs are merged in
> the
> > trunk.
> >
> > On Wed, Nov 15, 2023 at 4:05 PM Stanislav Kozlovski
> >  wrote:
> >
> > > Friendly reminder to everybody that the KIP Freeze is *exactly 7 days
> > away*
> > > - November 22.
> > >
> > > A KIP must be accepted by this date in order to be considered for this
> > > release. Note, any KIP that may not be implemented in time, or
> otherwise
> > > risks heavily destabilizing the release, should be deferred.
> > >
> > > Best,
> > > Stan
> > >
> > > On Fri, Nov 3, 2023 at 6:03 AM Sophie Blee-Goldman <
> > sop...@responsive.dev>
> > > wrote:
> > >
> > > > Looks great, thank you! +1
> > > >
> > > > On Thu, Nov 2, 2023 at 10:21 AM David Jacot
> >  > > >
> > > > wrote:
> > > >
> > > > > +1 from me as well. Thanks, Stan!
> > > > >
> > > > > David
> > > > >
> > > > > On Thu, Nov 2, 2023 at 6:04 PM Ismael Juma 
> > wrote:
> > > > >
> > > > > > Thanks Stanislav, +1
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Nov 2, 2023 at 7:01 AM Stanislav Kozlovski
> > > > > >  wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Given the discussion here and the lack of any pushback, I have
> > > > changed
> > > > > > the
> > > > > > > dates of the release:
> > > > > > > - KIP Freeze - *November 22 *(moved 4 days later)
> > > > > > > - Feature Freeze - *December 6 *(moved 2 days earlier)
> &g

Re: Apache Kafka 3.7.0 Release

2023-11-15 Thread Stanislav Kozlovski
Friendly reminder to everybody that the KIP Freeze is *exactly 7 days away*
- November 22.

A KIP must be accepted by this date in order to be considered for this
release. Note, any KIP that may not be implemented in time, or otherwise
risks heavily destabilizing the release, should be deferred.

Best,
Stan

On Fri, Nov 3, 2023 at 6:03 AM Sophie Blee-Goldman 
wrote:

> Looks great, thank you! +1
>
> On Thu, Nov 2, 2023 at 10:21 AM David Jacot 
> wrote:
>
> > +1 from me as well. Thanks, Stan!
> >
> > David
> >
> > On Thu, Nov 2, 2023 at 6:04 PM Ismael Juma  wrote:
> >
> > > Thanks Stanislav, +1
> > >
> > > Ismael
> > >
> > > On Thu, Nov 2, 2023 at 7:01 AM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Hi all,
> > > >
> > > > Given the discussion here and the lack of any pushback, I have
> changed
> > > the
> > > > dates of the release:
> > > > - KIP Freeze - *November 22 *(moved 4 days later)
> > > > - Feature Freeze - *December 6 *(moved 2 days earlier)
> > > > - Code Freeze - *December 20*
> > > >
> > > > If anyone has any thoughts against this proposal - please let me
> know!
> > It
> > > > would be good to settle on this early. These will be the dates we're
> > > going
> > > > with
> > > >
> > > > Best,
> > > > Stanislav
> > > >
> > > > On Thu, Oct 26, 2023 at 12:15 AM Sophie Blee-Goldman <
> > > > sop...@responsive.dev>
> > > > wrote:
> > > >
> > > > > Thanks for the response and explanations -- I think the main
> question
> > > for
> > > > > me
> > > > > was whether we intended to permanently increase the KF -- FF gap
> from
> > > the
> > > > > historical 1 week to 3 weeks? Maybe this was a conscious decision
> > and I
> > > > > just
> > > > >  missed the memo, hopefully someone else can chime in here. I'm all
> > for
> > > > > additional though. And looking around at some of the recent
> releases,
> > > it
> > > > > seems like we haven't been consistently following the "usual"
> > schedule
> > > > > since
> > > > > the 2.x releases.
> > > > >
> > > > > Anyways, my main concern was making sure to leave a full 2 weeks
> > > between
> > > > > feature freeze and code freeze, so I'm generally happy with the new
> > > > > proposal.
> > > > > Although I would still prefer to have the KIP freeze fall on a
> > > Wednesday
> > > > --
> > > > > Ismael actually brought up the same thing during the 3.5.0 release
> > > > > planning,
> > > > > so I'll just refer to his explanation for this:
> > > > >
> > > > > We typically choose a Wednesday for the various freeze dates -
> there
> > > are
> > > > > > often 1-2 day slips and it's better if that doesn't require
> people
> > > > > > working through the weekend.
> > > > > >
> > > > >
> > > > > (From this mailing list thread
> > > > > <https://lists.apache.org/thread/dv1rym2jkf0141sfsbkws8ckkzw7st5h
> >)
> > > > >
> > > > > Thanks for driving the release!
> > > > > Sophie
> > > > >
> > > > > On Wed, Oct 25, 2023 at 8:13 AM Stanislav Kozlovski
> > > > >  wrote:
> > > > >
> > > > > > Thanks for the thorough response, Sophie.
> > > > > >
> > > > > > - Added to the "Future Release Plan"
> > > > > >
> > > > > > > 1. Why is the KIP freeze deadline on a Saturday?
> > > > > >
> > > > > > It was simply added as a starting point - around 30 days from the
> > > > > > announcement. We can move it earlier to the 15th of November, but
> > my
> > > > > > thinking is later is better with these things - it's already
> > > aggressive
> > > > > > enough. e.g given the choice of Nov 15 vs Nov 18, I don't
> > necessarily
> > > > > see a
> > > > > > strong reason to choose 15.
> > > > > >
> > > > > > If people feel strongly about this, to make up for this, we can
> eat
> > > > into
> > > > > >

Re: Apache Kafka 3.7.0 Release

2023-11-02 Thread Stanislav Kozlovski
Hi all,

Given the discussion here and the lack of any pushback, I have changed the
dates of the release:
- KIP Freeze - *November 22 *(moved 4 days later)
- Feature Freeze - *December 6 *(moved 2 days earlier)
- Code Freeze - *December 20*

If anyone has any thoughts against this proposal - please let me know! It
would be good to settle on this early. These will be the dates we're going
with

Best,
Stanislav

On Thu, Oct 26, 2023 at 12:15 AM Sophie Blee-Goldman 
wrote:

> Thanks for the response and explanations -- I think the main question for
> me
> was whether we intended to permanently increase the KF -- FF gap from the
> historical 1 week to 3 weeks? Maybe this was a conscious decision and I
> just
>  missed the memo, hopefully someone else can chime in here. I'm all for
> additional though. And looking around at some of the recent releases, it
> seems like we haven't been consistently following the "usual" schedule
> since
> the 2.x releases.
>
> Anyways, my main concern was making sure to leave a full 2 weeks between
> feature freeze and code freeze, so I'm generally happy with the new
> proposal.
> Although I would still prefer to have the KIP freeze fall on a Wednesday --
> Ismael actually brought up the same thing during the 3.5.0 release
> planning,
> so I'll just refer to his explanation for this:
>
> We typically choose a Wednesday for the various freeze dates - there are
> > often 1-2 day slips and it's better if that doesn't require people
> > working through the weekend.
> >
>
> (From this mailing list thread
> <https://lists.apache.org/thread/dv1rym2jkf0141sfsbkws8ckkzw7st5h>)
>
> Thanks for driving the release!
> Sophie
>
> On Wed, Oct 25, 2023 at 8:13 AM Stanislav Kozlovski
>  wrote:
>
> > Thanks for the thorough response, Sophie.
> >
> > - Added to the "Future Release Plan"
> >
> > > 1. Why is the KIP freeze deadline on a Saturday?
> >
> > It was simply added as a starting point - around 30 days from the
> > announcement. We can move it earlier to the 15th of November, but my
> > thinking is later is better with these things - it's already aggressive
> > enough. e.g given the choice of Nov 15 vs Nov 18, I don't necessarily
> see a
> > strong reason to choose 15.
> >
> > If people feel strongly about this, to make up for this, we can eat into
> > the KF-FF time as I'll touch upon later, and move FF a few days earlier
> to
> > land on a Wednesday.
> >
> > This reduces the time one has to get their feature complete after KF, but
> > allows for longer time to a KIP accepted, so the KF-FF gap can be made up
> > when developing the feature in parallel.
> >
> > > , this makes it easy for everyone to remember when the next deadline is
> > so they can make sure to get everything in on time. I worry that varying
> > this will catch people off guard.
> >
> > I don't see much value in optimizing the dates for ease of memory -
> besides
> > the KIP Freeze (which is the base date), there are only two more dates to
> > remember that are on the wiki. More importantly, we have a plethora of
> > tools that can be used to set up reminders - so a contributor doesn't
> > necessarily need to remember anything if they're serious about getting
> > their feature in.
> >
> > > 3. Is there a particular reason for having the feature freeze almost a
> > full 3 weeks from the KIP freeze? ... having 3 weeks between the KIP and
> > feature freeze (which are
> > usually separated by just a single week)?
> >
> > I was going off the last two releases, which had *20 days* (~3 weeks) in
> > between KF & FF. Here are their dates:
> >
> > - AK 3.5
> >   - KF: 22 March
> >   - FF: 12 April
> > - (20 days after)
> >   - CF: 26 April
> > - (14 days after)
> >   - Release: 15 June
> >  - 50 days after CF
> > - AK 3.6
> >   - KF: 26 July
> >   - FF: 16 Aug
> > - (20 days after)
> >   - CF: 30 Aug
> > - (14 days after)
> >   - Release: 11 October
> > - 42 days after CF
> >
> > I don't know the precise reasoning for extending the time, nor what is
> the
> > most appropriate time - but having talked offline to some folks prior to
> > this discussion, it seemed reasonable.
> >
> > Your proposal uses an aggressive 1-week gap between both, which is quite
> > the jump from the previous 3 weeks.
> >
> > Perhaps someone with more direct experience in the recent can chime in
> > here. Both for the reasoning for the extension from 1w to 3w in the last
> 2
> > release

Re: [VOTE] KIP-975: Docker Image for Apache Kafka

2023-10-27 Thread Stanislav Kozlovski
Thanks for the KIP!

Great idea, well thought out and much needed. +1 (binding)

On Fri, 27 Oct 2023 at 06:36, Krishna Agarwal 
wrote:

> Hi,
> I'd like to call a vote on KIP-975 which aims to publish an official docker
> image for Apache Kafka.
>
> KIP -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
>
> Discussion thread -
> https://lists.apache.org/thread/3g43hps2dmkyxgglplrlwpsf7vkywkyy
>
> Regards,
> Krishna
>


Re: Apache Kafka 3.7.0 Release

2023-10-25 Thread Stanislav Kozlovski
ying this will catch people off guard.
> 3. Is there a particular reason for having the feature freeze almost a full
> 3 weeks from the KIP freeze? I understand moving the KIP freeze deadline up
> to account for recent release delays, but aren't we wasting some of that
> gained time by having 3 weeks between the KIP and feature freeze (which are
> usually separated by just a single week)?
> 4. On the other hand, we usually have a full two weeks from the feature
> freeze deadline to the code freeze but with the given schedule there would
> only be a week and a half. Given how important this period is for testing
> and stabilizing the release, and how vital this is for uncovering blockers
> that would have delayed the release deadline, I really think we should
> maintain the two-week gap (at a minimum)
>
> Note that historically, we have set all the deadlines on a Wednesday and
> when in doubt erred on the side of an earlier deadline, to encourage folks
> to get their work completed and stabilized as soon as possible. We can, and
> often have, allowed things to come in late between the Wednesday freeze
> deadline and the following Friday, but only on a case-by-case basis. This
> way the RM has the flexibility to determine what to allow and when, if need
> be, while still having everyone aim for the established deadlines.
>
> Just to throw a suggestion out there, if we want to avoid running into the
> winter holidays while still making up for slipping of recent releases, what
> about something like this:
>
> KIP Freeze: Nov 22nd
> Feature Freeze: Nov 29th
> Code Freeze: Dec 13th
>
> We can keep the release target as Jan 3rd or move it up to Dec 27th.
> Personally, I would just aim to have it as Dec 27th but keep the stated
> target as Jan 3rd, to account for unexpected blockers/delays and time away
> during the winter holidays
>
> Thoughts?
>
> On Mon, Oct 23, 2023 at 3:14 PM Sophie Blee-Goldman  >
> wrote:
>
> > Can you add the 3.7 plan to the release schedule page?
> >
> > (this -->
> > https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan)
> >
> > Thanks!
> >
> > On Sun, Oct 15, 2023 at 2:27 AM Stanislav Kozlovski
> >  wrote:
> >
> >> Hey Chris,
> >>
> >> Thanks for the catch! It was indeed copied and I wasn't sure what to
> make
> >> of the bullet point, so I kept it. What you say makes sense - I removed
> >> it.
> >>
> >> I also added KIP-976!
> >>
> >> Cheers!
> >>
> >> On Sat, Oct 14, 2023 at 9:35 PM Chris Egerton 
> >> wrote:
> >>
> >> > Hi Stanislav,
> >> >
> >> > Thanks for putting this together! I think the "Ensure that release
> >> > candidates include artifacts for the new Connect test-plugins module"
> >> > section (which I'm guessing was copied over from the 3.6.0 release
> >> plan?)
> >> > can be removed; we made sure that those artifacts were present for
> >> 3.6.0,
> >> > and I don't anticipate any changes that would make them likelier to be
> >> > accidentally dropped in subsequent releases than any other Maven
> >> artifacts
> >> > that we publish.
> >> >
> >> > Also, can we add KIP-976 (
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> >> > )
> >> > to the release plan? The vote thread for it passed last week and I've
> >> > published a complete PR (https://github.com/apache/kafka/pull/14538),
> >> so
> >> > it
> >> > shouldn't be too difficult to get things merged in time for 3.7.0.
> >> >
> >> > Cheers,
> >> >
> >> > Chris
> >> >
> >> > On Sat, Oct 14, 2023 at 3:26 PM Stanislav Kozlovski
> >> >  wrote:
> >> >
> >> > > Thanks for letting me drive it, folks.
> >> > >
> >> > > I've created the 3.7.0 release page here:
> >> > >
> https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0
> >> > > It outlines the key milestones and important dates for the release.
> >> > >
> >> > > In particular, since the last two releases slipped their originally
> >> > > targeted release date by taking an average of 46 days after code
> >> freeze
> >> > (as
> >> > > opposed to the minimum which is 14 days), I pulled the dates forward
> >> to
> >> > try
> >

Re: Apache Kafka 3.7.0 Release

2023-10-15 Thread Stanislav Kozlovski
Hey Chris,

Thanks for the catch! It was indeed copied and I wasn't sure what to make
of the bullet point, so I kept it. What you say makes sense - I removed it.

I also added KIP-976!

Cheers!

On Sat, Oct 14, 2023 at 9:35 PM Chris Egerton 
wrote:

> Hi Stanislav,
>
> Thanks for putting this together! I think the "Ensure that release
> candidates include artifacts for the new Connect test-plugins module"
> section (which I'm guessing was copied over from the 3.6.0 release plan?)
> can be removed; we made sure that those artifacts were present for 3.6.0,
> and I don't anticipate any changes that would make them likelier to be
> accidentally dropped in subsequent releases than any other Maven artifacts
> that we publish.
>
> Also, can we add KIP-976 (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> )
> to the release plan? The vote thread for it passed last week and I've
> published a complete PR (https://github.com/apache/kafka/pull/14538), so
> it
> shouldn't be too difficult to get things merged in time for 3.7.0.
>
> Cheers,
>
> Chris
>
> On Sat, Oct 14, 2023 at 3:26 PM Stanislav Kozlovski
>  wrote:
>
> > Thanks for letting me drive it, folks.
> >
> > I've created the 3.7.0 release page here:
> > https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0
> > It outlines the key milestones and important dates for the release.
> >
> > In particular, since the last two releases slipped their originally
> > targeted release date by taking an average of 46 days after code freeze
> (as
> > opposed to the minimum which is 14 days), I pulled the dates forward to
> try
> > and catch up with the original release schedule.
> > You can refer to the last release during the Christmas holiday season -
> > Apache
> > Kafka 3.4
> > <https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.4.0> -
> > to
> > see sample dates.
> >
> > The currently proposed dates are:
> >
> > *KIP Freeze - 18th November *(Saturday)
> > *This is 1 month and four days from now - rather short - but I'm afraid
> is
> > the only lever that's easy to pull forward.*
> > As usual, a KIP must be accepted by this date in order to be considered
> for
> > this release. Note, any KIP that may not be implemented in a week, or
> that
> > might destabilize the release, should be deferred.
> >
> > *Feature Freeze - 8th December* (Friday)
> > *This follows 3 weeks after the KIP Freeze, as has been the case in our
> > latest releases.*
> > By this point, we want all major features to be merged & us to be working
> > on stabilisation. Minor features should have PRs, the release branch
> should
> > be cut; anything not in this state will be automatically moved to the
> next
> > release in JIRA
> >
> > *Code Freeze - 20th December* (Wednesday)
> >
> > *Critically, this is before the holiday season and ends in the middle of
> > the week, to give contributors more time and flexibility to address any
> > last-minute issues without eating into the time people usually take holidays. It
> > comes 12 days after the Feature Freeze. This is two days shorter than the
> > usual code freeze window. I don't have a strong opinion and am open to
> > extend it to Friday, or trade off a day/two with the KF<->FF date range.*
> >
> > *Release -* *after January 3rd*.
> > *It comes after a minimum of two weeks of stabilization, so the earliest
> > we can start releasing is January 3rd. We will move as fast as we can and
> > aim to complete it as early in January as possible.*
> >
> > As for the initially-populated KIPs in the release plan, I did the
> > following:
> >
> > I kept 4 KIPs that were mentioned in 3.6, saying they would have minor
> > parts finished in 3.7 (as the major ones went out in 3.6)
> > - KIP-405 Tiered Storage mentioned a major part went out with 3.6 and the
> > remainder will come with 3.7
> > - KIP-890 mentioned Part 1 shipped in 3.6. I am assuming the remainder
> will
> > come in 3.7, and have contacted the author to confirm.
> > - KIP-926 was partially implemented in 3.6. I am assuming the remainder
> > will come in 3.7, and have contacted the author to confirm.
> > - KIP-938 mentioned that the majority was completed and a small remainder
> > re: ForwardingManager metrics will come in 3.7. I have contacted the
> author
> > to confirm.
> >
> > I then went through the JIRA filter which looks at open issues with a Fix
> > Version of 3.7 and added KIP-770,

Re: Apache Kafka 3.7.0 Release

2023-10-14 Thread Stanislav Kozlovski
Thanks for letting me drive it, folks.

I've created the 3.7.0 release page here:
https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0
It outlines the key milestones and important dates for the release.

In particular, since the last two releases slipped their originally
targeted release date by taking an average of 46 days after code freeze (as
opposed to the minimum which is 14 days), I pulled the dates forward to try
and catch up with the original release schedule.
You can refer to the last release during the Christmas holiday season - Apache
Kafka 3.4
<https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.4.0> - to
see sample dates.

The currently proposed dates are:

*KIP Freeze - 18th November *(Saturday)
*This is 1 month and four days from now - rather short - but I'm afraid it is
the only lever that's easy to pull forward.*
As usual, a KIP must be accepted by this date in order to be considered for
this release. Note, any KIP that may not be implemented in a week, or that
might destabilize the release, should be deferred.

*Feature Freeze - 8th December* (Friday)
*This follows 3 weeks after the KIP Freeze, as has been the case in our
latest releases.*
By this point, we want all major features to be merged & us to be working
on stabilisation. Minor features should have PRs, the release branch should
be cut; anything not in this state will be automatically moved to the next
release in JIRA

*Code Freeze - 20th December* (Wednesday)

*Critically, this is before the holiday season and ends in the middle of
the week, to give contributors more time and flexibility to address any
last-minute issues without eating into the time people usually take for
holidays. It comes 12 days after the Feature Freeze. This is two days shorter
than the usual code freeze window. I don't have a strong opinion and am open
to extending it to Friday, or trading off a day or two with the KF<->FF date
range.*

*Release -* *after January 3rd*.
*It comes after a minimum of two weeks of stabilization, so the earliest we
can start releasing is January 3rd. We will move as fast as we can and aim
to complete it as early in January as possible.*

As for the initially-populated KIPs in the release plan, I did the
following:

I kept 4 KIPs that were mentioned in 3.6, saying they would have minor
parts finished in 3.7 (as the major ones went out in 3.6)
- KIP-405 Tiered Storage mentioned a major part went out with 3.6 and the
remainder will come with 3.7
- KIP-890 mentioned Part 1 shipped in 3.6. I am assuming the remainder will
come in 3.7, and have contacted the author to confirm.
- KIP-926 was partially implemented in 3.6. I am assuming the remainder
will come in 3.7, and have contacted the author to confirm.
- KIP-938 mentioned that the majority was completed and a small remainder
re: ForwardingManager metrics will come in 3.7. I have contacted the author
to confirm.

I then went through the JIRA filter which looks at open issues with a Fix
Version of 3.7 and added KIP-770, KIP-858, and KIP-980.
I also found a fair number of JIRAs that were targeting the 3.7 release but
have had no activity on them for the past few releases. For most of those, I
pinged the author and explicitly asked whether it is going to make it into
3.7. I have not included those here and will not until I hear confirmation.

Please review the plan and provide any additional information or updates
regarding KIPs that target this release version (3.7).
If you have authored any KIPs that have an inaccurate status in the list,
or are not in the list and should be, or are in the list and should not be
- please inform me in this thread so that I can keep the document accurate
and up to date.

Excited to get this release going!

All the best,
Stanislav

On Tue, Oct 10, 2023 at 9:12 AM Bruno Cadonna  wrote:

> Thanks Stan!
>
> +1
>
> Best,
> Bruno
>
> On 10/10/23 7:24 AM, Luke Chen wrote:
> > Thanks Stanislav!
> >
> > On Tue, Oct 10, 2023 at 3:05 AM Josep Prat 
> > wrote:
> >
> >> Thanks Stanislav!
> >>
> >> ———
> >> Josep Prat
> >>
> >> Aiven Deutschland GmbH
> >>
> >> Alexanderufer 3-7, 10117 Berlin
> >>
> >> Amtsgericht Charlottenburg, HRB 209739 B
> >>
> >> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> >>
> >> m: +491715557497
> >>
> >> w: aiven.io
> >>
> >> e: josep.p...@aiven.io
> >>
> >> On Mon, Oct 9, 2023, 20:05 Chris Egerton 
> wrote:
> >>
> >>> +1, thanks Stanislav!
> >>>
> >>> On Mon, Oct 9, 2023, 14:02 Bill Bejeck  wrote:
> >>>
> >>>> +1
> >>>>
> >>>> Thanks, Stanislav!
> >>>>
> >>>> -Bill
> >>>>
> >>>> On Mon, Oct 9, 2023 at 1:5

[jira] [Resolved] (KAFKA-14175) KRaft Upgrades Part 2

2023-10-14 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-14175.
-
Resolution: Won't Fix

> KRaft Upgrades Part 2
> -
>
> Key: KAFKA-14175
> URL: https://issues.apache.org/jira/browse/KAFKA-14175
> Project: Kafka
>  Issue Type: New Feature
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Major
>
> This is the parent issue for KIP-778 tasks which were not completed for the 
> 3.3 release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Apache Kafka 3.7.0 Release

2023-10-09 Thread Stanislav Kozlovski
Hey all!

I would like to volunteer to be the release manager driving the next
release - Apache Kafka *3.7.0*.

If there are no objections, I will start and share a release plan soon
enough!

Cheers,
Stanislav


Re: [DISCUSS] KIP-932: Queues for Kafka

2023-05-22 Thread Stanislav Kozlovski
Hey Andrew!

Kudos on the proposal. It is very well written - a joy to read. It is
definitely an interesting solution to the queueing problem - I would not
have guessed we could solve it like this. Thank you for working on this.

Happy to get the discussion started - I have a few comments/questions on
first read:

1. Tiered Storage

I notice no mention of Tiered Storage (KIP-405). Does that complicate the
design, especially when fetching historical data? It would be good to have
at least one sentence mentioning it, even if it doesn't impact it. Right
now I'm unsure if it was considered.

2. SSO initialized to the latest offset

> "By default, the SSO for each share-partition is initialized to the
latest offset for the corresponding topic-partitions."

Have we considered allowing this to be configurable to latest/earliest?
This would be consistent with the auto.offset.reset config of vanilla
consumer groups.
Thinking from a user's perspective, it sounds valid to want to start from
the beginning of a topic when starting a share group. Historical processing
comes to mind.

3. Durable Storage

The KIP mentions that "The cluster records this information durably", which
implies that it saves it somewhere. Does the ShareCoordinator have its own
topic? Would it be compacted?

In particular, I am interested in what such a topic's retention would be
like. The vanilla consumer offsets topic has some special retention
semantics (KIP-211) where we start counting the retention after the
consumer group becomes empty (inactive) - the default being 7 days. We need
to make sure the retention here isn't too short either, as the offsets topic
originally had 24 hours of retention and that proved problematic.

In general, some extra detail about the persistence would be greatly
appreciated!

4. Batch Acknowledgement

> "In the situation where some records in a batch have been released or
rejected separately, subsequent fetches of those records are more likely to
have gaps."

Can we expand a bit more on this edge case? I am interested in learning
what gets returned on subsequent fetch requests.
In particular - how does this work with compression? As far as I remember,
we can compress the whole batch there, which might make individual record
filtering tricky.

5. Member Management

How is consumer group member management handled? I didn't see any specific
mention - is it the same as a vanilla group?
In particular - how will bad consumers be handled?

I guess I see two cases:
1. bad consumer that doesn't even heartbeat
2. bad consumer that heartbeats well but for some reason every message
processing times out, e.g. imagine it was network partitioned from some
third-party system that is a critical part of its message processing loop

One evident problem I can foresee in production systems is one (or a few)
slow consumer applications bringing the SSO/SEO advancement down to a crawl.
Imagine an example where the same consumer app always hits the timeout
limit - what would the behavior be in such a case? Do we keep that consumer
app indefinitely (if so, do we run the risk of having it invalidate
completely valid messages)? Are there any equivalents to the consumer group
rebalances which fence off such bad consumers?

6. Processing Semantics (exactly once)

> The delivery counts are only maintained approximately and the Acquired
state is not persisted.

Does this introduce the risk of zombie consumers on share-partition-leader
failure? I.e. restarting and giving another consumer the acquired state for
the same record

I notice that the KIP says:
> Finally, this KIP does not include support for acknowledging delivery
using transactions for exactly-once semantics.
at the very end. It would be helpful to address this earlier in the
example, as one of the key points. And it would be good to be clearer on
what the processing semantics are. They seem to be *at-least-once* to me.


7. nit: Acronyms

I feel like SSO and SEO may be bad acronyms. Googling "Kafka SEO" is bound
to return weird results.
What do we think about the tradeoff of using more-unique acronyms (like
SGEO, SSGO) at the expense of one extra letter?

Again - thanks for working on this! I think it's a great initiative. I'm
excited to see us perfect this proposal and enable a brand new use case in
Kafka!

Best,
Stanislav



On Mon, May 15, 2023 at 2:55 PM Andrew Schofield 
wrote:

> Hi,
> I would like to start a discussion thread on KIP-932: Queues for Kafka.
> This KIP proposes an alternative to consumer groups to enable cooperative
> consumption by consumers without partition assignment. You end up with
> queue semantics on top of regular Kafka topics, with per-message
> acknowledgement and automatic handling of messages which repeatedly fail to
> be processed.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka
>
> Please take a look and let me know what you think.
>
> Thanks.
> Andrew



-- 
Best,
Stanislav


Re: [VOTE] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-09-09 Thread Stanislav Kozlovski
Thanks everybody!

This KIP has passed with 3 binding votes (Luke, David, Colin).

Colin, I agree with the nitpick and have extended the KIP to mention
throwing an informative error message.

On Wed, Sep 7, 2022 at 11:09 PM Colin McCabe  wrote:

> +1 (binding).
>
> One nitpick: when the user sets AllowReplicationFactorChange = false, the
> exception the user gets back from AdminClient should mention that this was
> the problem.  If the exception just says "The broker does not support
> ALTER_PARTITION_REASSIGNMENTS with version in range [1, 1]. The supported
> range is [0, 0]." the user will be confused about what the problem is.
> Instead, the exception should mention that the broker does not support
> AllowReplicationFactorChange.
>
> best,
> Colin
>
>
> On Wed, Sep 7, 2022, at 06:11, David Jacot wrote:
> > +1 from me. Thanks, Stan!
> >
> > On Tue, Aug 23, 2022 at 12:10 PM Luke Chen  wrote:
> >>
> >> Hi Stanislav,
> >>
> >> Thanks for the KIP.
> >> The solution looks reasonable to me.
> >> +1 from me.
> >>
> >> Thank you.
> >> Luke
> >>
> >> On Tue, Aug 23, 2022 at 6:07 AM Stanislav Kozlovski
> >>  wrote:
> >>
> >> > Hello,
> >> >
> >> > I'd like to start a vote on KIP-860, which adds a client-provided
> option to
> >> > the AlterPartitionReassignmentsRequest that allows the user to guard
> >> > against an unintentional change in the replication factor during
> partition
> >> > reassignments.
> >> >
> >> > Discuss Thread:
> >> > https://lists.apache.org/thread/bhrqjd4vb05xtztkdo8py374m9dgq69r
> >> > KIP:
> >> >
> >> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> >> > JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
> >> >
> >> >
> >> > --
> >> > Best,
> >> > Stanislav
> >> >
>


-- 
Best,
Stanislav


Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-09-09 Thread Stanislav Kozlovski
Thanks Ismael,

I added an extra paragraph in the motivation. We have certainly hit this
within our internal Confluent reassignment software, and from a quick skim
of the popular Cruise Control repository, I notice that similar problems
have been hit there too. Hopefully the examples in the KIP are sufficient
to make the case.

On Wed, Sep 7, 2022 at 11:21 PM Ismael Juma  wrote:

> Thanks for the details, Colin. I understand how this can happen. But this
> API has been out for a long time. Are we saying that we have seen Cruise
> Control cause this kind of problem? If so, it would be good to mention it
> in the KIP as evidence that the current approach is brittle.
>
> Ismael
>
> On Wed, Sep 7, 2022 at 2:15 PM Colin McCabe  wrote:
>
> > Hi Ismael,
> >
> > I think this issue comes up when people write software that automatically
> > creates partition reassignments to balance the cluster. Cruise Control is
> > one example; Confluent also has some software that does this. If there is
> > already a reassignment that is going on for some partition and the
> software
> > tries to create a new reassignment for that partition, the software may
> > inadvertently change the replication factor.
> >
> > In general, I think some people find it surprising that reassignment can
> > change the replication factor of a partition. When we outlined the
> > reassignment API in KIP-455 we maintained the ability to do this, since
> the
> > old ZK-based API had always been able to do it. But this was a bit
> > controversial. Maybe it would have been more intuitive to preserve
> > replication factor by default unless the user explicitly stated that they
> > wanted to change it. So in a sense, you could view this as a fix for
> > KIP-455 :) (in my opinion, at least)
> >
> > best,
> > Colin
> >
> >
> > On Wed, Sep 7, 2022, at 07:07, Ismael Juma wrote:
> > > Thanks for the KIP. Can we explain a bit more why this is an important
> > use
> > > case to address? For example, do we have concrete examples of people
> > > running into this? The way the KIP is written, it sounds like a
> potential
> > > problem but no information is given on whether it's a real problem in
> > > practice.
> > >
> > > Ismael
> > >
> > > On Thu, Jul 28, 2022 at 2:00 AM Stanislav Kozlovski
> > >  wrote:
> > >
> > >> Hey all,
> > >>
> > >> I'd like to start a discussion on a proposal to help prevent API users from
> > >> inadvertently increasing the replication factor of a topic through
> > >> the alter partition reassignments API. The KIP describes two fairly
> > >> easy-to-hit race conditions in which this can happen.
> > >>
> > >> The KIP itself is pretty simple, yet has a couple of alternatives that
> > can
> > >> help solve the same problem. I would appreciate thoughts from the
> > community
> > >> on how you think we should proceed, and whether the proposal makes
> > sense in
> > >> the first place.
> > >>
> > >> Thanks!
> > >>
> > >> KIP:
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> > >> JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
> > >>
> > >> --
> > >> Best,
> > >> Stanislav
> > >>
> >
>


-- 
Best,
Stanislav


[VOTE] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-08-22 Thread Stanislav Kozlovski
Hello,

I'd like to start a vote on KIP-860, which adds a client-provided option to
the AlterPartitionReassignmentsRequest that allows the user to guard
against an unintentional change in the replication factor during partition
reassignments.

Discuss Thread:
https://lists.apache.org/thread/bhrqjd4vb05xtztkdo8py374m9dgq69r
KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
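
For illustration, here is a minimal sketch of how a client might use the
proposed guard. The Admin API calls and classes below exist today; the
allowReplicationFactorChange setter is only what this KIP proposes, so it is
shown commented out as a hypothetical, and the topic name and broker ids are
made up:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterPartitionReassignmentsOptions;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class GuardedReassignmentExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);
            Map<TopicPartition, Optional<NewPartitionReassignment>> reassignments =
                Map.of(tp, Optional.of(new NewPartitionReassignment(List.of(4, 5, 6))));

            AlterPartitionReassignmentsOptions options =
                new AlterPartitionReassignmentsOptions();
            // Proposed by this KIP (hypothetical setter, not a released API):
            // reject the request instead of silently changing the replication factor.
            // options.allowReplicationFactorChange(false);

            admin.alterPartitionReassignments(reassignments, options).all().get();
        }
    }
}
{code}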


-- 
Best,
Stanislav


Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-08-22 Thread Stanislav Kozlovski
Thanks David,

I do prefer the `disallow-replication-factor-change` flag but only for the
CLI. I assume that's what you're proposing instead of
"disable-replication-factor-change". The wording is more natural in your
suggestion I feel.
If we were to modify more (e.g. the RPC and Admin API), I think it'd be a bit
less straightforward in the code and API to reason about having a
double-negative - e.g. `disallowReplicationFactorChange=false`.
I have changed the KIP to mention "disallow".

As for the RPCs, I had only envisioned changes in the request - hence I had
only pasted the `.diff`. I have now added the option to be returned as part
of the response too, and have described both RPC jsons in the conventional
way we do it.
Hopefully, this looks good.

I'll be starting a voting thread now.

Best,
Stanislav

On Tue, Aug 16, 2022 at 6:17 AM David Jacot 
wrote:

> Thanks Stan. Overall, the KIP looks good to me. I have two minor comments:
>
> * Could we document the different versions in the
> AlterPartitionReassignmentsRequest/Request schema? You can look at
> other requests/responses to see how we have done this so far.
>
> * I wonder if --disallow-replication-factor-change would be a better
> name. I don't feel strong about this so I am happy to go with the
> quorum here.
>
> Best,
> David
>
> On Tue, Aug 16, 2022 at 12:31 AM Stanislav Kozlovski
>  wrote:
> >
> > Thanks for the discussion all,
> >
> > I have updated the KIP to mention throwing an UnsupportedVersionException
> > if the server is running an old version that would not honor the
> configured
> > allowReplicationFactor option.
> >
> > Please take a look:
> > - KIP:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> > - changes:
> >
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=217392873=4=3
> >
> > If there aren't extra comments, I plan on starting a vote thread by the
> end
> > of this week.
> >
> > Best,
> > Stanislav
> >
> > On Tue, Aug 9, 2022 at 5:06 AM David Jacot 
> > wrote:
> >
> > > Throwing an UnsupportedVersionException with an appropriate message
> > > seems to be the best option when the new API is not supported and
> > > AllowReplicationFactorChange is not set to the default value.
> > >
> > > Cheers,
> > > David
> > >
> > > On Mon, Aug 8, 2022 at 6:25 PM Vikas Singh  >
> > > wrote:
> > > >
> > > > I personally like the UVE option. It provides options for clients to
> go
> > > > either way, retry or abort. If we do it in AdminClient, then users
> have
> > > to
> > > > live with what we have chosen.
> > > >
> > > > > Note this can happen during an RF change too. e.g [1,2,3] =>
> [4,5,6,7]
> > > (RF
> > > > > change, intermediate set is [1,2,3,4,5,6,7]), and we try to do a
> > > > > reassignment to [9,10,11], the logic will compare [4,5,6,7] to
> > > [9,10,11].
> > > > > In such a situation where one wants to cancel the RF increase and
> > > reassign
> > > > > again, one first needs to cancel the existing reassignment via the
> API
> > > (no
> > > > > special action required despite RF change)
> > > >
> > > > Thanks for the explanation. I did realize this nuance and thus
> requested
> > > to
> > > > put that in KIP as it's not mentioned why the choice was made. I am
> fine
> > > if
> > > > you choose to not do it in the interest of brevity.
> > > >
> > > > Vikas
> > > >
> > > > On Sun, Aug 7, 2022 at 9:02 AM Stanislav Kozlovski
> > > >  wrote:
> > > >
> > > > > Thank you for the reviews.
> > > > >
> > > > > Vikas,
> > > > > > > In the case of an already-reassigning partition being
> reassigned
> > > again,
> > > > > the validation compares the targetReplicaSet size of the
> reassignment
> > > to
> > > > > the targetReplicaSet size of the new reassignment and throws if
> those
> > > > > differ.
> > > > > > Can you add more detail to this, or clarify what is
> targetReplicaSet
> > > (for
> > > > > e.g. why not sourceReplicaSet?) and how the target replica set
> will be
> > > > > calculated?
> > > > > If a reassignment is ongoing, such that [1,2

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-08-15 Thread Stanislav Kozlovski
Thanks for the discussion all,

I have updated the KIP to mention throwing an UnsupportedVersionException
if the server is running an old version that would not honor the configured
allowReplicationFactor option.
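
As a rough sketch of what handling that error could look like on the client
side - assuming, per the KIP, that the failure surfaces through the result
future as an UnsupportedVersionException - a caller might do something like
the following (the guard option itself is still only a proposal; class and
method names here are illustrative):

{code:java}
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AlterPartitionReassignmentsResult;
import org.apache.kafka.common.errors.UnsupportedVersionException;

class ReassignmentSubmitter {
    void awaitResult(AlterPartitionReassignmentsResult result) throws Exception {
        try {
            result.all().get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof UnsupportedVersionException) {
                // Per the KIP: the broker is too old to honor the replication-factor
                // guard. The caller can decide to retry without the option or abort.
                System.err.println("Broker does not support the RF-change guard: "
                    + e.getCause().getMessage());
            } else {
                throw e;
            }
        }
    }
}
{code}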

Please take a look:
- KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
- changes:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=217392873=4=3

If there aren't extra comments, I plan on starting a vote thread by the end
of this week.

Best,
Stanislav

On Tue, Aug 9, 2022 at 5:06 AM David Jacot 
wrote:

> Throwing an UnsupportedVersionException with an appropriate message
> seems to be the best option when the new API is not supported and
> AllowReplicationFactorChange is not set to the default value.
>
> Cheers,
> David
>
> On Mon, Aug 8, 2022 at 6:25 PM Vikas Singh 
> wrote:
> >
> > I personally like the UVE option. It provides options for clients to go
> > either way, retry or abort. If we do it in AdminClient, then users have
> to
> > live with what we have chosen.
> >
> > > Note this can happen during an RF change too. e.g [1,2,3] => [4,5,6,7]
> (RF
> > > change, intermediate set is [1,2,3,4,5,6,7]), and we try to do a
> > > reassignment to [9,10,11], the logic will compare [4,5,6,7] to
> [9,10,11].
> > > In such a situation where one wants to cancel the RF increase and
> reassign
> > > again, one first needs to cancel the existing reassignment via the API
> (no
> > > special action required despite RF change)
> >
> > Thanks for the explanation. I did realize this nuance and thus requested
> to
> > put that in KIP as it's not mentioned why the choice was made. I am fine
> if
> > you choose to not do it in the interest of brevity.
> >
> > Vikas
> >
> > On Sun, Aug 7, 2022 at 9:02 AM Stanislav Kozlovski
> >  wrote:
> >
> > > Thank you for the reviews.
> > >
> > > Vikas,
> > > > > In the case of an already-reassigning partition being reassigned
> again,
> > > the validation compares the targetReplicaSet size of the reassignment
> to
> > > the targetReplicaSet size of the new reassignment and throws if those
> > > differ.
> > > > Can you add more detail to this, or clarify what is targetReplicaSet
> (for
> > > e.g. why not sourceReplicaSet?) and how the target replica set will be
> > > calculated?
> > > If a reassignment is ongoing, such that [1,2,3] => [4,5,6] (the
> replica set
> > > in Kafka will be [1,2,3,4,5,6] during the reassignment), and you try to
> > > issue a new reassignment (e.g [7,8,9], Kafka should NOT think that the
> RF
> > > of the partition is 6 just because a reassignment is ongoing. Hence, we
> > > compare [4,5,6]'s length to [7,8,9]
> > > The targetReplicaSet is a term we use in KIP-455
> > > <
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> > > >.
> > > It means the desired replica set that a given reassignment is trying to
> > > achieve. Here we compare said set of the on-going reassignment to the
> new
> > > reassignment.
> > >
> > > Note this can happen during an RF change too. e.g [1,2,3] => [4,5,6,7]
> (RF
> > > change, intermediate set is [1,2,3,4,5,6,7]), and we try to do a
> > > reassignment to [9,10,11], the logic will compare [4,5,6,7] to
> [9,10,11].
> > > In such a situation where one wants to cancel the RF increase and
> reassign
> > > again, one first needs to cancel the existing reassignment via the API
> (no
> > > special action required despite RF change)
> > >
> > >
> > > > And what about the reassign partitions CLI? Do we want to expose the
> > > option there too?
> > > Yes, this is already present in the KIP if I'm not mistaken. We
> describe it
> > > in "Accordingly, the kafka-reassign-partitions.sh tool will be updated
> to
> > > allow supplying the new option:"
> > > I have edited the KIP to contain two clear paragraphs called Admin API
> and
> > > CLI now.
> > >
> > > Colin,
> > >
> > > >  it would be nice for the first paragraph to be a bit more explicit
> about
> > > this goal.
> > > sounds good, updated it with that suggestion.
> > >
> > > > client-side forward compatibility
> > > I was under the assumption that it is not

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-08-07 Thread Stanislav Kozlovski
licaSet size of the reassignment to
> > the targetReplicaSet size of the new reassignment and throws if those
> > differ.
> > Can you add more detail to this, or clarify what is targetReplicaSet (for
> > e.g. why not sourceReplicaSet?) and how the target replica set will be
> > calculated?
> >
> > And what about the reassign partitions CLI? Do we want to expose the
> option
> > there too?
> >
> > Cheers,
> > Vikas
> >
> > On Thu, Jul 28, 2022 at 1:59 AM Stanislav Kozlovski <
> stanis...@confluent.io>
> > wrote:
> >
> >> Hey all,
> >>
> >> I'd like to start a discussion on a proposal to help prevent API users from
> >> inadvertently increasing the replication factor of a topic through
> >> the alter partition reassignments API. The KIP describes two fairly
> >> easy-to-hit race conditions in which this can happen.
> >>
> >> The KIP itself is pretty simple, yet has a couple of alternatives that
> can
> >> help solve the same problem. I would appreciate thoughts from the
> community
> >> on how you think we should proceed, and whether the proposal makes
> sense in
> >> the first place.
> >>
> >> Thanks!
> >>
> >> KIP:
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> >> JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
> >>
> >> --
> >> Best,
> >> Stanislav
> >>
>


-- 
Best,
Stanislav


[DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

2022-07-28 Thread Stanislav Kozlovski
Hey all,

I'd like to start a discussion on a proposal to help prevent API users from
inadvertently increasing the replication factor of a topic through
the alter partition reassignments API. The KIP describes two fairly
easy-to-hit race conditions in which this can happen.
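
To make one of those races concrete, here is an illustrative sketch (topic
name and broker ids are made up) of how a tool that computes a new assignment
from a describeTopics snapshot can silently bump the replication factor when
a reassignment is already in flight, since the reported replica list is then
the union of the old and new replica sets:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

class NaiveRebalancer {
    void rebalance(Admin admin) throws Exception {
        // If partition 0 is mid-reassignment ([1,2,3] -> [4,5,6]), the description
        // reports the union [1,2,3,4,5,6] as the replica list.
        TopicDescription desc = admin.describeTopics(List.of("my-topic"))
            .allTopicNames().get().get("my-topic");
        List<Integer> observedReplicas = desc.partitions().get(0).replicas().stream()
            .map(node -> node.id())
            .collect(Collectors.toList());

        // Feeding that snapshot back in as the target silently turns an RF-3
        // partition into an RF-6 one - the situation the proposed option guards against.
        admin.alterPartitionReassignments(Map.of(
                new TopicPartition("my-topic", 0),
                Optional.of(new NewPartitionReassignment(observedReplicas))))
            .all().get();
    }
}
{code}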

The KIP itself is pretty simple, yet has a couple of alternatives that can
help solve the same problem. I would appreciate thoughts from the community
on how you think we should proceed, and whether the proposal makes sense in
the first place.

Thanks!

KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
JIRA: https://issues.apache.org/jira/browse/KAFKA-14121

-- 
Best,
Stanislav


[jira] [Created] (KAFKA-14121) AlterPartitionReassignments API should allow callers to specify the option of preserving the replication factor

2022-07-28 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-14121:
---

 Summary: AlterPartitionReassignments API should allow callers to 
specify the option of preserving the replication factor
 Key: KAFKA-14121
 URL: https://issues.apache.org/jira/browse/KAFKA-14121
 Project: Kafka
  Issue Type: New Feature
Reporter: Stanislav Kozlovski


Using Kafka's public APIs to get metadata regarding the non-reassigning 
replicas for a topic is unreliable and prone to race conditions.
If a person or a system is to rely on the provided metadata, it can end up 
unintentionally increasing the replication factor for a partition.
It would be useful to have some sort of guardrail against this happening 
inadvertently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-8406) kafka-topics throws wrong error on invalid configuration with bootstrap-server and alter config

2021-09-08 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-8406.

Fix Version/s: 2.4.0
   Resolution: Fixed

> kafka-topics throws wrong error on invalid configuration with 
> bootstrap-server and alter config
> ---
>
> Key: KAFKA-8406
> URL: https://issues.apache.org/jira/browse/KAFKA-8406
> Project: Kafka
>  Issue Type: Improvement
>    Reporter: Stanislav Kozlovski
>    Assignee: Stanislav Kozlovski
>Priority: Minor
> Fix For: 2.4.0
>
>
> Running
> {code:java}
> ./kafka-topics --bootstrap-server  --alter --config 
> retention.ms=360 --topic topic{code}
> Results in
> {code:java}
> Missing required argument "[partitions]"{code}
> Running
> {code:java}
> ./kafka-topics --bootstrap-server  --alter --config 
> retention.ms=360 --topic topic --partitions 25{code}
> Results in
> {code:java}
> Option combination "[bootstrap-server],[config]" can't be used with option 
> "[alter]"{code}
> For better clarity, we should just throw the last error outright.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] KIP-334 Include partitions in exceptions raised during consumer record deserialization/validation

2021-04-14 Thread Stanislav Kozlovski
Hey all,

To revive this old KIP, Sarwar Bhuiyan has volunteered to take over
ownership.
He will continue to drive this KIP to approval and completion - I
understand that he will re-start the discussion with a new [DISCUSS] or
[VOTE] thread.

Thank you Sarwar!

Best,
Stanislav

On Fri, Jan 10, 2020 at 5:55 PM Gwen Shapira  wrote:

> Sorry for the super late reply, but since the vote thread showed up,
> I've read the KIP and have a concern.
> The concern was raised by Matthias Sax earlier, but I didn't see it
> addressed.
>
> Basically the current iteration of the KIP unifies deserialization
> errors with corruption. As was pointed out, these are not the same
> thing. Corruption means that the message itself (including envelope,
> not just the payload) is broken. De-serialization errors mean that
> either key or value serializers have a problem. It can even be a
> temporary problem of connecting to schema registry, I believe. Corrupt
> messages can only be skipped. De-serialization errors can (and
> arguably should) be retried with a different serializer. Something
> like Connect will need to skip corrupt messages, but messages with
> SerDe issues should probably go into a dead-letter queue.
>
> Anyway, IMO we need exceptions that will let us tell the difference.
>
> Gwen
>
> On Fri, Oct 11, 2019 at 10:05 AM Stanislav Kozlovski
>  wrote:
> >
> > Thanks Jason. I've edited the KIP with the latest proposal.
> >
> > On Thu, Oct 10, 2019 at 2:00 AM Jason Gustafson 
> wrote:
> >
> > > Hi Stanislav,
> > >
> > > Sorry for the late comment. I'm happy with the current proposal. Just
> one
> > > small request is to include an example which shows how a user could use
> > > this to skip over a record.
> > >
> > > I'd suggest pushing this to a vote to see if anyone else has feedback.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Sat, Jul 13, 2019 at 2:27 PM Stanislav Kozlovski <
> > > stanis...@confluent.io>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > Could we restart the discussion here again?
> > > >
> > > > My last message was sent on the 3rd of June but I haven't received
> > > replies
> > > > since then.
> > > >
> > > > I'd like to get this KIP to a finished state, regardless of whether
> that
> > > is
> > > > merged-in or discarded. It has been almost one year since the
> publication
> > > > of the KIP.
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > On Mon, Jun 3, 2019 at 11:19 AM Stanislav Kozlovski <
> > > > stanis...@confluent.io>
> > > > wrote:
> > > >
> > > > > Do people agree with the approach I outlined in my last reply?
> > > > >
> > > > > On Mon, May 6, 2019 at 2:12 PM Stanislav Kozlovski <
> > > > stanis...@confluent.io>
> > > > > wrote:
> > > > >
> > > > >> Hey there Kamal,
> > > > >>
> > > > >> I'm sincerely sorry for missing your earlier message. As I open
> this
> > > > >> thread up, I see I have an unsent draft message about resuming
> > > > discussion
> > > > >> from some time ago.
> > > > >>
> > > > >> In retrospect, I think I may have been too pedantic with the
> exception
> > > > >> naming and hierarchy.
> > > > >> I now believe a single exception type of
> > > > `RecordDeserializationException`
> > > > >> is enough. Let's go with that.
> > > > >>
> > > > >> On Mon, May 6, 2019 at 6:40 AM Kamal Chandraprakash <
> > > > >> kamal.chandraprak...@gmail.com> wrote:
> > > > >>
> > > > >>> Matthias,
> > > > >>>
> > > > >>> We already have CorruptRecordException which doesn't extend the
> > > > >>> SerializationException. So, we need an alternate
> > > > >>> name suggestion for the corrupted record error if we decide to
> extend
> > > > the
> > > > >>> FaultyRecordException class.
> > > > >>>
> > > > >>> Stanislav,
> > > > >>>
> > > > >>> Our users are also facing this error. Could we bump up this
> > > discussion?
> > > > >>>
> > > > >>>

[jira] [Created] (KAFKA-12555) Log reason for rolling a segment

2021-03-25 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-12555:
---

 Summary: Log reason for rolling a segment
 Key: KAFKA-12555
 URL: https://issues.apache.org/jira/browse/KAFKA-12555
 Project: Kafka
  Issue Type: Improvement
Reporter: Stanislav Kozlovski


It would be useful for issue-diagnostic purposes to log the reason for why a 
log segment was rolled 
(https://github.com/apache/kafka/blob/e840b03a026ddb9a67a15a164d877545130d6e17/core/src/main/scala/kafka/log/Log.scala#L2069)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10510) Reassigning partitions should not allow increasing RF of a partition unless configured with it

2020-09-21 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-10510:
---

 Summary: Reassigning partitions should not allow increasing RF of 
a partition unless configured with it
 Key: KAFKA-10510
 URL: https://issues.apache.org/jira/browse/KAFKA-10510
 Project: Kafka
  Issue Type: Improvement
Reporter: Stanislav Kozlovski


Kafka should have some validations in place against increasing the RF of a 
partition through a reassignment. Users could otherwise shoot themselves in the 
foot by increasing the RF of a topic by reassigning its partitions to extra 
replicas, only to have new partition creations use a lower (the configured) 
replication factor.

Our tools should ideally detect when RF is increasing inconsistently with the 
config and issue a separate command to change the config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [ANNOUNCE] Apache Kafka 2.6.0

2020-08-06 Thread Stanislav Kozlovski
Thanks for driving the release Randall!
Congratulations to everybody involved - awesome work!

On Thu, Aug 6, 2020 at 5:21 PM Randall Hauch  wrote:

> The Apache Kafka community is pleased to announce the release for Apache
> Kafka 2.6.0
>
> * TLSv1.3 has been enabled by default for Java 11 or newer.
> * Significant performance improvements, especially when the broker has
> large numbers of partitions
> * Smooth scaling out of Kafka Streams applications
> * Kafka Streams support for emit on change
> * New metrics for better operational insight
> * Kafka Connect can automatically create topics for source connectors
> * Improved error reporting options for sink connectors in Kafka Connect
> * New Filter and conditional SMTs in Kafka Connect
> * The default value for the `client.dns.lookup` configuration is
> now `use_all_dns_ips`
> * Upgrade Zookeeper to 3.5.8
>
> This release also includes other features, 74 improvements, 175 bug fixes,
> plus other changes.
>
> All of the changes in this release can be found in the release notes:
> https://www.apache.org/dist/kafka/2.6.0/RELEASE_NOTES.html
>
>
> You can download the source and binary release (Scala 2.12 and 2.13) from:
> https://kafka.apache.org/downloads#2.6.0
>
>
> ---
>
>
> Apache Kafka is a distributed streaming platform with four core APIs:
>
>
> ** The Producer API allows an application to publish a stream of records to
> one or more Kafka topics.
>
> ** The Consumer API allows an application to subscribe to one or more
> topics and process the stream of records produced to them.
>
> ** The Streams API allows an application to act as a stream processor,
> consuming an input stream from one or more topics and producing an
> output stream to one or more output topics, effectively transforming the
> input streams to output streams.
>
> ** The Connector API allows building and running reusable producers or
> consumers that connect Kafka topics to existing applications or data
> systems. For example, a connector to a relational database might
> capture every change to a table.
>
>
> With these APIs, Kafka can be used for two broad classes of application:
>
> ** Building real-time streaming data pipelines that reliably get data
> between systems or applications.
>
> ** Building real-time streaming applications that transform or react
> to the streams of data.
>
>
> Apache Kafka is in use at large and small companies worldwide, including
> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
> Target, The New York Times, Uber, Yelp, and Zalando, among others.
>
> A big thank you for the following 127 contributors to this release!
>
> 17hao, A. Sophie Blee-Goldman, Aakash Shah, Adam Bellemare, Agam Brahma,
> Alaa Zbair, Alexandra Rodoni, Andras Katona, Andrew Olson, Andy Coates,
> Aneel Nazareth, Anna Povzner, Antony Stubbs, Arjun Satish, Auston, avalsa,
> Badai Aqrandista, belugabehr, Bill Bejeck, Bob Barrett, Boyang Chen, Brian
> Bushree, Brian Byrne, Bruno Cadonna, Charles Feduke, Chia-Ping Tsai, Chris
> Egerton, Colin Patrick McCabe, Daniel, Daniel Beskin, David Arthur, David
> Jacot, David Mao, dengziming, Dezhi “Andy” Fang, Dima Reznik, Dominic
> Evans, Ego, Eric Bolinger, Evelyn Bayes, Ewen Cheslack-Postava, fantayeneh,
> feyman2016, Florian Hussonnois, Gardner Vickers, Greg Harris, Gunnar
> Morling, Guozhang Wang, high.lee, Hossein Torabi, huxi, Ismael Juma, Jason
> Gustafson, Jeff Huang, jeff kim, Jeff Widman, Jeremy Custenborder, Jiamei
> Xie, jiameixie, jiao, Jim Galasyn, Joel Hamill, John Roesler, Jorge Esteban
> Quilcate Otoya, José Armando García Sancio, Konstantine Karantasis, Kowshik
> Prakasam, Kun Song, Lee Dongjin, Leonard Ge, Lev Zemlyanov, Levani
> Kokhreidze, Liam Clarke-Hutchinson, Lucas Bradstreet, Lucent-Wong, Magnus
> Edenhill, Manikumar Reddy, Mario Molina, Matthew Wong, Matthias J. Sax,
> maulin-vasavada, Michael Viamari, Michal T, Mickael Maison, Mitch, Navina
> Ramesh, Navinder Pal Singh Brar, nicolasguyomar, Nigel Liang, Nikolay,
> Okada Haruki, Paul, Piotr Fras, Radai Rosenblatt, Rajini Sivaram, Randall
> Hauch, Rens Groothuijsen, Richard Yu, Rigel Bezerra de Melo, Rob Meng,
> Rohan, Ron Dagostino, Sanjana Kaundinya, Scott, Scott Hendricks, sebwills,
> Shailesh Panwar, showuon, SoontaekLim, Stanislav Kozlovski, Steve
> Rodrigues, Svend Vanderveken, Sönke Liebau, THREE LEVEL HELMET, Tom
> Bentley, Tu V. Tran, Valeria, Vikas Singh, Viktor Somogyi, vinoth chandar,
> Vito Jeng, Xavier Léauté, xiaodongdu, Zach Zhang, zhaohaidao, zshuo, 阿洋
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at
> https://kafka.apache.org/
>
> Thank you!
>
>
> Regards,
>
> Randall Hauch
>


-- 
Best,
Stanislav


[jira] [Resolved] (KAFKA-10353) Trogdor - Fix RoundTripWorker to not fail when the topic it's trying to create already exists

2020-08-04 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-10353.
-
Resolution: Duplicate

> Trogdor - Fix RoundTripWorker to not fail when the topic it's trying to 
> create already exists
> -
>
> Key: KAFKA-10353
> URL: https://issues.apache.org/jira/browse/KAFKA-10353
> Project: Kafka
>  Issue Type: Bug
>    Reporter: Stanislav Kozlovski
>Priority: Major
>
> Trogdor's RoundTripWorker calls WorkerUtils#createTopics with a failOnCreate 
> flag equal to true, making the code throw an exception if the topic already 
> exists.
> [https://github.com/apache/kafka/blob/28b7d8e21656649fb09b09f9bacfe865b0ca133c/tools/src/main/java/org/apache/kafka/trogdor/workload/RoundTripWorker.java#L149]
> This is prone to race conditions when scheduling multiple workers to start at 
> the same time - only one will succeed in creating the topic and running the 
> test, while the rest will end up with a fatal error.
> This has also been seen to happen in the RoundTripFaultTest system test, where 
> a network exception can cause the CreateTopics request to reach Kafka but 
> Trogdor to retry it and hit a TopicAlreadyExists exception on the retry, failing 
> the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10353) Trogdor - Fix RoundTripWorker to not fail when the topic it's trying to create already exists

2020-08-04 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-10353:
---

 Summary: Trogdor - Fix RoundTripWorker to not fail when the topic 
it's trying to create already exists
 Key: KAFKA-10353
 URL: https://issues.apache.org/jira/browse/KAFKA-10353
 Project: Kafka
  Issue Type: Bug
Reporter: Stanislav Kozlovski


Trogdor's RoundTripWorker calls WorkerUtils#createTopics with a failOnCreate 
flag equal to true, making the code throw an exception if the topic already 
exists.

[https://github.com/apache/kafka/blob/28b7d8e21656649fb09b09f9bacfe865b0ca133c/tools/src/main/java/org/apache/kafka/trogdor/workload/RoundTripWorker.java#L149]

This is prone to race conditions when scheduling multiple workers to start at 
the same time - only one will succeed in creating the topic and running the test, 
while the rest will end up with a fatal error.

This has also been seen to happen in the RoundTripFaultTest system test, where a 
network exception can cause the CreateTopics request to reach Kafka but Trogdor 
to retry it and hit a TopicAlreadyExists exception on the retry, failing the test.
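
For reference, a minimal sketch of the tolerant-create pattern the fix implies, 
using the public Admin API rather than Trogdor's WorkerUtils (whose exact 
signature is not reproduced here); topic name and settings are illustrative:

{code:java}
import java.util.List;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.TopicExistsException;

class TolerantTopicCreator {
    void createIfMissing(Admin admin) throws Exception {
        try {
            admin.createTopics(List.of(new NewTopic("round-trip-topic", 1, (short) 3)))
                .all().get();
        } catch (ExecutionException e) {
            if (!(e.getCause() instanceof TopicExistsException)) {
                throw e;  // only swallow the "already exists" case
            }
            // Another worker (or a retried request) created it first - that is fine.
        }
    }
}
{code}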



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10301) Partition#remoteReplicasMap can be empty in certain race conditions

2020-07-27 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-10301.
-
Resolution: Fixed

> Partition#remoteReplicasMap can be empty in certain race conditions
> ---
>
> Key: KAFKA-10301
> URL: https://issues.apache.org/jira/browse/KAFKA-10301
> Project: Kafka
>  Issue Type: Bug
>    Reporter: Stanislav Kozlovski
>    Assignee: Stanislav Kozlovski
>Priority: Blocker
>
> In Partition#updateAssignmentAndIsr, we would previously update the 
> `partition#remoteReplicasMap` by adding the new replicas to the map and then 
> removing the old ones 
> (source: https://github.com/apache/kafka/blob/7f9187fe399f3f6b041ca302bede2b3e780491e7/core/src/main/scala/kafka/cluster/Partition.scala#L657)
> During a recent refactoring, we changed it to first clear the map and then 
> add all the replicas to it 
> (source: https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/cluster/Partition.scala#L663)
> While this is done in a write lock (`inWriteLock(leaderIsrUpdateLock)`), not 
> all callers that access the map structure use a lock. Some examples:
>  - Partition#updateFollowerFetchState
>  - DelayedDeleteRecords#tryComplete
>  - Partition#getReplicaOrException - called in 
> `checkEnoughReplicasReachOffset` without a lock, which itself is called by 
> DelayedProduce. I think this can fail a  `ReplicaManager#appendRecords` call.
> While we want to polish the code to ensure this sort of race condition 
> becomes harder (or impossible) to introduce, it sounds safest to revert to the 
> previous behavior given the timelines regarding the 2.6 release. Jira 
> https://issues.apache.org/jira/browse/KAFKA-10302 tracks further 
> modifications to the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10302) Ensure thread-safe access to Partition#remoteReplicasMap

2020-07-23 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-10302:
---

 Summary: Ensure thread-safe access to Partition#remoteReplicasMap
 Key: KAFKA-10302
 URL: https://issues.apache.org/jira/browse/KAFKA-10302
 Project: Kafka
  Issue Type: Bug
Reporter: Stanislav Kozlovski


A recent Jira (https://issues.apache.org/jira/browse/KAFKA-10301) exposed how 
easy it is to introduce nasty race conditions with the 
Partition#remoteReplicasMap data structure. It is a concurrent map which is 
modified inside a write lock but it is not always accessed through that lock.

Therefore it's possible for callers to access an intermediate state of the map, 
for instance in between updating the replica assignment for a given partition.



It would be good to ensure thread-safe access to the data structure in a way 
which makes it harder to introduce such regressions in the future
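
As an illustrative Java reduction of the pattern (the real code is Scala and is 
not reproduced here), the difference between the two update orders looks roughly 
like this:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReplicaMapHolder {
    private final ConcurrentHashMap<Integer, String> remoteReplicas = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Racy variant: a lock-free reader running between clear() and putAll() sees {}.
    void updateAssignmentRacy(Map<Integer, String> newAssignment) {
        lock.writeLock().lock();
        try {
            remoteReplicas.clear();
            remoteReplicas.putAll(newAssignment);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Safer variant: add the new replicas first, then drop the stale ones, so
    // lock-free readers never observe an empty map (only a brief superset).
    void updateAssignmentAddThenRemove(Map<Integer, String> newAssignment) {
        lock.writeLock().lock();
        try {
            remoteReplicas.putAll(newAssignment);
            remoteReplicas.keySet().retainAll(newAssignment.keySet());
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader that does not take the read lock - the access pattern the ticket calls out.
    String getReplicaOrNull(int brokerId) {
        return remoteReplicas.get(brokerId);
    }
}
{code}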



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10301) RemoteReplicasMap can be empty in certain race conditions

2020-07-23 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-10301:
---

 Summary: RemoteReplicasMap can be empty in certain race conditions
 Key: KAFKA-10301
 URL: https://issues.apache.org/jira/browse/KAFKA-10301
 Project: Kafka
  Issue Type: Bug
Reporter: Stanislav Kozlovski
Assignee: Stanislav Kozlovski


In Partition#updateAssignmentAndIsr, we would previously update the 
`partition#remoteReplicasMap` by adding the new replicas to the map and then 
removing the old ones 
(source: https://github.com/apache/kafka/blob/7f9187fe399f3f6b041ca302bede2b3e780491e7/core/src/main/scala/kafka/cluster/Partition.scala#L657)

During a recent refactoring, we changed it to first clear the map and then add 
all the replicas to it 
(source: https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/cluster/Partition.scala#L663)

While this is done in a write lock (`inWriteLock(leaderIsrUpdateLock)`), not 
all callers that access the map structure use a lock. Some examples:
- Partition#updateFollowerFetchState
- DelayedDeleteRecords#tryComplete
- Partition#getReplicaOrException - called in `checkEnoughReplicasReachOffset` 
without a lock, which itself is called by DelayedProduce. I think this can fail 
a  `ReplicaManager#appendRecords` call.

While we want to polish the code to ensure this sort of race condition becomes 
harder (or impossible) to introduce, it sounds safest to revert to the previous 
behavior given the timelines regarding the 2.6 release. Jira X tracks that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] KIP-627: Expose Trogdor-specific JMX Metrics for Tasks and Agents

2020-06-26 Thread Stanislav Kozlovski
+1 (non-binding).

Thanks for the work! I am also happy to see Trogdor being improved

Best,
Stanislav

On Fri, Jun 26, 2020 at 5:34 AM Colin McCabe  wrote:

> +1 (binding).
>
> Thanks, Sam.
>
> best,
> Colin
>
>
> On Thu, Jun 25, 2020, at 18:05, Gwen Shapira wrote:
> > +1 (binding)
> >
> > Thank you, Sam. It is great to see Trogdor getting the care it deserves.
> >
> > On Mon, Jun 22, 2020, 1:46 PM Sam Pal  wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a vote for KIP-627, which adds metrics about
> active
> > > agents and the number of created, running, and done tasks in a Trogdor
> > > cluster:
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-627%3A+Expose+Trogdor-specific+JMX+Metrics+for+Tasks+and+Agents
> > >
> > > Looking forward to hearing from you all!
> > >
> > > Best,
> > > Sam
> > >
> > >
> >
>


-- 
Best,
Stanislav


[jira] [Resolved] (KAFKA-8723) flaky test LeaderElectionCommandTest#testAllTopicPartition

2020-06-02 Thread Stanislav Kozlovski (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski resolved KAFKA-8723.

Resolution: Fixed

> flaky test LeaderElectionCommandTest#testAllTopicPartition
> --
>
> Key: KAFKA-8723
> URL: https://issues.apache.org/jira/browse/KAFKA-8723
> Project: Kafka
>  Issue Type: Bug
>Reporter: Boyang Chen
>    Assignee: Stanislav Kozlovski
>Priority: Major
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/23737/console]
>  
> *15:52:26* kafka.admin.LeaderElectionCommandTest > testAllTopicPartition STARTED
> *15:53:08* kafka.admin.LeaderElectionCommandTest.testAllTopicPartition failed, log available in 
> /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk8-scala2.11@2/core/build/reports/testOutput/kafka.admin.LeaderElectionCommandTest.testAllTopicPartition.test.stdout
> *15:53:08* kafka.admin.LeaderElectionCommandTest > testAllTopicPartition FAILED
> *15:53:08* kafka.common.AdminCommandFailedException: Timeout waiting for election results
> *15:53:08* at kafka.admin.LeaderElectionCommand$.electLeaders(LeaderElectionCommand.scala:133)
> *15:53:08* at kafka.admin.LeaderElectionCommand$.run(LeaderElectionCommand.scala:88)
> *15:53:08* at kafka.admin.LeaderElectionCommand$.main(LeaderElectionCommand.scala:41)
> *15:53:08* at kafka.admin.LeaderElectionCommandTest$$anonfun$testAllTopicPartition$1.apply(LeaderElectionCommandTest.scala:91)
> *15:53:08* at kafka.admin.LeaderElectionCommandTest$$anonfun$testAllTopicPartition$1.apply(LeaderElectionCommandTest.scala:74)
> *15:53:08* at kafka.utils.TestUtils$.resource(TestUtils.scala:1588)
> *15:53:08* at kafka.admin.LeaderElectionCommandTest.testAllTopicPartition(LeaderElectionCommandTest.scala:74)
> *15:53:08* Caused by:
> *15:53:08* org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

2020-04-24 Thread Stanislav Kozlovski
; > > > > > partition replicas, which gets modified rarely by comparison in a
> > > > typical
> > > > > > cluster.
> > > > > >
> > > > > > Hope this addresses your comments.
> > > > > >
> > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > alexandre.dupr...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Gokul,
> > > > > >>
> > > > > >> Thanks for the KIP.
> > > > > >>
> > > > > >> From what I understand, the objective of the new configuration
> is
> > to
> > > > > >> protect a cluster from an overload driven by an excessive number
> > of
> > > > > >> partitions independently from the load handled on the partitions
> > > > > >> themselves. As such, the approach uncouples the data-path load
> > from
> > > > > >> the number of unit of distributions of throughput and intends to
> > > avoid
> > > > > >> the degradation of performance exhibited in the test results
> > > provided
> > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > >>
> > > > > >> Couple of comments:
> > > > > >>
> > > > > >> 900. Multi-tenancy - one concern I would have with a cluster and
> > > > > >> broker-level configuration is that it is possible for a user to
> > > > > >> consume a large proportions of the allocatable partitions within
> > the
> > > > > >> configured limit, leaving other users with not enough partitions
> > to
> > > > > >> satisfy their requirements.
> > > > > >>
> > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> upper-bound
> > > on
> > > > > >> resource consumptions is via client/user quotas. Could this
> > > framework
> > > > > >> be leveraged to add this limit?
> > > > > >>
> > > > > >> 902. Partition assignment - one potential problem with the new
> > > > > >> repartitioning scheme is that if a subset of brokers have
> reached
> > > > > >> their number of assignable partitions, yet their data path is
> > > > > >> under-loaded, new topics and/or partitions will be assigned
> > > > > >> exclusively to other brokers, which could increase the
> likelihood
> > of
> > > > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > > > >> constraint on the number of partitions from the data-path
> > throughput
> > > > > >> can have conflicting requirements.
> > > > > >>
> > > > > >> 903. Rebalancing - as a corollary to 902, external tools used to
> > > > > >> balance ingress throughput may adopt an incremental approach in
> > > > > >> partition re-assignment to redistribute load, and could hit the
> > > limit
> > > > > >> on the number of partitions on a broker when a (too)
> conservative
> > > > > >> limit is used, thereby over-constraining the objective function
> > and
> > > > > >> reducing the migration path.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Alexandre
> > > > > >>
> > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > >>  a écrit :
> > > > > >> >
> > > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > > feedback.
> > > > > >> >
> > > > > >> > Thanks. Regards.
> > > > > >> >
> > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > > >> > gokul24...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > Hi.
> > > > > >> > >
> > > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> > limit
> > > > the
> > > > > >> number
> > > > > >> > > of partitions in a Kafka cluster. Kindly provide feedback on
> > the
> > > > KIP
> > > > > >> which
> > > > > >> > > you can find at
> > > > > >> > >
> > > > > >> > >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > >> > >
> > > > > >> > > I want to specially thank Stanislav Kozlovski who helped in
> > > > > >> formulating
> > > > > >> > > some aspects of the KIP.
> > > > > >> > >
> > > > > >> > > Many thanks,
> > > > > >> > >
> > > > > >> > > Gokul.
> > > > > >> > >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Stanislav


[jira] [Created] (KAFKA-9866) Do not attempt to elect preferred leader replicas which are outside ISR

2020-04-14 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-9866:
--

 Summary: Do not attempt to elect preferred leader replicas which 
are outside ISR
 Key: KAFKA-9866
 URL: https://issues.apache.org/jira/browse/KAFKA-9866
 Project: Kafka
  Issue Type: Improvement
Reporter: Stanislav Kozlovski


The controller automatically triggers a preferred leader election every N 
minutes. It tries to elect the preferred leader for every partition without doing 
basic checks, such as whether the preferred replica is in sync.

This leads to a multitude of errors which cause confusion:
{code:java}
April 14th 2020, 17:01:11.015   [Controller id=0] Partition TOPIC-9 failed to 
complete preferred replica leader election to 1. Leader is still 0{code}
{code:java}
April 14th 2020, 17:01:11.002   [Controller id=0] Error completing replica 
leader election (PREFERRED) for partition TOPIC-9
kafka.common.StateChangeFailedException: Failed to elect leader for partition 
TOPIC-9 under strategy PreferredReplicaPartitionLeaderElectionStrategy {code}
It would be better if the Controller filtered out such elections, did not 
attempt them at all, and perhaps logged a single aggregate INFO-level message instead.
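A minimal sketch of the kind of eligibility check that could gate these attempts (illustrative only; the actual controller logic is in Scala and the names below are hypothetical):
{code:java}
import java.util.List;
import java.util.Set;

// Illustrative helper: a preferred-replica election is only worth attempting if the
// preferred replica (the first replica in the assignment) is alive and currently in the ISR.
final class PreferredLeaderElectionCheck {
    static boolean shouldAttempt(List<Integer> assignment, Set<Integer> isr, Set<Integer> liveBrokers) {
        int preferred = assignment.get(0);
        return isr.contains(preferred) && liveBrokers.contains(preferred);
    }
}
{code}
Partitions filtered out this way could then be counted and reported in one aggregate INFO line.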



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9617) Replica Fetcher can mark partition as failed when max.message.bytes is changed

2020-02-27 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-9617:
--

 Summary: Replica Fetcher can mark partition as failed when 
max.message.bytes is changed
 Key: KAFKA-9617
 URL: https://issues.apache.org/jira/browse/KAFKA-9617
 Project: Kafka
  Issue Type: Bug
Reporter: Stanislav Kozlovski
Assignee: Stanislav Kozlovski


There is a race condition when changing the dynamic max.message.bytes 
config for a topic. A follower replica can fetch a batch that was accepted under 
the old limit but exceeds the new one once the follower has processed the config 
change. When this happens, the replica fetcher catches the unexpected exception, 
marks the partition as failed and stops replicating it.
{code:java}
06:38:46.596Processing override for entityPath: topics/partition-1 with 
config: Map(max.message.bytes -> 512)

06:38:46.597 [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] 
Unexpected error occurred while processing data for partition partition-1 at 
offset 20964
org.apache.kafka.common.errors.RecordTooLargeException: The record batch size 
in the append to partition-1 is 3349 bytes which exceeds the maximum configured 
value of 512.
{code}
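For reference, the dynamic override that triggers this can be applied with the Admin client roughly as follows (a sketch; the topic name, bootstrap server and size are placeholders):
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class LowerMaxMessageBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Lower max.message.bytes on the topic at runtime. A follower that replicates a
            // batch accepted under the old limit after applying this change can hit the
            // failure described above.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "partition-1");
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("max.message.bytes", "512"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                Collections.singletonMap(topic, Collections.singletonList(op))).all().get();
        }
    }
}
{code}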



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9589) LogValidatorTest#testLogAppendTimeNonCompressedV2 is not executed and does not pass

2020-02-21 Thread Stanislav Kozlovski (Jira)
Stanislav Kozlovski created KAFKA-9589:
--

 Summary: LogValidatorTest#testLogAppendTimeNonCompressedV2 is not 
executed and does not pass
 Key: KAFKA-9589
 URL: https://issues.apache.org/jira/browse/KAFKA-9589
 Project: Kafka
  Issue Type: Bug
Reporter: Stanislav Kozlovski


The LogValidatorTest#testLogAppendTimeNonCompressedV2 test does not execute 
because it's missing a '@Test' annotation.
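The first part is a one-line fix along the lines of the sketch below (the real test lives in Scala; this only illustrates the missing JUnit annotation):
{code:java}
import org.junit.Test;

public class LogValidatorTestSketch {
    @Test // without this annotation, JUnit 4 silently skips the method
    public void testLogAppendTimeNonCompressedV2() {
        // ... existing test body and assertions ...
    }
}
{code}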

When executed locally, it fails with the following error:
{code:java}
java.lang.AssertionError: The offset of max timestamp should be 0 
Expected :0
Actual   :2
{code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [ANNOUNCE] New Kafka PMC Members: Colin, Vahid and Manikumar

2020-01-14 Thread Stanislav Kozlovski
Congratulations to all!

Best,
Stanislav

On Tue, Jan 14, 2020 at 9:30 AM Gwen Shapira  wrote:

> Hi everyone,
>
> I'm happy to announce that Colin McCabe, Vahid Hashemian and Manikumar
> Reddy are now members of Apache Kafka PMC.
>
> Colin and Manikumar became committers on Sept 2018 and Vahid on Jan
> 2019. They all contributed many patches, code reviews and participated
> in many KIP discussions. We appreciate their contributions and are
> looking forward to many more to come.
>
> Congrats Colin, Vahid and Manikumar!
>
> Gwen, on behalf of Apache Kafka PMC
>


-- 
Best,
Stanislav


Re: [VOTE] KIP-526: Reduce Producer Metadata Lookups for Large Number of Topics

2020-01-06 Thread Stanislav Kozlovski
+1 (non-binding)

Thanks for the KIP, Brian!

On Thu, Jan 2, 2020 at 7:15 PM Brian Byrne  wrote:

> Hello all,
>
> After further discussion and improvements, I'd like to reinstate the voting
> process.
>
> The updated KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-526
> %3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> 
>
> The continued discussion:
>
> https://lists.apache.org/thread.html/b2f8f830ef04587144cf0840c7d4811bbf0a14f3c459723dbc5acf9e@%3Cdev.kafka.apache.org%3E
>
> I'd be happy to address any further comments/feedback.
>
> Thanks,
> Brian
>
> On Mon, Dec 9, 2019 at 11:02 PM Guozhang Wang  wrote:
>
> > With the concluded summary on the other discussion thread, I'm +1 on the
> > proposal.
> >
> > Thanks Brian!
> >
> > On Tue, Nov 19, 2019 at 8:00 PM deng ziming 
> > wrote:
> >
> > > >
> > > > For new (uncached) topics, one problem here is that we don't know
> which
> > > > partition to map a record to in the event that it has a key or custom
> > > > partitioner, so the RecordAccumulator wouldn't know which
> batch/broker
> > it
> > > > belongs. We'd need an intermediate record queue that subsequently
> moved
> > > the
> > > > records into RecordAccumulators once metadata resolution was
> complete.
> > > For
> > > > known topics, we don't currently block at all in waitOnMetadata.
> > > >
> > >
> > > You are right, I forget this fact, and the intermediate record queue
> will
> > > help, but I have some questions
> > >
> > > if we add an intermediate record queue in KafkaProducer, when should we
> > > move the records into RecordAccumulators?
> > > only NetworkClient is aware of the MetadataResponse, here is the
> > > hierarchical structure of the related classes:
> > > KafkaProducer
> > > Accumulator
> > > Sender
> > > NetworkClient
> > > metadataUpdater.handleCompletedMetadataResponse
> > >
> > > so
> > > 1. we should also add a metadataUpdater to KafkaProducer?
> > > 2. if the topic really does not exists? the intermediate record queue
> > will
> > > become too large?
> > > 3. and should we `block` when the intermediate record queue is too
> large?
> > > and this will again bring the blocking problem?
> > >
> > >
> > >
> > > On Wed, Nov 20, 2019 at 12:40 AM Brian Byrne 
> > wrote:
> > >
> > > > Hi Deng,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > On Mon, Nov 18, 2019 at 6:56 PM deng ziming <
> dengziming1...@gmail.com>
> > > > wrote:
> > > >
> > > > > hi, I reviewed the current code, the ProduceMetadata maintains an
> > > expiry
> > > > > threshold for every topic, every time when we write to a topic we
> > will
> > > > set
> > > > > the expiry time to -1 to indicate it should be updated, this does
> > work
> > > to
> > > > > reduce the size of the topic working set, but the producer will
> > > continue
> > > > > fetching metadata for these topics in every metadata request for
> the
> > > full
> > > > > expiry duration.
> > > > >
> > > >
> > > > Indeed, you are correct, I terribly misread the code here.
> Fortunately
> > > this
> > > > was only a minor optimization in the KIP that's no longer necessary.
> > > >
> > > >
> > > > and we can improve the situation by 2 means:
> > > > > 1. we maintain a refresh threshold for every topic which is for
> > > > example
> > > > > 0.8 * expiry_threshold, and when we send `MetadataRequest` to
> brokers
> > > we
> > > > > just request unknownLeaderTopics + unknownPartitionTopics + topics
> > > > > reach refresh threshold.
> > > > >
> > > >
> > > > Right, this is similar to what I suggested, with a larger window on
> the
> > > > "staleness" that permits for batching to an appropriate size (except
> if
> > > > there's any unknown topics, you'd want to issue the request
> > immediately).
> > > >
> > > >
> > > >
> > > > > 2. we don't invoke KafkaProducer#waitOnMetadata when we call
> > > > > KafkaProducer#send because of we just send data to
> RecordAccumulator,
> > > and
> > > > > before we send data to brokers we will invoke
> > > RecordAccumulator#ready(),
> > > > so
> > > > > we can only invoke waitOnMetadata to block when (number topics
> > > > > reach refresh threshold)>(number of all known topics)*0.2.
> > > > >
> > > >
> > > > For new (uncached) topics, one problem here is that we don't know
> which
> > > > partition to map a record to in the event that it has a key or custom
> > > > partitioner, so the RecordAccumulator wouldn't know which
> batch/broker
> > it
> > > > belongs. We'd need an intermediate record queue that subsequently
> moved
> > > the
> > > > records into RecordAccumulators once metadata resolution was
> complete.
> > > For
> > > > known topics, we don't currently block at all in waitOnMetadata.
> > > >
> > > > The last major point of minimizing producer startup metadata RPCs may
> > > still
> > > > need to be improved, but this would be a 

Re: [DISCUSS] KIP-552: Add interface to handle unused config

2020-01-06 Thread Stanislav Kozlovski
Hey Artur,

Perhaps changing the log level to DEBUG is the simplest approach.

I wonder if other people know what the motivation behind the WARN log was?
I'm struggling to think of a scenario where I'd like to see unused
values printed at anything above DEBUG.
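
For what it's worth, the dedicated-logger option discussed below would make muting
these messages a one-line change on the user's side. A sketch of the log4j config,
assuming the proposed logger name is added (it does not exist today):
{code}
# Sketch only: the logger name below is the one proposed in this thread for
# unused-config warnings; it is not an existing logger in Kafka.
log4j.logger.org.apache.kafka.clients.producer.ProducerConfig.unused=OFF
# The normal config listing logged by ProducerConfig at INFO would be unaffected.
{code}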

Best,
Stanislav

On Mon, Dec 30, 2019 at 12:52 PM Artur Burtsev  wrote:

> Hi,
>
> Indeed changing the log level for the whole AbstractConfig is not an
> option, because logAll is extremely useful.
>
> Grouping warnings into 1 (with the count of unused only) will not be a
> good option for us either. It will still be pretty noisy. Imagine we
> have 32 partitions and scaled up the application to 32 instances then
> we still have 32 warnings per application (instead of 96 now) while we
> would like to have 0 warnings because we are perfectly aware of using
> schema.registry.url and its totally fine, and we don't have to be
> warned every time we start the application. Now imagine we use more
> than one consumer per application, then it will add another
> multiplication factor to these grouped warnings and we still have a
> lot of those. So I would say grouping doesn't help much.
>
> I think adding extra logger like
> "org.apache.kafka.clients.producer.ProducerConfig.unused" could be
> another good option. That would leave the existing interface untouched
> and give everyone an option to mute irrelevant warnings.
>
> To summarize, I still can see 3 options with its pros and cons
> discussed in the thread:
> 1) extra config with interface to handle unused
> 2) change unused warn to debug
> 3) add extra logger for unused
>
> Please let me know what you think.
>
> Thanks,
> Artur
>
> On Mon, Dec 30, 2019 at 11:07 AM Stanislav Kozlovski
>  wrote:
> >
> > Hi all,
> >
> > Would printing all the unused configurations in one line, versus N lines,
> > be more helpful? I know that it would greatly reduce the verbosity in log
> > visualization tools like Kibana while still allowing us to see all the
> > relevant information without the need for an explicit action (e.g
> > changing the log level)
> >
> > Best,
> > Stanislav
> >
> > On Sat, Dec 28, 2019 at 3:13 PM John Roesler 
> wrote:
> >
> > > Hi Artur,
> > >
> > > That’s a good point.
> > >
> > > One thing you can do is log a summary at WARN level, like “27
> > > configurations were ignored. Ignored configurations are logged at DEBUG
> > > level.”
> > >
> > > I looked into the code a little, and these log messages are generated
> in
> > > AbstractConfig (logAll and logUnused). They both use the logger
> associated
> > > with the relevant config class (StreamsConfig, ProducerConfig, etc.).
> The
> > > list of all configs is logged at INFO level, and the list of unused
> configs
> > > is logged at WARN level. This means that it's not possible to silence
> the
> > > unused config messages while still logging the list of all configs. You
> > > could only silence both by setting (for example) ProducerConfig logger
> to
> > > ERROR or OFF.
> > >
> > > If it's desirable to be able to toggle them independently, then you can
> > > create a separate logger for unused configs, named something like
> > > "org.apache.kafka.clients.producer.ProducerConfig.unused". Then, you
> can
> > > leave the log at WARN, so it would continue to be printed by default,
> and
> > > anyone could disable it by setting
> > > "org.apache.kafka.clients.producer.ProducerConfig.unused" to ERROR or
> OFF,
> > > without disturbing the rest of the config log messages.
> > >
> > > It's simpler without the extra logger, but you also get less control.
> Do
> > > you think the extra control is necessary, versus printing a summary at
> WARN
> > > level?
> > > -John
> > >
> > >
> > > On Fri, Dec 27, 2019, at 04:26, Artur Burtsev wrote:
> > > > Hi,
> > > >
> > > > Indeed changing the log level to debug would be the easiest and I think
> > > > that would be a good solution. If no one objects, I'm ready to move
> > > > forward with this approach and submit an MR.
> > > >
> > > > The only minor thing I have – having it at debug log level might make
> > > > it a bit less friendly for developers, especially for those who just
> > > > do the first steps in Kafka. For example, if you misspelled the
> > > > property name and trying to understand why things don't do what you
> > > > expect. Having a wa

[VOTE] KIP-334 Include partitions in exceptions raised during consumer record deserialization/validation

2020-01-04 Thread Stanislav Kozlovski
Hey there,

I'm restarting the vote thread for KIP-334 Include partitions in exceptions
raised during consumer record deserialization/validation


We had some discussions on the previous vote thread which I believe were
resolved. I thought opening a new thread would be cleaner.
Last vote thread:
https://mail-archives.apache.org/mod_mbox/kafka-dev/201910.mbox/%3cCANZZNGy5fj0gkSqjMGbZVG5eBihoHdqeNryr7vOCANo=3gc...@mail.gmail.com%3e
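
For context on what this enables client-side, here is a rough sketch of the poison-pill
handling pattern, assuming the exception ends up exposing the failed partition and
offset as the KIP proposes (type and accessor names below are illustrative, not final API):
{code:java}
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.errors.RecordDeserializationException;

public class PoisonPillSkippingPoll {
    // Without the partition/offset in the exception, the application cannot seek
    // past a record that fails deserialization and is stuck retrying it forever.
    public static ConsumerRecords<String, String> pollSkippingBadRecords(Consumer<String, String> consumer) {
        while (true) {
            try {
                return consumer.poll(Duration.ofMillis(500));
            } catch (RecordDeserializationException e) {
                // Log or dead-letter the bad record elsewhere, then step over it and keep consuming.
                consumer.seek(e.topicPartition(), e.offset() + 1);
            }
        }
    }
}
{code}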

-- 
Best,
Stanislav


Re: [DISCUSS] KIP-552: Add interface to handle unused config

2019-12-30 Thread Stanislav Kozlovski
Hi all,

Would printing all the unused configurations in one line, versus N lines,
be more helpful? I know that it would greatly reduce the verbosity in log
visualization tools like Kibana while still allowing us to see all the
relevant information without the need for an explicit action (e.g.
changing the log level).

Best,
Stanislav

On Sat, Dec 28, 2019 at 3:13 PM John Roesler  wrote:

> Hi Artur,
>
> That’s a good point.
>
> One thing you can do is log a summary at WARN level, like “27
> configurations were ignored. Ignored configurations are logged at DEBUG
> level.”
>
> I looked into the code a little, and these log messages are generated in
> AbstractConfig (logAll and logUnused). They both use the logger associated
> with the relevant config class (StreamsConfig, ProducerConfig, etc.). The
> list of all configs is logged at INFO level, and the list of unused configs
> is logged at WARN level. This means that it's not possible to silence the
> unused config messages while still logging the list of all configs. You
> could only silence both by setting (for example) ProducerConfig logger to
> ERROR or OFF.
>
> If it's desirable to be able to toggle them independently, then you can
> create a separate logger for unused configs, named something like
> "org.apache.kafka.clients.producer.ProducerConfig.unused". Then, you can
> leave the log at WARN, so it would continue to be printed by default, and
> anyone could disable it by setting
> "org.apache.kafka.clients.producer.ProducerConfig.unused" to ERROR or OFF,
> without disturbing the rest of the config log messages.
>
> It's simpler without the extra logger, but you also get less control. Do
> you think the extra control is necessary, versus printing a summary at WARN
> level?
> -John
>
>
> On Fri, Dec 27, 2019, at 04:26, Artur Burtsev wrote:
> > Hi,
> >
> > Indeed changing the log level to debug would be the easiest and I think
> > that would be a good solution. If no one objects, I'm ready to move
> > forward with this approach and submit an MR.
> >
> > The only minor thing I have – having it at debug log level might make
> > it a bit less friendly for developers, especially for those who just
> > do the first steps in Kafka. For example, if you misspelled the
> > property name and trying to understand why things don't do what you
> > expect. Having a warning might save some time in this case. Other than
> > that I cannot see any reasons to have warnings there.
> >
> > Thanks,
> > Artur
> >
> > On Thu, Dec 26, 2019 at 10:01 PM John Roesler 
> wrote:
> > >
> > > Thanks for the KIP, Artur!
> > >
> > > For reference, here is the kip:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-552%3A+Add+interface+to+handle+unused+config
> > >
> > > I agree, these warnings are kind of a nuisance. Would it be feasible
> just to leverage log4j in some way to make it easy to filter these
> messages? For example, we could move those warnings to debug level, or even
> use a separate logger for them.
> > >
> > > Thanks for starting the discussion.
> > > -John
> > >
> > > On Tue, Dec 24, 2019, at 07:23, Artur Burtsev wrote:
> > > > Hi,
> > > >
> > > > This KIP provides a way to deal with a warning "The configuration {}
> > > > was supplied but isn't a known config." when it is not relevant.
> > > >
> > > > Cheers,
> > > > Artur
> > > >
> >
>


-- 
Best,
Stanislav


Re: [DISCUSS] KIP-526: Reduce Producer Metadata Lookups for Large Number of Topics

2019-12-26 Thread Stanislav Kozlovski
Hey Brian,

1. Could we more explicitly clarify the behavior of the algorithm when `|T|
> TARGET_METADATA_FETCH_SIZE`? I assume we ignore the config in that
scenario.
2. Should `targetMetadataFetchSize = Math.max(topicsPerSec / 10, 20)` be
`topicsPerSec * 10`?
3. When is this new algorithm applied? To confirm my understanding - what
is the behavior of `metadata.max.age.ms` after this KIP? Are we adding a
new, more proactive metadata fetch for topics in |U|?

Thanks,
Stanislav

On Thu, Dec 19, 2019 at 11:37 PM Brian Byrne  wrote:

> Hello everyone,
>
> For all interested, please take a look at the proposed algorithm as I'd
> like to get more feedback. I'll call for a vote once the break is over.
>
> Thanks,
> Brian
>
> On Mon, Dec 9, 2019 at 10:18 PM Guozhang Wang  wrote:
>
> > Sounds good, I agree that should not make a big difference in practice.
> >
> > On Mon, Dec 9, 2019 at 2:07 PM Brian Byrne  wrote:
> >
> > > Hi Guozhang,
> > >
> > > I see, we agree on the topic threshold not applying to urgent topics,
> but
> > > differ slightly on what should be considered urgent. I would argue that
> > we
> > > should consider topics nearing the metadata.max.age.ms to be urgent
> > since
> > > they may still be well within the metadata.expiry.ms. That is, the
> > client
> > > still considers these topics to be relevant (not expired), but doesn't
> > want
> > > to incur the latency bubble of having to wait for the metadata to be
> > > re-fetched if it's stale. This could be a frequent case if the
> > > metadata.max.age.ms << metadata.expiry.ms.
> > >
> > > In practice, I wouldn't expect this to make a noticeable difference so
> I
> > > don't have a strong leaning, but the current behavior today is to
> > > aggressively refresh the metadata of stale topics by ensuring a refresh
> > is
> > > triggered before that metadata.max.age.ms duration elapses.
> > >
> > > Thanks,
> > > Brian
> > >
> > >
> > > On Mon, Dec 9, 2019 at 11:57 AM Guozhang Wang 
> > wrote:
> > >
> > > > Hello Brian,
> > > >
> > > > Thanks for your explanation, could you then update the wiki page for
> > the
> > > > algorithm part since when I read it, I thought it was different from
> > the
> > > > above, e.g. urgent topics should not be added just because of max.age
> > > > expiration, but should only be added if there are sending data
> pending.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Dec 9, 2019 at 10:57 AM Brian Byrne 
> > wrote:
> > > >
> > > > > Hi Guozhang,
> > > > >
> > > > > Thanks for the feedback!
> > > > >
> > > > > On Sun, Dec 8, 2019 at 6:25 PM Guozhang Wang 
> > > wrote:
> > > > >
> > > > > > 1. The addition of *metadata.expiry.ms <
> http://metadata.expiry.ms>
> > > > > *should
> > > > > > be included in the public interface. Also its semantics needs
> more
> > > > > > clarification (since previously it is hard-coded internally we do
> > not
> > > > > need
> > > > > > to explain it publicly, but now with the configurable value we do
> > > > need).
> > > > > >
> > > > >
> > > > > This was an oversight. Done.
> > > > >
> > > > >
> > > > > > 2. There are a couple of hard-coded parameters like 25 and 0.5 in
> > the
> > > > > > proposal, maybe we need to explain why these magic values makes
> > sense
> > > > in
> > > > > > common scenarios.
> > > > > >
> > > > >
> > > > > So these are pretty fuzzy numbers, and seemed to be a decent
> balance
> > > > > between trade-offs. I've updated the target size to account for
> > setups
> > > > with
> > > > > a large number of topics or a shorter refresh time, as well as
> added
> > > some
> > > > > light rationale.
> > > > >
> > > > >
> > > > > > 3. In the Urgent set condition, do you actually mean "with no
> > cached
> > > > > > metadata AND there are existing data buffered for the topic"?
> > > > > >
> > > > >
> > > > > Yes, fixed.
> > > > >
> > > > >
> > > > >
> > > > > > One concern I have is whether or not we may introduce a
> regression,
> > > > > > especially during producer startup such that since we only
> require
> > up
> > > > to
> > > > > 25
> > > > > > topics each request, it may cause the send data to be buffered
> more
> > > > time
> > > > > > than now due to metadata not available. I understand this is a
> > > > > acknowledged
> > > > > > trade-off in our design but any regression that may surface to
> > users
> > > > need
> > > > > > to be very carefully considered. I'm wondering, e.g. if we can
> > tweak
> > > > our
> > > > > > algorithm for the Urgent set, e.g. to consider those with non
> > cached
> > > > > > metadata have higher priority than those who have elapsed max.age
> > but
> > > > not
> > > > > > yet have been called for sending. More specifically:
> > > > > >
> > > > > > Urgent: topics that have been requested for sending but no cached
> > > > > metadata,
> > > > > > and topics that have send request failed with e.g. NOT_LEADER.
> > > > > > Non-urgent: topics that are not in Urgent but have expired
> max.age.
> > > > > >
> > > > > > Then when sending 

Re: [DISCUSS] KIP-542: Partition Reassignment Throttling

2019-12-13 Thread Stanislav Kozlovski
Hey Viktor,

I intuitively think that reassignment is a form of (extra) replication, so
I think the non-additive version sounds more natural to me. Would be good
to see what others think

Thanks for summing up what you changed in the KIP's wording here.

Best,
Stanislav

On Fri, Dec 13, 2019 at 3:39 PM Viktor Somogyi-Vass 
wrote:

> Hey Stan,
>
> 1. Yes.
>
> 2. Yes and no :). My earlier suggestion was exactly that. In the last reply
> to you I meant that if the replication throttle is 20 and the reassignment
> throttle is 10 then we'd still have 20 total throttle but 10 of that can be
> used for general replication and 10 again for reassignment. I think that
> either your version or this one can be good solutions, the main difference
> is how do you think about reassignment.
> If we think that reassignment is a special kind of replication then it
> might make sense to treat it as that and for me it sounds logical that we
> sort of count under the replication quota. It also protects you from
> setting too high value as you won't be able to configure something higher
> than replication.throttled.rate (but then what's left for general
> replication). On the other side users may have to increase their
> replication.throttled.rate if they want to increase or set their
> reassignment quota. This doesn't really play when you treat them under
> non-related quotas but you have to keep your total quota
> (replication+reassignment) in mind. Also I don't usually see customers
> using replication throttling during normal operation so for them it might
> be better to use the additive version (where 20 + 10 = 30) from an
> operational perspective (less things to configure).
> I'll add this non-additive version to the rejected alternatives.
>
> Considering the mentioned drawbacks I think it's better to go with the
> additive one.
> The updated KIP:
> "The possible configuration variations are:
> - replication.throttled.rate is set but reassignment.throttled.rate isn't
> (or -1): any kind of replication (so including reassignment) can take up to
> replication.throttled.rate bytes.
> - replication.throttled.rate and reassignment.throttled.rate both set: both
> can use a bandwidth up to the configured limit and the total replication
> limit will be reassignment.throttled.rate + replication.throttled.rate
> - replication.throttled.rate is not set but reassignment.throttled.rate is
> set: in this case general replication has no bandwidth limits but
> reassignment.throttled.rate has the configured limits.
> - neither replication.throttled.rate nor reassignment.throttled.rate are
> set (or -1): no throttling is set on any replication."
>
> 3. Yea, the motivation section might be a bit poorly worded as both you and
> Ismael pointed out problems, so let me rephrase it:
> "A user is able to specify the partition and the throttle rate but it will
> be applied to all non-ISR replication traffic. This is can be undesirable
> because during reassignment it also applies to non-reassignment replication
> and causes a replica to be throttled if it falls out of ISR. Also if
> leadership changes during reassignment, the throttles also have to be
> changed manually."
>
> Viktor
>
> On Tue, Dec 10, 2019 at 8:16 PM Stanislav Kozlovski <
> stanis...@confluent.io>
> wrote:
>
> > Hey Viktor,
> >
> > I like your latest idea regarding the replication/reassignment configs
> > interplay - I think it makes sense for replication to always be higher. A
> > small matrix of possibilities in the KIP may be useful to future readers
> > (users)
> > To be extra clear:
> > 1. if reassignment.throttle is -1, reassignment traffic is counted with
> > replication traffic against replication.throttle
> > 2. if replication.throttle is 20 and reassignment.throttle is 10, we
> have a
> > 30 total throttle
> > Is my understanding correct?
> >
> > Regarding the KIP - the motivation states
> >
> > > So a user is able to specify the partition and the throttle rate but it
> > will be applied to all non-ISR replication traffic. This is undesirable
> > because if a node that is being throttled falls out of ISR it would
> further
> > prevent it from catching up.
> >
> > This KIP does not solve this problem, right?
> > Or did you mean to mention the problem where reassignment replicas would
> > eat up the throttle and further limit the non-ISR "original" replicas
> from
> > catching up?
> >
> > Best,
> > Stanislav
> >
> > On Tue, Dec 10, 2019 at 9:09 AM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com>
> > wrote:
> >
> > > This config will only be ap

Re: [DISCUSS] KIP-542: Partition Reassignment Throttling

2019-12-10 Thread Stanislav Kozlovski
Hey Viktor,

I like your latest idea regarding the replication/reassignment configs
interplay - I think it makes sense for replication to always be higher. A
small matrix of possibilities in the KIP may be useful to future readers
(users)
To be extra clear:
1. if reassignment.throttle is -1, reassignment traffic is counted with
replication traffic against replication.throttle
2. if replication.throttle is 20 and reassignment.throttle is 10, we have a
30 total throttle
Is my understanding correct?

Regarding the KIP - the motivation states

> So a user is able to specify the partition and the throttle rate but it
will be applied to all non-ISR replication traffic. This is undesirable
because if a node that is being throttled falls out of ISR it would further
prevent it from catching up.

This KIP does not solve this problem, right?
Or did you mean to mention the problem where reassignment replicas would
eat up the throttle and further limit the non-ISR "original" replicas from
catching up?

Best,
Stanislav

On Tue, Dec 10, 2019 at 9:09 AM Viktor Somogyi-Vass 
wrote:

> This config will only be applied to those replicas which are reassigning
> and not yet in ISR. When they become ISR then reassignment throttling stops
> altogether and won't apply when they fall out of ISR. Specifically
> the validity of the config spans from the point when a reassignment is
> propagated by the adding_replicas field in the LeaderAndIsr request until
> the broker gets another LeaderAndIsr request saying that the new replica is
> added and in ISR. Furthermore the config will be applied only the actual
> leader and follower so if the leader changes in the meanwhile the
> throttling changes with it (again based on the LeaderAndIsr requests).
>
> For instance when a new broker is added to offload some partitions there,
> it will be safer to use this config instead of general fetch throttling for
> this very reason: when an existing partition that is being reassigned falls
> out of ISR then it will be propagated via the LeaderAndIsr request so
> throttling also changes. This removes the need for changing the configs
> manually and would give an easy way for people to configure throttling yet
> would make better efforts to not throttle what's not needed to be throttled
> (the replica which is falling out of ISR).
>
> Viktor
>
> On Fri, Dec 6, 2019 at 5:12 PM Ismael Juma  wrote:
>
> > My concern is that we're very focused on reassignment where I think users
> > enable throttling to avoid overwhelming brokers with replica catch up
> > traffic (typically disk and/or bandwidth). The current approach achieves
> > that by not throttling ISR replication.
> >
> > The downside is that when a broker falls out of the ISR, it may suddenly
> > get throttled and never catch up. However, if the throttle can cause this
> > kind of issue, then it's broken for replicas being reassigned too, so one
> > could say that it's a configuration error.
> >
> > Do we have specific scenarios that would be solved by the proposed
> change?
> >
> > Ismael
> >
> > On Fri, Dec 6, 2019 at 2:26 AM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com>
> > wrote:
> >
> > > Thanks for the question. I think it depends on how the user will try to
> > fix
> > > it.
> > > - If they just replace the disk then I think it shouldn't count as a
> > > reassignment and should be allocated under the normal replication
> quotas.
> > > In this case there is no reassignment going on as far as I can tell,
> the
> > > broker shuts down serving replicas from that dir/disk, notifies the
> > > controller which changes the leadership. When the disk is fixed the
> > broker
> > > will be restarted to pick up the changes and it starts the replication
> > from
> > > the current leader.
> > > - If the user reassigns the partitions to other brokers then it will
> fall
> > > under the reassignment traffic.
> > > Also if the user moves a partition to a different disk it would also
> > count
> > > as normal replication as it technically not a reassignment but an
> > > alter-replica-dir event but it's still done with the reassignment tool,
> > so
> > > I'd keep the current functionality of the
> > > --replica-alter-log-dirs-throttle.
> > > Is this aligned with your thinking?
> > >
> > > Viktor
> > >
> > > On Wed, Dec 4, 2019 at 2:47 PM Ismael Juma  wrote:
> > >
> > > > Thanks Viktor. How do we intend to handle the case where a broker
> loses
> > > its
> > > > disk and has to catch up from the beginning?
> > > >
> > > > Ismael
> > > >
> > > > On Wed, Dec 4, 2019, 4:31 AM Viktor Somogyi-Vass <
> > > viktorsomo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for the notice Ismael, KAFKA-4313 fixed this issue indeed.
> > I've
> > > > > updated the KIP.
> > > > >
> > > > > Viktor
> > > > >
> > > > > On Tue, Dec 3, 2019 at 3:28 PM Ismael Juma 
> > wrote:
> > > > >
> > > > > > Hi Viktor,
> > > > > >
> > > > > > The KIP states:
> > > > > >
> > > > > > "KIP-73
> > > > > > <
> > > > > >
> > > > >
> > > >
> 

Re: [DISCUSS] KIP-435: Internal Partition Reassignment Batching

2019-12-04 Thread Stanislav Kozlovski
ktor Somogyi-Vass <
> > viktorsomo...@gmail.com> wrote:
> >
> >> Hi Stan,
> >>
> >> I meant the following (maybe rare) scenario - we have replicas [1, 2, 3]
> >> on
> >> a lot of partitions and the user runs a massive rebalance to change them
> >> all to [3, 2, 1]. In the old behavior, I think that this would not do
> >> anything but simply change the replica set in ZK.
> >> Then, the user could run kafka-preferred-replica-election.sh on a given
> >> set
> >> of partitions to make sure the new leader 3 gets elected.
> >>
> >> I thought the old algorithm would elect 3 as the leader in this case
> >> right away at the end but I have to double check this. In any case I
> think
> >> it would make sense in the new algorithm if we elected the new preferred
> >> leader right away, regardless of the new leader is chosen from the
> existing
> >> replicas or not. If the whole reassignment is in fact just changing the
> >> replica order then either way it is a simple (trivial) operation and
> doing
> >> it batched wouldn't slow it down much as there is no data movement
> >> involved. If the reassignment is mixed, meaning it contains reordering
> and
> >> real movement as well then in fact it would help to even out the load
> >> faster as data movements can get long. For instance in case of a
> >> reassignment batch of two partitions concurrently: P1: (1,2,3) ->
> (3,2,1)
> >> and P2: (4,5,6) -> (7,8,9) the P2 reassignment would elect a new leader
> but
> >> P1 wouldn't and it wouldn't help the goal of normalizing traffic on
> broker
> >> 1 that much.
> >> Again, I'll have to check how the current algorithm works and if it has
> >> any unknown drawbacks to implement what I sketched up above.
> >>
> >> As for generic preferred leader election when called from the admin api
> >> or the auto leader balance feature I think you're right that we should
> >> leave it as it is. It doesn't involve any data movement so it's fairly
> fast
> >> and it normalizes the cluster state quickly.
> >>
> >> Viktor
> >>
> >> On Tue, Jul 9, 2019 at 9:04 PM Stanislav Kozlovski <
> >> stanis...@confluent.io> wrote:
> >>
> >>> Hey Viktor,
> >>>
> >>>  I think it is intuitive if they are on a global level...If we applied
> >>> > them on every batch then we
> >>> > couldn't really guarantee any limits as the user would be able to get
> >>> > around it with submitting lots of reassignments.
> >>>
> >>>
> >>> Agreed. Could we mention this explicitly in the KIP?
> >>>
> >>> Also if I understand correctly, AlterPartitionAssignmentsRequest would
> >>> be a
> >>> > partition level batching, isn't it? So if you submit 10 partitions at
> >>> once
> >>> > then they'll all be started by the controller immediately as per my
> >>> > understanding.
> >>>
> >>>
> >>> Yep, absolutely
> >>>
> >>> I've raised the ordering problem on the discussion of KIP-455 in a bit
> >>> > different form and as I remember the verdict there was that we
> >>> shouldn't
> >>> > expose ordering as an API. It might not be easy as you say and there
> >>> might
> >>> > be much better strategies to follow (like disk or network utilization
> >>> > goals). Therefore I'll remove this section from the KIP.
> >>>
> >>>
> >>> Sounds good to me.
> >>>
> >>> I'm not sure I get this scenario. So are you saying that after they
> >>> > submitted a reassignment they also submit a preferred leader change?
> >>> > In my mind what I would do is:
> >>> > i) make auto.leader.rebalance.enable to obey the leader movement
> limit
> >>> as
> >>> > this way it will be easier to calculate the reassignment batches.
> >>> >
> >>>
> >>> Sorry, this is my fault -- I should have been more clear.
> >>> First, I didn't think through this well enough at the time, I don't
> >>> think.
> >>> If we have replicas=[1, 2, 3] and we reassign them to [4, 5, 6], it is
> >>> obvious that a leader shift will happen. Your KIP proposes we shift
> >>> replicas 1 and 4 first.
> >>>
> >>> I meant the following (maybe rare) scenario - we hav

Re: [DISCUSS] KIP-548 Add Option to enforce rack-aware custom partition reassignment execution

2019-11-22 Thread Stanislav Kozlovski
Hello Satish,

Could you provide a link to the KIP? I am unable to find it in the KIP
parent page
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

Thanks,
Stanislav

On Fri, Nov 22, 2019 at 8:21 AM Satish Bellapu 
wrote:

> Hi All,
>
> This [KIP-548] is basically extending the capabilities of
> "kafka-reassign-partitions" tool by adding rack-aware verification option
> when used along with custom or manually generated reassignment planner with
> --execute scenario.
>
> @sbellapu.
>


-- 
Best,
Stanislav


Re: [VOTE] KIP-544: Make metrics exposed via JMX configurable

2019-11-11 Thread Stanislav Kozlovski
+1 (non-binding). Thanks Xavier


On Sat, Nov 9, 2019 at 9:54 AM Manikumar  wrote:

> +1 (binding). Thanks for the KIP.
>
>
> On Sat, Nov 9, 2019 at 3:11 PM Alexandre Dupriez <
> alexandre.dupr...@gmail.com> wrote:
>
> > +1 (non-binding)
> >
> > Le ven. 8 nov. 2019 à 20:21, Bill Bejeck  a écrit :
> >
> > > Thanks for the KIP, +1 (binding).
> > >
> > > -Bill
> > >
> > > On Fri, Nov 8, 2019 at 1:28 PM Gwen Shapira  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Thu, Nov 7, 2019 at 11:17 AM Xavier Léauté 
> > > wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'd like to open up the vote for KIP-544 on making exposed JMX
> > metrics
> > > > more
> > > > > configurable.
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-544%3A+Make+metrics+exposed+via+JMX+configurable
> > > > >
> > > > > Thank you!
> > > > > Xavier
> > > >
> > >
> >
>


-- 
Best,
Stanislav


Re: [DISCUSS] KIP-542: Partition Reassignment Throttling

2019-11-04 Thread Stanislav Kozlovski
Hi Viktor,

> As for the first question I think is no need for *.throttled.replicas in
case of reassignment because the LeaderAndIsrRequest exactly specifies the
replicas needed to be throttled.

Exactly. I also can't envision scenarios where we would like to throttle
the reassignment traffic to only a subset of the reassigning replicas.

> For instance a bootstrapping server where all replicas are throttled and
there are reassigning replicas and the reassignment throttle set higher I
think we should still apply the replication throttle to ensure the broker
won't have problems. What do you think?

If we always take the lowest value, this means that the reassignment
throttle must always be equal to or lower than the replication throttle.
Doesn't that mean that the reassigning partitions may never catch up? I
guess not, since we expect to always be moving less than the total number
of partitions at one time.
I have mixed feelings about this - I like the flexibility of being able to
configure whatever value we please, yet I struggle to come up with a
scenario where we would want a higher reassignment throttle than
replication. Perhaps your suggestion is better.

This begs another question - since we're separating the replication
throttle from the reassignment throttle, the maximum traffic a broker may
replicate now becomes `replication.throttled.rate` + `
reassignment.throttled.rate`.
Seems like we would benefit from having a total cap to ensure users don't
shoot themselves in the foot.

We could have a new config that denotes the total possible throttle rate
and then divide it between reassignment and replication. But that assumes
that we would set the replication.throttled.rate much lower than what the
broker could handle.

Perhaps the best approach would be to denote how much the broker can handle
(total.replication.throttle.rate) and then allow only up to N% of that to go
towards reassignments (reassignment.throttled.rate) in a best-effort way
(preferring replication traffic). That sounds tricky to implement, though.
Interested to hear what others think.
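
To make the sums concrete, here is roughly what the additive scheme would look like on a
broker (names are the wildcarded forms used in this thread; the reassignment config is
only proposed and the values are arbitrary):
{code}
# Illustrative sketch only -- reassignment.throttled.rate is a proposed config, not an existing one.
replication.throttled.rate=20971520     # ~20 MB/s for (non-reassignment) replication traffic
reassignment.throttled.rate=10485760    # ~10 MB/s reserved for reassigning replicas
# Under the additive interpretation the broker could replicate up to ~30 MB/s in total,
# which is what a total cap (or a percentage split) would guard against.
{code}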

Best,
Stanislav


On Mon, Nov 4, 2019 at 11:08 AM Viktor Somogyi-Vass 
wrote:

> Hey Stan,
>
> > We will introduce two new configs in order to eventually replace
> *.replication.throttled.rate.
> Just to clarify, you mean to replace said config in the context of
> reassignment throttling, right? We are not planning to remove that config
>
> Yes, I don't want to remove that config either. Removed that sentence.
>
> And also to clarify, *.throttled.replicas will not apply to the new
> *reassignment* configs, correct? We will throttle all reassigning replicas.
> (I am +1 on this, I believe it is easier to reason about. We could always
> add a new config later)
>
> Are you asking whether there is a need for a
> leader.reassignment.throttled.replicas and
> follower.reassignment.throttled.replicas config or are you interested in
> the behavior between the old and the new configs?
> As for the first question I think is no need for *.throttled.replicas in
> case of reassignment because the LeaderAndIsrRequest exactly specifies the
> replicas needed to be throttled.
> As for the second, see below.
>
> I have one comment about backwards-compatibility - should we ensure that
> the old `*.replication.throttled.rate` and `*.throttled.replicas` still
> apply to reassigning traffic if set? We could have the new config take
> precedence, but still preserve backwards compatibility.
>
> Sure, we should apply replication throttling to reassignment too if set.
> But instead of the new taking precedence I'd apply whichever has lower
> value.
> For instance a bootstrapping server where all replicas are throttled and
> there are reassigning replicas and the reassignment throttle set higher I
> think we should still apply the replication throttle to ensure the broker
> won't have problems. What do you think?
>
> Thanks,
> Viktor
>
>
> On Fri, Nov 1, 2019 at 9:57 AM Stanislav Kozlovski  >
> wrote:
>
> > Hey Viktor. Thanks for the KIP!
> >
> > > We will introduce two new configs in order to eventually replace
> > *.replication.throttled.rate.
> > Just to clarify, you mean to replace said config in the context of
> > reassignment throttling, right? We are not planning to remove that config
> >
> > And also to clarify, *.throttled.replicas will not apply to the new
> > *reassignment* configs, correct? We will throttle all reassigning
> replicas.
> > (I am +1 on this, I believe it is easier to reason about. We could always
> > add a new config later)
> >
> > I have one comment about backwards-compatibility - should we ensure that
> > the old `*.replication.throttled.rate` and `*.throttled.replicas` still
> > apply to reassigning traffic if set? We could

Re: [DISCUSS] KIP-536: Propagate broker timestamp to Admin API

2019-11-01 Thread Stanislav Kozlovski
Hey Noa,

KIP-436 added a JMX metric in Kafka for this exact use-case, called
`start-time-ms`. Perhaps it would be useful to name this public interface
in the same way for consistency.

Could you update the KIP to include the specific RPC changes regarding the
metadata request/responses? Here is a recent example of how to portray the
changes -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-525+-+Return+topic+metadata+and+configs+in+CreateTopics+response
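
For reference, here is a minimal sketch of the Admin API surface being extended; today a
client can only get id/host/port/rack per node, and the broker start timestamp would be
a new accessor whose exact name the KIP should define (the code below does not assume it):
{code:java}
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.Node;

public class DescribeClusterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            Collection<Node> nodes = admin.describeCluster().nodes().get();
            for (Node node : nodes) {
                // A start-time field on Node (mirroring the start-time-ms JMX metric)
                // would be surfaced here once the KIP lands.
                System.out.printf("broker %d at %s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}
{code}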

Thanks,
Stanislav!

On Mon, Oct 14, 2019 at 2:46 PM Noa Resare  wrote:

> We are in the process of migrating the pieces of automation that currently
> reads and modifies zookeeper state to use the Admin API.
>
> One of the things that we miss doing this is access to the start time of
> brokers in a cluster which is used by our automation doing rolling
> restarts. We currently read this from the timestamp field from the
> epehmeral broker znodes in zookeeper. To address this limitation, I have
> authored KIP-536, that proposes adding a timestamp field to the Node class
> that the AdminClient.describeCluster() method returns.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-536%3A+Propagate+broker+timestamp+to+Admin+API
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-536:+Propagate+broker+timestamp+to+Admin+API
> >
>
> Any and all feedback is most welcome
>
> cheers
> noa



-- 
Best,
Stanislav


Re: [VOTE] KIP-541: Create a fetch.max.bytes configuration for the broker

2019-11-01 Thread Stanislav Kozlovski
+1 (non-binding).
Thanks!
Stanislav

On Fri, Oct 25, 2019 at 2:29 PM David Arthur  wrote:

> +1 binding, this will be a nice improvement. Thanks, Colin!
>
> -David
>
> On Fri, Oct 25, 2019 at 4:33 AM Tom Bentley  wrote:
>
> > +1 nb. Thanks!
> >
> > On Fri, Oct 25, 2019 at 7:43 AM Ismael Juma  wrote:
> >
> > > +1 (binding)
> > >
> > > On Thu, Oct 24, 2019, 4:56 PM Colin McCabe  wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start the vote on KIP-541: Create a fetch.max.bytes
> > > > configuration for the broker.
> > > >
> > > > KIP: https://cwiki.apache.org/confluence/x/4g73Bw
> > > >
> > > > Discussion thread:
> > > >
> > >
> >
> https://lists.apache.org/thread.html/9d9dde93a07e1f1fc8d9f182f94f4bda9d016c5e9f3c8541cdc6f53b@%3Cdev.kafka.apache.org%3E
> > > >
> > > > cheers,
> > > > Colin
> > > >
> > >
> >
>
>
> --
> David Arthur
>


-- 
Best,
Stanislav


  1   2   3   4   >