Re: [DISCUSS] KIP-1004: Enforce tasks.max property in Kafka Connect

2023-11-22 Thread Yash Mayya
Hi Chris,

Thanks for the well written and comprehensive KIP! Given that we're already
past the KIP freeze deadline for 3.7.0 (
https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0) and
there may not be a 3.8.0 release before the 4.0.0 release, would we then be
forced to punt the removal of "tasks.max.enforce" to a future 5.0.0
release? I don't have any other comments, and the proposed changes make
sense to me.

Thanks,
Yash

On Mon, Nov 20, 2023 at 10:50 PM Chris Egerton 
wrote:

> Hi Hector,
>
> Thanks for taking a look! I think the key difference between the proposed
> behavior and the rejected alternative is that the set of tasks that will be
> running with the former is still a complete set of tasks, whereas the set
> of tasks in the latter is a subset of tasks. Also noteworthy but slightly
> less important: the problem will be more visible to users with the former
> (the connector will still be marked FAILED) than with the latter.
>
> Cheers,
>
> Chris
>
> On Tue, Nov 21, 2023, 00:53 Hector Geraldino (BLOOMBERG/ 919 3RD A) <
> hgerald...@bloomberg.net> wrote:
>
> > Thanks for the KIP Chris, adding this check makes total sense.
> >
> > I do have one question. The second paragraph in the Public Interfaces
> > section states:
> >
> > "If the connector generated excessive tasks after being reconfigured,
> then
> > any existing tasks for the connector will be allowed to continue running,
> > unless that existing set of tasks also exceeds the tasks.max property."
> >
> > Would not failing the connector land us in the second scenario of
> > 'Rejected Alternatives'?
> >
> > From: dev@kafka.apache.org At: 11/11/23 00:27:44 UTC-5:00 To:
> > dev@kafka.apache.org
> > Subject: [DISCUSS] KIP-1004: Enforce tasks.max property in Kafka Connect
> >
> > Hi all,
> >
> > I'd like to open up KIP-1004 for discussion:
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1004%3A+Enforce+tasks.max+property+in+Kafka+Connect
> >
> > As a brief summary: this KIP proposes that the Kafka Connect runtime
> start
> > failing connectors that generate a greater number of tasks than the
> > tasks.max property, with an optional emergency override that can be used
> to
> > continue running these (probably-buggy) connectors if absolutely
> necessary.
> >
> > I'll be taking time off most of the next three weeks, so response latency
> > may be a bit higher than usual, but I wanted to kick off the discussion
> in
> > case we can land this in time for the upcoming 3.7.0 release.
> >
> > Cheers,
> >
> > Chris
> >
> >
> >
>
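As a rough illustration of the enforcement described in the KIP summary quoted above, here is a hypothetical sketch; the class, method, and variable names are made up for illustration and are not the actual Connect runtime code:

import java.util.List;
import java.util.Map;

class TasksMaxEnforcementSketch {
    // If a connector returns more task configs than tasks.max, fail it unless
    // the proposed tasks.max.enforce override has been explicitly disabled.
    static void checkTaskConfigs(String connectorName,
                                 List<Map<String, String>> taskConfigs,
                                 int tasksMax,
                                 boolean enforceTasksMax) {
        if (taskConfigs.size() <= tasksMax) {
            return; // within bounds; nothing to do
        }
        String msg = "Connector " + connectorName + " generated " + taskConfigs.size()
                + " task configs, which exceeds tasks.max=" + tasksMax;
        if (enforceTasksMax) {
            // The worker would mark the connector FAILED with this message.
            throw new IllegalStateException(msg);
        }
        // Emergency override: keep running the (probably buggy) connector, but warn.
        System.err.println("WARN " + msg + " (tasks.max.enforce=false)");
    }
}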


Re: [DISCUSS] Road to Kafka 4.0

2023-11-22 Thread Colin McCabe
On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote:
> Yes, KIP-853 and disk failure support are both very important missing
> features. For the disk failure support, I don't think this is a
> "good-to-have-feature", it should be a "must-have" IMO. We can't announce
> the 4.0 release without a good solution for disk failure in KRaft.

Hi Luke,

Thanks for the reply.

Controller disk failure support is not missing from KRaft. I described how to 
handle controller disk failures earlier in this thread.

I should note here that the broker in ZooKeeper mode also requires manual 
handling of disk failures. Restarting a broker with the same ID, but an empty 
disk, breaks the invariants of replication when in ZK mode. Consider:

1. Broker 1 goes down. A ZK state change notification for /brokers fires and 
goes on the controller queue.

2. Broker 1 comes back up with an empty disk.

3. The controller processes the ZK state change notification for /brokers. 
Since broker 1 is up, no action is taken.

4. Now broker 1 is still in the ISR for any partitions it was in previously, but has 
no data. If it is, or becomes, the leader for any of those partitions, irreversible 
data loss will occur.

This problem is more than theoretical. We at Confluent have observed it in 
production and put in place special workarounds for the ZK clusters we still 
have.

KRaft has never had this problem because brokers are removed from ISRs when a 
new incarnation of the broker registers.

So perhaps ZK mode is not ready for production for Aiven, since disk failures do 
in fact require special handling there (and/or bringing up new nodes with empty 
disks, which seems to be their main concern).

>
> It’s also worth thinking about how Apache Kafka users who depend on JBOD
> might look at the risks of not having a 3.8 release. JBOD support on KRaft
> is planned to be added in 3.7, and is still in progress so far. So it’s
> hard to say whether it’s a blocker or not. But in practice, even if the feature
> makes it into 3.7 in time, a lot of new code for this feature is unlikely to be
> entirely bug-free. We need to maintain the confidence of those users, and
> forcing them to migrate through 3.7 where this new code is hardly
> battle-tested doesn’t appear to do that.
>

As Ismael said, if there are JBOD bugs in 3.7, we will do follow-on point 
releases to address them.

> Our goal for 4.0 should be that all the “main” features in KRaft are in a
> production-ready state. To reach the goal, I think having one more release
> makes sense. We can have different opinions about what the “main features”
> in KRaft are, but we should all agree that JBOD is one of them.

The current plan is for JBOD to be production-ready in the 3.7 branch.

The other features of KRaft have been in production-ready state since the 3.3 
release. (Well, except for delegation tokens and SCRAM, which were implemented 
in 3.5 and 3.6.)

> I totally agree with you. We can keep delaying the 4.0 release forever. I'd
> also like to draw a line somewhere. So, in my opinion, the 3.8 release is the
> line. No 3.9, 3.10 releases after that. If this is the decision, will your
> concern about this infinite loop disappear?

Well, the line was drawn in KIP-833. If we redraw it, what is to stop us from 
redrawing it again and again?

>
> Final note: Speaking of the missing features, I can always cooperate with
> you and all other community contributors to make them happen, like we have
> discussed earlier. Just let me know.
>

Thanks, Luke. I appreciate the offer.

But, on the KRaft side, I still maintain that nothing is missing except JBOD, 
which we already have a plan for.

best,
Colin


> Thank you.
> Luke
>
> On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe  wrote:
>
>> On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote:
>> > Hi Colin,
>> >
>> > I think it's great that Confluent runs KRaft clusters in production,
>> > and it means that it is production ready for Confluent and its users.
>> > But luckily for Kafka, the community is bigger than this (self-managed
>> > in the cloud or on-prem, or customers of other SaaS companies).
>>
>> Hi Josep,
>>
>> Confluent is not the only company using or developing KRaft. Most of the
>> big organizations developing Kafka are involved. I mentioned Confluent's
>> deployments because I wanted to be clear that KRaft mode is not
>> experimental or new. Talking about software in production is a good way to
>> clear up these misconceptions.
>>
>> Indeed, KRaft mode is many years old. It started around 2020, and became
>> production-ready in AK 3.3 in 2022. ZK mode was deprecated in AK 3.5, which
>> was released June 2023. If we release AK 4.0 around April (or maybe a month
>> or two later) then that will be almost a full year between deprecation and
>> removal of ZK mode. We've talked about this a lot, in KIPs, in Apache blog
>> posts, at conferences, and so forth.
>>
>> > We've heard at least from 1 SaaS company, Aiven (disclaimer, it is my
>> > employer) where the current feature set makes it not tri

Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2409

2023-11-22 Thread Apache Jenkins Server
See 




Re: Apache Kafka 3.6.1 release

2023-11-22 Thread Kirk True
Hi Mickael,

Is there still time to put in another fix? I'd like to propose we include 
https://issues.apache.org/jira/browse/KAFKA-15817.

Thanks,
Kirk

On Fri, Nov 17, 2023, at 6:15 AM, Mickael Maison wrote:
> Hi,
> 
> Quick update on 3.6.1:
> We have 1 blocker issue left: KAFKA-15802. Once it's fixed I'll start
> building the first release candidate.
> 
> Thanks,
> Mickael
> 
> On Mon, Nov 13, 2023 at 6:01 PM Mickael Maison  
> wrote:
> >
> > Hi,
> >
> > Ok, I've put together a release plan:
> > https://cwiki.apache.org/confluence/display/KAFKA/Release+plan+3.6.1
> >
> > I'll start chasing the owners of the few open issues. If there are any
> > other issues you'd like to have in 3.6.1, please let me know.
> >
> > Thanks,
> > Mickael
> >
> > On Mon, Nov 13, 2023 at 4:26 PM Divij Vaidya  
> > wrote:
> > >
> > > Thanks for volunteering Mickael. Please feel free to take over this 
> > > thread.
> > >
> > > From a Tiered Storage perspective, there is a long list of known bugs in
> > > 3.6.0 [1], but we shouldn't wait on fixing them all for 3.6.1. This should
> > > be OK since this feature is in early access. We will make a best effort to
> > > merge some of the critical ones by next week. I will nudge the contributors
> > > where things have been pending for a while.
> > >
> > > [1] https://issues.apache.org/jira/browse/KAFKA-15420
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Mon, Nov 13, 2023 at 4:10 PM Mickael Maison 
> > > wrote:
> > >
> > > > Hi Divij,
> > > >
> > > > You beat me to it, I was about to propose doing a 3.6.1 release later
> > > > this week.
> > > > While there are only a dozen or so issues fixed since 3.6.0, as
> > > > mentioned there are a few important dependency upgrades that would be
> > > > good to release.
> > > >
> > > > I'm happy to volunteer to run the release if we agree to release
> > > > sooner than initially proposed.
> > > > There seem to be only a few unresolved Jiras targeting 3.6.1 [0] (all
> > > > have PRs, with some of them even already merged!).
> > > >
> > > > 0:
> > > > https://issues.apache.org/jira/browse/KAFKA-15552?jql=project%20%3D%20KAFKA%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%203.6.1%20ORDER%20BY%20priority%20DESC%2C%20status%20DESC%2C%20updated%20DESC
> > > >
> > > > Thanks,
> > > > Mickael
> > > >
> > > > On Mon, Nov 13, 2023 at 3:57 PM Divij Vaidya 
> > > > wrote:
> > > > >
> > > > > Hi Ismael, I am all in favour of frequent releases. Sooner is always
> > > > > better. Unfortunately, I won't have the bandwidth to volunteer for a
> > > > > release in December. If someone else volunteers to be RM prior to this
> > > > > timeline, I would be happy to cede the RM role to them, but in the
> > > > > worst-case scenario, my offer to volunteer for a January release could
> > > > > be considered as a backup.
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Nov 13, 2023 at 3:40 PM Ismael Juma  
> > > > > wrote:
> > > > >
> > > > > > Hi Divij,
> > > > > >
> > > > > > I think we should be releasing 3.6.1 this year rather than next. There
> > > > > > are some critical bugs in 3.6.0 and I don't think we should be waiting
> > > > > > that long to fix them. What do you think?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Mon, Nov 13, 2023 at 6:32 AM Divij Vaidya 
> > > > > > 
> > > > > > wrote:
> > > > > >
> > > > > > > Hey folks,
> > > > > > >
> > > > > > >
> > > > > > > I'd like to volunteer to be the release manager for a bug fix
> > > > > > > release of the 3.6 line. This will be the first bug fix release of
> > > > > > > this line and will be version 3.6.1. It would contain critical bug
> > > > > > > fixes for features such as transaction verification [1], will
> > > > > > > stabilize the Tiered Storage early access release [2] [3], and will
> > > > > > > upgrade dependencies to fix CVEs such as Netty [4] and ZooKeeper [5].
> > > > > > >
> > > > > > > If no one has any objections, I will send out a release plan by 23rd
> > > > > > > Dec 2023 at the latest, with a tentative release in mid-Jan 2024. The
> > > > > > > release plan will include a list of all of the fixes we are targeting
> > > > > > > for 3.6.1 along with the detailed timeline.
> > > > > > >
> > > > > > > If anyone is interested in releasing this sooner, please feel free to
> > > > > > > take over from me.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Divij Vaidya
> > > > > > > Apache Kafka Committer
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/KAFKA-15653
> > > > > > > [2] https://issues.apache.org/jira/browse/KAFKA-15481
> > > > > > > [3] https://issues.apache.org/jira/browse/KAFKA-15695
> > > > > > > [4]

Re: [DISCUSS] KIP-974 Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Ismael Juma
Hi Krishna,

I am still finding it difficult to evaluate this choice. A couple of things
would help:

1. How much smaller is the alpine image compared to the best alternative?
2. Is there any performance impact of going with Alpine?

Ismael


On Wed, Nov 22, 2023, 8:42 AM Krishna Agarwal 
wrote:

> Hi Ismael,
> Thanks for the feedback.
>
> The Alpine image does present a few drawbacks, such as the use of musl libc
> instead of glibc, the absence of bash, and reliance on the less popular
> package manager "apk". Considering the advantage of a smaller image size,
> and that the missing packages (glibc and bash) can be installed, I have
> proposed the Alpine image as the base image. Let me know if you have any suggestions.
> I have added a detailed section on this in the KIP.
>
> Regards,
> Krishna
>
> On Wed, Nov 22, 2023 at 8:08 PM Ismael Juma  wrote:
>
> > Hi,
> >
> > One question I have is regarding the choice to use alpine - it would be
> > good to clarify if there are downsides (the upside was explained - images
> > are smaller).
> >
> > Ismael
> >
> > On Fri, Sep 8, 2023, 12:17 AM Krishna Agarwal <
> > krishna0608agar...@gmail.com>
> > wrote:
> >
> > > Hi,
> > > I want to submit a KIP to deliver an experimental Apache Kafka docker
> > > image.
> > > The proposed docker image can launch brokers with sub-second startup
> time
> > > and minimal memory footprint by leveraging a GraalVM based native Kafka
> > > binary.
> > >
> > > KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> > > >
> > >
> > > Regards,
> > > Krishna
> > >
> >
>


Re: [VOTE] KIP-974: Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Federico Valeri
Hi,

+1 (non binding)

Thanks

On Wed, Nov 22, 2023 at 3:16 PM Manikumar  wrote:
>
> Hi Krishna,
>
> Thanks for KIP.  +1 (binding).
>
>
> Thanks,
> Manikumar
>
> On Mon, Nov 20, 2023 at 11:57 AM Krishna Agarwal <
> krishna0608agar...@gmail.com> wrote:
>
> > Hi,
> > I'd like to call a vote on KIP-974 which aims to publish a docker image for
> > GraalVM based Native Kafka Broker.
> >
> > KIP -
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> >
> > Discussion thread -
> > https://lists.apache.org/thread/98wnx4w92fqj5wymkqlqyjsvzxz277hk
> >
> > Regards,
> > Krishna
> >


[DISCUSS] KIP-996: Pre-Vote

2023-11-22 Thread Alyssa Huang
Hey folks,

Starting a discussion thread for Pre-Vote design. Appreciate your comments
in advance!
https://cwiki.apache.org/confluence/display/KAFKA/KIP-996%3A+Pre-Vote

Best,
Alyssa


[jira] [Created] (KAFKA-15888) DistributedHerder log context should not use the same client ID for each Connect worker by default

2023-11-22 Thread Yash Mayya (Jira)
Yash Mayya created KAFKA-15888:
--

 Summary: DistributedHerder log context should not use the same 
client ID for each Connect worker by default
 Key: KAFKA-15888
 URL: https://issues.apache.org/jira/browse/KAFKA-15888
 Project: Kafka
  Issue Type: Bug
  Components: connect, KafkaConnect
Reporter: Yash Mayya
Assignee: Yash Mayya


By default, if there is no {{client.id}} configured on a Connect worker 
running in distributed mode, the same client ID ("connect-1") will be used in 
the log context for the DistributedHerder class in every single worker in the 
Connect cluster. This default is quite confusing and obviously not very useful. 
Further, based on how this default is configured 
([ref|https://github.com/apache/kafka/blob/150b0e8290cda57df668ba89f6b422719866de5a/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L299]),
 it seems like this might have been an unintentional bug. We could simply use 
the workerId (the advertised host name and port of the worker) by default 
instead, which should be unique for each worker in a cluster.
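A hedged sketch of the suggested default (illustrative only, not the actual DistributedHerder code):
{code:java}
import org.apache.kafka.common.utils.LogContext;

class HerderLogContextSketch {
    // Derive the log-context prefix from the workerId (advertised host:port),
    // which is unique per worker, instead of the shared "connect-<id>" default.
    static LogContext logContextFor(String configuredClientId, String workerId) {
        String effectiveId = (configuredClientId == null || configuredClientId.isEmpty())
                ? workerId   // e.g. "worker1.example.com:8083"
                : configuredClientId;
        return new LogContext("[Worker clientId=" + effectiveId + "] ");
    }
}
{code}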



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15887) Autocommit during close consistently fails with exception in background thread

2023-11-22 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15887:
--

 Summary: Autocommit during close consistently fails with exception 
in background thread
 Key: KAFKA-15887
 URL: https://issues.apache.org/jira/browse/KAFKA-15887
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy
Assignee: Philip Nee


When I run {{AsyncKafkaConsumerTest}}, I get this every time I call close:

{code:java}
java.lang.IndexOutOfBoundsException: Index: 0
at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)
at 
java.base/java.util.Collections$UnmodifiableList.get(Collections.java:1310)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.findCoordinatorSync(ConsumerNetworkThread.java:302)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.ensureCoordinatorReady(ConsumerNetworkThread.java:288)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.maybeAutoCommitAndLeaveGroup(ConsumerNetworkThread.java:276)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.cleanup(ConsumerNetworkThread.java:257)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.run(ConsumerNetworkThread.java:101)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-11-22 Thread Artem Livshits
Hi Justine,

After thinking a bit about supporting atomic dual writes for Kafka + a NoSQL
database, I came to the conclusion that we do need to bump the epoch even
with InitProducerId(keepPreparedTxn=true).  As I described in my previous
email, we wouldn't need to bump the epoch to protect from zombies, so that
reasoning is still true.  But we cannot protect from split-brain scenarios
when two or more instances of a producer with the same transactional id try
to produce at the same time.  The dual-write example for SQL databases (
https://github.com/apache/kafka/pull/14231/files) doesn't have a
split-brain problem because execution is protected by the update lock on
the transaction state record; however, NoSQL databases may not have this
protection (I'll write an example for NoSQL database dual-write soon).

In a nutshell, here is an example of a split-brain scenario:

   1. (instance1) InitProducerId(keepPreparedTxn=true), got epoch=42
   2. (instance2) InitProducerId(keepPreparedTxn=true), got epoch=42
   3. (instance1) CommitTxn, epoch bumped to 43
   4. (instance2) CommitTxn, this is considered a retry, so it got epoch 43
   as well
   5. (instance1) Produce messageA w/sequence 1
   6. (instance2) Produce messageB w/sequence 1, this is considered a
   duplicate
   7. (instance2) Produce messageC w/sequence 2
   8. (instance1) Produce messageD w/sequence 2, this is considered a
   duplicate

Now if either of those commit the transaction, it would have a mix of
messages from the two instances (messageA and messageC).  With the proper
epoch bump, instance1 would get fenced at step 3.
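
To make the split-brain scenario concrete, here is a toy Java model of the
leader-side fencing check (illustrative only, not the actual partition-leader
code): without an epoch bump in InitProducerId(keepPreparedTxn=true), both
instances hold epoch 42 and both produce successfully, so their messages can
interleave; with the bump, the stale instance is rejected.

import java.util.HashMap;
import java.util.Map;

class EpochFencingSketch {
    private final Map<Long, Short> lastSeenEpoch = new HashMap<>();

    // Returns true if the produce request is accepted, false if fenced.
    boolean tryProduce(long producerId, short epoch) {
        short current = lastSeenEpoch.getOrDefault(producerId, (short) -1);
        if (epoch < current) {
            return false;                  // stale epoch -> fenced
        }
        lastSeenEpoch.put(producerId, epoch);
        return true;
    }

    public static void main(String[] args) {
        EpochFencingSketch leader = new EpochFencingSketch();
        long pid = 1000L;

        // No bump: both instances hold epoch 42, so both are accepted.
        System.out.println(leader.tryProduce(pid, (short) 42)); // instance1: true
        System.out.println(leader.tryProduce(pid, (short) 42)); // instance2: true -> interleaving
        // With a bump: instance2 moves to 43, and instance1's epoch 42 is fenced.
        System.out.println(leader.tryProduce(pid, (short) 43)); // instance2: true
        System.out.println(leader.tryProduce(pid, (short) 42)); // instance1: false
    }
}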

In order to update the epoch in InitProducerId(keepPreparedTxn=true), we need to
preserve the ongoing transaction's epoch (and producerId, if the epoch
overflows), because we'd need to make a correct decision when we compare
the PreparedTxnState that we read from the database with the (producerId,
epoch) of the ongoing transaction.

I've updated the KIP with the following:

   - Ongoing transaction now has 2 (producerId, epoch) pairs -- one pair
   describes the ongoing transaction, the other pair describes expected epoch
   for operations on this transactional id
   - InitProducerIdResponse now returns 2 (producerId, epoch) pairs
   - TransactionalLogValue now has 2 (producerId, epoch) pairs, the new
   values added as tagged fields, so it's easy to downgrade
   - Added a note about downgrade in the Compatibility section
   - Added a rejected alternative

-Artem

On Fri, Oct 6, 2023 at 5:16 PM Artem Livshits 
wrote:

> Hi Justine,
>
> Thank you for the questions.  Currently (pre-KIP-939) we always bump the
> epoch on InitProducerId and abort an ongoing transaction (if any).  I
> expect this behavior will continue with KIP-890 as well.
>
> With KIP-939 we need to support the case when the ongoing transaction
> needs to be preserved when keepPreparedTxn=true.  Bumping epoch without
> aborting or committing a transaction is tricky because epoch is a short
> value and it's easy to overflow.  Currently, the overflow case is handled
> by aborting the ongoing transaction, which would send out transaction
> markers with epoch=Short.MAX_VALUE to the partition leaders, which would
> fence off any messages with the producer id that started the transaction
> (they would have epoch that is less than Short.MAX_VALUE).  Then it is safe
> to allocate a new producer id and use it in new transactions.
>
> We could say that maybe when keepPreparedTxn=true we bump epoch unless it
> leads to overflow, and don't bump epoch in the overflow case.  I don't
> think it's a good solution because if it's not safe to keep the same epoch
> when keepPreparedTxn=true, then we must handle the epoch overflow case as
> well.  So either we should convince ourselves that it's safe to keep the
> epoch and do it in the general case, or we always bump the epoch and handle
> the overflow.
>
> With KIP-890, we bump the epoch on every transaction commit / abort.  This
> guarantees that even if InitProducerId(keepPreparedTxn=true) doesn't
> increment epoch on the ongoing transaction, the client will have to call
> commit or abort to finish the transaction and will increment the epoch (and
> handle epoch overflow, if needed).  If the ongoing transaction was in a bad
> state and had some zombies waiting to arrive, the abort operation would
> fence them because with KIP-890 every abort would bump the epoch.
>
> We could also look at this from the following perspective.  With KIP-890,
> zombies won't be able to cross transaction boundaries; each transaction
> completion creates a boundary and any activity in the past gets confined in
> the boundary.  Then data in any partition would look like this:
>
> 1. message1, epoch=42
> 2. message2, epoch=42
> 3. message3, epoch=42
> 4. marker (commit or abort), epoch=43
>
> Now if we inject steps 3a and 3b like this:
>
> 1. message1, epoch=42
> 2. message2, epoch=42
> 3. message3, epoch=42
> 3a. crash
> 3b. InitProducerId(keepPreparedTxn=true

Re: [VOTE] 3.5.2 RC1

2023-11-22 Thread Federico Valeri
Hi Luke,

- Compiled from source (Java 17 and Scala 2.13)
- Ran unit and integration tests
- Ran custom client apps using staging artifacts

+1 (non binding)

Thanks
Fede

On Wed, Nov 22, 2023 at 2:44 PM Josep Prat  wrote:
>
> Hi Luke,
>
> Thanks for running the release.
> I did the following:
> - Verified artifact's signatures and hashes
> - Checked JavaDoc (with navigation to Oracle JavaDoc)
> - Compiled source code
> - Run unit tests and integration tests
> - Run getting started with ZK and KRaft
>
> It gets a +1 from my side (non-binding)
>
> Best,
>
> On Tue, Nov 21, 2023 at 11:09 AM Luke Chen  wrote:
>
> > Hello Kafka users, developers and client-developers,
> >
> > This is the first candidate for release of Apache Kafka 3.5.2.
> >
> > This is a bugfix release with several fixes since the release of 3.5.1,
> > including dependency version bumps for CVEs.
> >
> > Release notes for the 3.5.2 release:
> > https://home.apache.org/~showuon/kafka-3.5.2-rc1/RELEASE_NOTES.html
> >
> > *** Please download, test and vote by Nov. 28.
> >
> > Kafka's KEYS file containing PGP keys we use to sign the release:
> > https://kafka.apache.org/KEYS
> >
> > * Release artifacts to be voted upon (source and binary):
> > https://home.apache.org/~showuon/kafka-3.5.2-rc1/
> >
> > * Maven artifacts to be voted upon:
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >
> > * Javadoc:
> > https://home.apache.org/~showuon/kafka-3.5.2-rc1/javadoc/
> >
> > * Tag to be voted upon (off 3.5 branch) is the 3.5.2 tag:
> > https://github.com/apache/kafka/releases/tag/3.5.2-rc1
> >
> > * Documentation:
> > https://kafka.apache.org/35/documentation.html
> >
> > * Protocol:
> > https://kafka.apache.org/35/protocol.html
> >
> > * Successful Jenkins builds for the 3.5 branch:
> > Unit/integration tests:
> > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.5/98/
> > There are some flaky tests, including the testSingleIP test failure. It
> > failed because of some infra change and we fixed it recently.
> >
> > System tests: running, will update the results later.
> >
> >
> >
> > Thank you.
> > Luke
> >
>
>
> --
> [image: Aiven] 
>
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io    |   
>      
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B


Re: [DISCUSS] KIP-974 Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Krishna Agarwal
Hi Ismael,
Thanks for the feedback.

The Alpine image does present a few drawbacks, such as the use of musl libc
instead of glibc, the absence of bash, and reliance on the less popular
package manager "apk". Considering the advantage of a smaller image size,
and that the missing packages (glibc and bash) can be installed, I have
proposed the Alpine image as the base image. Let me know if you have any suggestions.
I have added a detailed section on this in the KIP.

Regards,
Krishna

On Wed, Nov 22, 2023 at 8:08 PM Ismael Juma  wrote:

> Hi,
>
> One question I have is regarding the choice to use alpine - it would be
> good to clarify if there are downsides (the upside was explained - images
> are smaller).
>
> Ismael
>
> On Fri, Sep 8, 2023, 12:17 AM Krishna Agarwal <
> krishna0608agar...@gmail.com>
> wrote:
>
> > Hi,
> > I want to submit a KIP to deliver an experimental Apache Kafka docker
> > image.
> > The proposed docker image can launch brokers with sub-second startup time
> > and minimal memory footprint by leveraging a GraalVM based native Kafka
> > binary.
> >
> > KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> > >
> >
> > Regards,
> > Krishna
> >
>


[jira] [Created] (KAFKA-15886) Always specify directories for new partition registrations

2023-11-22 Thread Igor Soarez (Jira)
Igor Soarez created KAFKA-15886:
---

 Summary: Always specify directories for new partition registrations
 Key: KAFKA-15886
 URL: https://issues.apache.org/jira/browse/KAFKA-15886
 Project: Kafka
  Issue Type: Sub-task
Reporter: Igor Soarez


When creating partition registrations, directories must always be defined.

If creating a partition from a PartitionRecord or PartitionChangeRecord from an 
older version that does not support directory assignments, then 
DirectoryId.MIGRATING is assumed.

If creating a new partition, or triggering a change in assignment, 
DirectoryId.UNASSIGNED should be specified, unless the target broker has a 
single online directory registered, in which case the replica should be 
assigned directly to that single directory.
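
A minimal sketch of the rule described above; the sentinel values and method name are placeholders for illustration, not the actual controller code:
{code:java}
import java.util.List;
import java.util.UUID;

class DirectoryAssignmentSketch {
    // Placeholder sentinels standing in for DirectoryId.MIGRATING / UNASSIGNED.
    static final UUID MIGRATING  = new UUID(0L, 1L);
    static final UUID UNASSIGNED = new UUID(0L, 0L);

    // MIGRATING for records that predate directory assignments; otherwise
    // UNASSIGNED, unless the broker has exactly one online directory.
    static UUID chooseDirectory(boolean recordSupportsDirectories, List<UUID> onlineDirs) {
        if (!recordSupportsDirectories) {
            return MIGRATING;
        }
        return onlineDirs.size() == 1 ? onlineDirs.get(0) : UNASSIGNED;
    }
}
{code}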



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #118

2023-11-22 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 307496 lines...]

(Gradle test output omitted: :core:test / Gradle Test Executor 90 / KafkaZkClientTest runs; all listed tests PASSED before the log was truncated.)

Re: [DISCUSS] KIP-974 Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Ismael Juma
Hi,

One question I have is regarding the choice to use alpine - it would be
good to clarify if there are downsides (the upside was explained - images
are smaller).

Ismael

On Fri, Sep 8, 2023, 12:17 AM Krishna Agarwal 
wrote:

> Hi,
> I want to submit a KIP to deliver an experimental Apache Kafka docker
> image.
> The proposed docker image can launch brokers with sub-second startup time
> and minimal memory footprint by leveraging a GraalVM based native Kafka
> binary.
>
> KIP-974: Docker Image for GraalVM based Native Kafka Broker
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> >
>
> Regards,
> Krishna
>


[jira] [Created] (KAFKA-15885) Reduce lock contention when cleaning topics

2023-11-22 Thread Krzysztof Piecuch (Jira)
Krzysztof Piecuch created KAFKA-15885:
-

 Summary: Reduce lock contention when cleaning topics
 Key: KAFKA-15885
 URL: https://issues.apache.org/jira/browse/KAFKA-15885
 Project: Kafka
  Issue Type: Improvement
  Components: log cleaner
Reporter: Krzysztof Piecuch


Somewhat similar to KAFKA-14213, there are a couple of subroutines which 
require the same lock, which throttles compaction speed and limits 
parallelism.

 

There are a couple of problems here:
 # LogCleanerManager.grabFilthiestCompactedLog - iterates through a list of 
partitions multiple times, all of this while holding a lock
 # LogCleanerManager.grabFilthiestCompactedLog doesn't cache anything and 
returns only 1 item at a time - the method is invoked every time a cleaner thread 
asks for a new partition to compact
 # LogCleanerManager.checkCleaningAborted - a quick check which:

 ## shares a lock with grabFilthiestCompactedLog
 ## is executed every time a LogCleaner reads bufsize worth of data to compact
 # LogCleaner's bufsize is limited to 1G / (number of log cleaner threads)

 

Here's the scenario where this design falls short:
 * I have 15k partitions
 * all of which need to be compacted fairly often but it doesn't take a lot of 
time to compact them
 * Most of the cputime spent by cleaner threads is spent on 
grabFilthiestCompactedLog
 ** so the other cleaners can't do anything since they need to acquire a lock 
to read data to compact as per 3.1. and 3.2.
 ** because of 4, log cleaners run out of work to do as soon as 
grabFilthiestLog is called
 * Negative performance scaling - increasing # of log cleaner threads decreases 
log cleaner's bufsize which makes them hammer the lock mentioned in 3.1. and 
3.2. more often

 

I suggest:
 * making LogCleanerManager use more fine-grained locking (e.g. an RW lock for 
the checkCleaningAborted data structures) to decrease the effect of negative 
performance scaling - see the sketch below
 * making LogCleanerManager.grabFilthiestLog faster on average:
 ** we don't need grabFilthiestLog to be 100% accurate
 ** we can try caching candidates for the "filthiestLog" and re-calculating the 
cache every minute or so
 ** we can change the algorithm to probabilistic sampling (sample 100 topics and 
pick the worst one?) or even round-robin
 * alternatively, we could allow LogCleaner's bufsize to be set to higher values
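
A minimal sketch of the fine-grained locking suggestion, assuming a read-write lock guards the abort flags; this is illustrative Java, not the actual (Scala) LogCleanerManager code:
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CleaningAbortSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> abortedPartitions = new HashSet<>();

    // Hot path: called for every bufsize worth of data a cleaner thread reads.
    // Multiple cleaner threads can hold the read lock concurrently.
    void checkCleaningAborted(String topicPartition) {
        lock.readLock().lock();
        try {
            if (abortedPartitions.contains(topicPartition)) {
                throw new IllegalStateException("Cleaning aborted for " + topicPartition);
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    // Cold path: mutations still take the exclusive write lock.
    void abortCleaning(String topicPartition) {
        lock.writeLock().lock();
        try {
            abortedPartitions.add(topicPartition);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}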



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-7631) NullPointerException when SCRAM is allowed but ScramLoginModule is not in broker's jaas.conf

2023-11-22 Thread Andrew Olson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Olson resolved KAFKA-7631.
-
Resolution: Fixed

Marking as resolved since I believe it is.

> NullPointerException when SCRAM is allowed but ScramLoginModule is not in 
> broker's jaas.conf
> ---
>
> Key: KAFKA-7631
> URL: https://issues.apache.org/jira/browse/KAFKA-7631
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 2.0.0, 2.5.0
>Reporter: Andras Beni
>Priority: Minor
> Fix For: 2.7.0
>
> Attachments: KAFKA-7631.patch
>
>
> When a user wants to use delegation tokens and lists {{SCRAM}} in 
> {{sasl.enabled.mechanisms}}, but does not add {{ScramLoginModule}} to the 
> broker's JAAS configuration, a null pointer exception is thrown on the broker 
> side and the connection is closed.
> A meaningful error message should be logged and sent back to the client.
> {code}
> java.lang.NullPointerException
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.handleSaslToken(SaslServerAuthenticator.java:376)
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:262)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:127)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:489)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:427)
> at kafka.network.Processor.poll(SocketServer.scala:679)
> at kafka.network.Processor.run(SocketServer.scala:584)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-974: Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Manikumar
Hi Krishna,

Thanks for KIP.  +1 (binding).


Thanks,
Manikumar

On Mon, Nov 20, 2023 at 11:57 AM Krishna Agarwal <
krishna0608agar...@gmail.com> wrote:

> Hi,
> I'd like to call a vote on KIP-974 which aims to publish a docker image for
> GraalVM based Native Kafka Broker.
>
> KIP -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
>
> Discussion thread -
> https://lists.apache.org/thread/98wnx4w92fqj5wymkqlqyjsvzxz277hk
>
> Regards,
> Krishna
>


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2408

2023-11-22 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-15883) Implement RemoteCopyLagBytes

2023-11-22 Thread Christo Lolov (Jira)
Christo Lolov created KAFKA-15883:
-

 Summary: Implement RemoteCopyLagBytes
 Key: KAFKA-15883
 URL: https://issues.apache.org/jira/browse/KAFKA-15883
 Project: Kafka
  Issue Type: Sub-task
Reporter: Christo Lolov
Assignee: Christo Lolov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.5.2 RC1

2023-11-22 Thread Josep Prat
Hi Luke,

Thanks for running the release.
I did the following:
- Verified artifact's signatures and hashes
- Checked JavaDoc (with navigation to Oracle JavaDoc)
- Compiled source code
- Run unit tests and integration tests
- Run getting started with ZK and KRaft

It gets a +1 from my side (non-binding)

Best,

On Tue, Nov 21, 2023 at 11:09 AM Luke Chen  wrote:

> Hello Kafka users, developers and client-developers,
>
> This is the first candidate for release of Apache Kafka 3.5.2.
>
> This is a bugfix release with several fixes since the release of 3.5.1,
> including dependency version bumps for CVEs.
>
> Release notes for the 3.5.2 release:
> https://home.apache.org/~showuon/kafka-3.5.2-rc1/RELEASE_NOTES.html
>
> *** Please download, test and vote by Nov. 28.
>
> Kafka's KEYS file containing PGP keys we use to sign the release:
> https://kafka.apache.org/KEYS
>
> * Release artifacts to be voted upon (source and binary):
> https://home.apache.org/~showuon/kafka-3.5.2-rc1/
>
> * Maven artifacts to be voted upon:
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
>
> * Javadoc:
> https://home.apache.org/~showuon/kafka-3.5.2-rc1/javadoc/
>
> * Tag to be voted upon (off 3.5 branch) is the 3.5.2 tag:
> https://github.com/apache/kafka/releases/tag/3.5.2-rc1
>
> * Documentation:
> https://kafka.apache.org/35/documentation.html
>
> * Protocol:
> https://kafka.apache.org/35/protocol.html
>
> * Successful Jenkins builds for the 3.5 branch:
> Unit/integration tests:
> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.5/98/
> There are some flaky tests, including the testSingleIP test failure. It
> failed because of some infra change and we fixed it recently.
>
> System tests: running, will update the results later.
>
>
>
> Thank you.
> Luke
>


-- 
[image: Aiven] 

*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io    |   
     
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


[jira] [Created] (KAFKA-15882) Scheduled nightly github actions workflow for CVE reports on published docker images

2023-11-22 Thread Vedarth Sharma (Jira)
Vedarth Sharma created KAFKA-15882:
--

 Summary: Scheduled nightly github actions workflow for CVE reports 
on published docker images
 Key: KAFKA-15882
 URL: https://issues.apache.org/jira/browse/KAFKA-15882
 Project: Kafka
  Issue Type: Sub-task
Reporter: Vedarth Sharma
Assignee: Vedarth Sharma


This scheduled github actions workflow will check supported published docker 
images for CVEs and generate reports.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15881) Update release.py script to include docker image

2023-11-22 Thread Vedarth Sharma (Jira)
Vedarth Sharma created KAFKA-15881:
--

 Summary: Update release.py script to include docker image
 Key: KAFKA-15881
 URL: https://issues.apache.org/jira/browse/KAFKA-15881
 Project: Kafka
  Issue Type: Sub-task
Reporter: Vedarth Sharma
Assignee: Vedarth Sharma


Make changes in release.py to include building and pushing the RC docker image to the 
RM's Docker Hub account, and include the details in the VOTE email template.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15880) Add github actions workflow for promoting RC docker image

2023-11-22 Thread Vedarth Sharma (Jira)
Vedarth Sharma created KAFKA-15880:
--

 Summary: Add github actions workflow for promoting RC docker image
 Key: KAFKA-15880
 URL: https://issues.apache.org/jira/browse/KAFKA-15880
 Project: Kafka
  Issue Type: Sub-task
Reporter: Vedarth Sharma
Assignee: Vedarth Sharma


The RC docker image needs to be pulled and pushed to apache/kafka through this 
GitHub Actions workflow.

We need to ensure that only PMC members can access this workflow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15879) Add documentation for the Docker image

2023-11-22 Thread Vedarth Sharma (Jira)
Vedarth Sharma created KAFKA-15879:
--

 Summary: Add documentation for the Docker image
 Key: KAFKA-15879
 URL: https://issues.apache.org/jira/browse/KAFKA-15879
 Project: Kafka
  Issue Type: Sub-task
Reporter: Vedarth Sharma
Assignee: Vedarth Sharma


Update the quickstart with Docker image details and add a Docker section to the 
getting started section.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15878) KIP-768: Extend support for opaque (i.e. non-JWT) tokens in SASL/OAUTHBEARER

2023-11-22 Thread Anuj Sharma (Jira)
Anuj Sharma created KAFKA-15878:
---

 Summary: KIP-768: Extend support for opaque (i.e. non-JWT) tokens 
in SASL/OAUTHBEARER
 Key: KAFKA-15878
 URL: https://issues.apache.org/jira/browse/KAFKA-15878
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: Anuj Sharma


h1. Overview
 * This issue pertains to the 
[SASL/OAUTHBEARER|https://kafka.apache.org/documentation/#security_sasl_oauthbearer] 
mechanism of Kafka authentication.
 * Kafka clients can use the 
[SASL/OAUTHBEARER|https://kafka.apache.org/documentation/#security_sasl_oauthbearer] 
mechanism by overriding the [custom callback 
handlers|https://kafka.apache.org/documentation/#security_sasl_oauthbearer_prod].
 * 
[KIP-768|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575], 
available from v3.1, further extends the mechanism with a production-grade 
implementation.
 * Kafka's 
[SASL/OAUTHBEARER|https://kafka.apache.org/documentation/#security_sasl_oauthbearer] 
mechanism currently *rejects non-JWT (i.e. opaque) tokens*. This is because it 
uses a more restrictive set of characters than what 
[RFC-6750|https://datatracker.ietf.org/doc/html/rfc6750#section-2.1] 
recommends.
 * This JIRA can be considered an extension of 
[KIP-768|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575] 
to support opaque tokens in addition to JWT tokens.

In summary, the following character set should be supported, as per the RFC:
{code:java}
1*( ALPHA / DIGIT /
   "-" / "." / "_" / "~" / "+" / "/" ) *"="
{code}
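
A hedged sketch of a validation against that grammar; this is illustrative only and not Kafka's actual token validation code:
{code:java}
import java.util.regex.Pattern;

class BearerTokenCharsetCheck {
    // RFC 6750 b64token: 1*( ALPHA / DIGIT / "-" / "." / "_" / "~" / "+" / "/" ) *"="
    private static final Pattern B64TOKEN = Pattern.compile("[A-Za-z0-9\\-._~+/]+=*");

    static boolean isValidB64Token(String token) {
        return B64TOKEN.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidB64Token("eyJhbGciOiJIUzI1NiJ9.e30.abc"));      // JWT-style: true
        System.out.println(isValidB64Token("AQIC5wM2LY4Sfcz~opaque+token/1=="));  // opaque: true
    }
}
{code}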
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: How Kafka handle partition leader change?

2023-11-22 Thread Jack Vanlightly
If you want to understand how the replication protocol works, how it can be
configured for consistency, and how it can be configured for availability, then
I have written up a more formal description of the protocol along with
TLA+ specifications. These should answer most of your questions; if not,
please do come back and ask further questions.

How it works today:
- Formal description:
https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/v3.5/description/0_kafka_replication_protocol.md
- TLA+ specification:
https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/v3.5/kafka_replication_v3_5.tla

How it will work when KIP-966 is implemented:
- Formal description:
https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md
- TLA+ specification:
https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/kafka_replication_kip_966.tla

Hope that helps
Jack


On Wed, Nov 22, 2023 at 6:27 AM De Gao  wrote:

> Looks like the core of the problem should still be the juggling game of
> consistency, availability and partition tolerance.  If we want the cluster to
> still work when brokers have inconsistent information due to a network
> partition, we have to choose between consistency and availability.
> My proposal is not about fixing the message loss. I will share it when ready.
> Thanks Andrew.
> 
> From: Andrew Grant 
> Sent: 21 November 2023 12:35
> To: dev@kafka.apache.org 
> Subject: Re: How Kafka handle partition leader change?
>
> Hey De Gao,
>
> Message loss or duplication can actually happen even without a leadership
> change for a partition. For example, if there are network issues and the
> producer never gets the ack from the server, it’ll retry and cause
> duplicates. Message loss can usually occur when you use the acks=1 config -
> mostly you’d lose messages after a leadership change, but in theory, if the
> leader was restarted, the page cache was lost, and it became leader again, we
> could lose the message if it wasn’t replicated soon enough.
>
> You might be right it’s more likely to occur during leadership change
> though - not 100% sure myself on that.
>
> Point being, the idempotent producer really is the way to write once and
> only once as far as I’m aware.
>
> If you have any suggestions for improvements I’m sure the community would
> love to hear them! It’s possible there are ways to make leadership changes
> more seamless and at least reduce the probability of duplicates or loss.
> Not sure myself. I’ve wondered before if the older leader could reroute
> messages for a small period of time until the client knew the new leader
> for example.
>
> Andrew
>
> Sent from my iPhone
>
> > On Nov 21, 2023, at 1:42 AM, De Gao  wrote:
> >
> > I am asking this because I want to propose a change to Kafka. But it looks
> > like in certain scenarios it is very hard not to lose or duplicate
> > messages. I wonder in what scenarios we can accept that, and where to draw
> > the line?
> >
> > 
> > From: De Gao 
> > Sent: 21 November 2023 6:25
> > To: dev@kafka.apache.org 
> > Subject: Re: How Kafka handle partition leader change?
> >
> > Thanks Andrew. Sounds like the leadership change on the Kafka side is a
> > 'best effort' to avoid message duplication or loss. Can we say that message
> > loss is very likely during a leadership change unless the producer uses
> > idempotency? Is this a general situation where there is no intent to provide
> > data integrity guarantees upon metadata change?
> > 
> > From: Andrew Grant 
> > Sent: 20 November 2023 12:26
> > To: dev@kafka.apache.org 
> > Subject: Re: How Kafka handle partition leader change?
> >
> > Hey De Gao,
> >
> > The controller is the one that always elects a new leader. When that
> happens that metadata is changed on the controller and once committed it’s
> broadcast to all brokers in the cluster. In KRaft this would be via a
> PartitonChange record that each broker will fetch from the controller. In
> ZK it’d be via an RPC from the controller to the broker.
> >
> > In either case each broker might get the notification at a different
> time. No ordering guarantee among the brokers. But eventually they’ll all
> know the new leader which means eventually the Produce will fail with
> NotLeader and the client will refresh its metadata and find out the new one.
> >
> > In between all that leadership movement, there are various ways messages
> can get duplicated or lost. However if you use the idempotent producer I
> believe you actually won’t see dupes or missing messages so if that’s an
> important requirement you could look into that. The producer is designed to
> retry in general and when you use the idempotent producer some extra
> metadata is sent around to dedupe any messages server-side that were sent
> multiple times by the client.
> >
> > If you’re intereste
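
For reference, a minimal sketch of the idempotent-producer configuration discussed
above; the broker address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Retried batches are deduplicated server-side via producer id + sequence numbers.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Idempotence requires acks=all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}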

[DISCUSS] KIP-1007: Introduce Remote Storage Not Ready Exception

2023-11-22 Thread Kamal Chandraprakash
Hi,

I would like to start a discussion to introduce a new error code for
retriable remote storage errors. Please take a look at the proposal:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1007%3A+Introduce+Remote+Storage+Not+Ready+Exception


[jira] [Created] (KAFKA-15877) Support change of temporality in Java client

2023-11-22 Thread Apoorv Mittal (Jira)
Apoorv Mittal created KAFKA-15877:
-

 Summary: Support change of temporality in Java client 
 Key: KAFKA-15877
 URL: https://issues.apache.org/jira/browse/KAFKA-15877
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apoorv Mittal
Assignee: Apoorv Mittal


Details: https://github.com/apache/kafka/pull/14620#discussion_r1401554867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15876) Introduce Remote Storage Not Ready Exception

2023-11-22 Thread Kamal Chandraprakash (Jira)
Kamal Chandraprakash created KAFKA-15876:


 Summary: Introduce Remote Storage Not Ready Exception
 Key: KAFKA-15876
 URL: https://issues.apache.org/jira/browse/KAFKA-15876
 Project: Kafka
  Issue Type: Task
Reporter: Kamal Chandraprakash
Assignee: Kamal Chandraprakash


When tiered storage is enabled on the cluster, the Kafka broker has to build the 
remote log metadata on node restart for all the partitions for which it is either 
leader or follower. The remote log metadata is built in an asynchronous fashion and 
does not interfere with the broker startup path. Once the broker comes online, it 
cannot handle the client requests (FETCH and LIST_OFFSETS) that access remote 
storage until the metadata gets built for those partitions. Currently, we are 
returning a ReplicaNotAvailableException back to the client so that it will 
retry after some time.

[ReplicaNotAvailableException|https://sourcegraph.com/github.com/apache/kafka@254335d24ab6b6d13142dcdb53fec3856c16de9e/-/blob/clients/src/main/java/org/apache/kafka/common/errors/ReplicaNotAvailableException.java]
 is applicable when a reassignment is in progress, and it was effectively 
deprecated in favour of NotLeaderOrFollowerException 
([PR#8979|https://github.com/apache/kafka/pull/8979]). It would be good to introduce 
an appropriate retriable exception for remote storage errors to denote that the broker 
is not ready to accept the client requests yet.
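
A minimal sketch of what such an exception could look like, following the retriable-exception pattern in the Java client; the final class name and error code are up to KIP-1007, so this is illustrative only:
{code:java}
import org.apache.kafka.common.errors.RetriableException;

// Illustrative only: a retriable error that clients can back off and retry on
// while the broker finishes building remote log metadata for the partition.
public class RemoteStorageNotReadyException extends RetriableException {

    public RemoteStorageNotReadyException(String message) {
        super(message);
    }

    public RemoteStorageNotReadyException(String message, Throwable cause) {
        super(message, cause);
    }
}
{code}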



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-974 Docker Image for GraalVM based Native Kafka Broker

2023-11-22 Thread Krishna Agarwal
Hi Justine,
Thanks for the feedback.

   1. I have added the name of the other image in the KIP.
   2. By experimental, we mean the docker image is intended for local
   development and testing usage.
   The GraalVM Native Image tool is still maturing, hence the usage of
   this image in production can’t be recommended.
   Regarding upgrades, we will be running the existing system tests
   covering upgrade (JVM to Native) and downgrade.

Regards,
Krishna

On Wed, Nov 22, 2023 at 6:56 AM Justine Olshan 
wrote:

> Hey -- just catching up here, since I saw the vote thread. I had 2
> questions that I'm not sure got answered from the previous discussion.
>
> 1. Can we update the KIP to include the name of the other image so if
> someone stumbles across this KIP they know the name of the other one?
> 2. Did we cover what "experimental" means here? I think Ismael asked
> > Can we talk a bit more about the compatibility guarantees while this
> image is still experimental?
> I took this to mean should we be able to upgrade to or from this image or
> if clusters running on it can only run on it? Or if there are no
> guarantees about the upgrade/downgrade story.
>
> Thanks,
> Justine
>
> On Sun, Nov 19, 2023 at 7:54 PM Krishna Agarwal <
> krishna0608agar...@gmail.com> wrote:
>
> > Hi,
> > Thanks for the insightful feedback on this KIP.
> > As there are no ongoing discussions, I'm considering moving into the
> voting
> > process.
> > Your continued input is greatly appreciated!
> >
> > Regards,
> > Krishna
> >
> > On Fri, Sep 8, 2023 at 12:47 PM Krishna Agarwal <
> > krishna0608agar...@gmail.com> wrote:
> >
> > > Hi,
> > > I want to submit a KIP to deliver an experimental Apache Kafka docker
> > > image.
> > > The proposed docker image can launch brokers with sub-second startup
> time
> > > and minimal memory footprint by leveraging a GraalVM based native Kafka
> > > binary.
> > >
> > > KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> > >
> > >
> > > Regards,
> > > Krishna
> > >
> >
>


[DISCUSS] KIP-956: Tiered Storage Quotas

2023-11-22 Thread Abhijeet Kumar
Hi All,

I have created KIP-956 to define read and write quotas for tiered storage.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas

Feedback and suggestions are welcome.

Regards,
Abhijeet.


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-22 Thread Ismael Juma
I think it breaks the Jenkins output otherwise. Feel free to test it via a
PR.

Ismael

On Wed, Nov 22, 2023, 12:42 AM David Jacot 
wrote:

> Hi Ismael,
>
> No, I was not aware of KAFKA-12216. My understanding is that we could still
> do it without the JUnitFlakyTestDataPublisher plugin and we could use
> gradle enterprise for this. Or do you think that reporting the flaky tests
> in the build results is required?
>
> David
>
> On Wed, Nov 22, 2023 at 9:35 AM Ismael Juma  wrote:
>
> > Hi David,
> >
> > Did you take a look at https://issues.apache.org/jira/browse/KAFKA-12216
> ?
> > I
> > looked into this option already (yes, there isn't much that we haven't
> > considered in this space).
> >
> > Ismael
> >
> > On Wed, Nov 22, 2023 at 12:24 AM David Jacot  >
> > wrote:
> >
> > > Hi all,
> > >
> > > Thanks for the good discussion and all the comments. Overall, it seems
> > that
> > > we all agree on the bad state of our CI. That's a good first step!
> > >
> > > I have talked to a few folks this week about it and it seems that many
> > > folks (including me) are not comfortable with merging PRs at the moment
> > > because the results of our builds are so bad. I had 40+ failed tests in
> > one
> > > of my PRs, all unrelated to my changes. It is really hard to be
> > productive
> > > with this.
> > >
> > > Personally, I really want to move towards requiring a green build to
> > merge
> > > to trunk because this is a clear and binary signal. I agree that we
> need
> > to
> > > stabilize the builds before we could even require this so here is my
> > > proposal.
> > >
> > > 1) We could leverage the `reports.junitXml.mergeReruns` option in
> gradle.
> > > From the doc [1]:
> > >
> > > > When mergeReruns is enabled, if a test fails but is then retried and
> > > succeeds, its failures will be recorded as <flakyFailure> instead of
> > > <failure>, within one <testcase>. This is effectively the reporting
> > > produced by the surefire plugin of Apache Maven™ when enabling reruns.
> If
> > > your CI server understands this format, it will indicate that the test
> > was
> > > flaky. If it does not, it will indicate that the test succeeded as it
> > > will ignore the <flakyFailure> information. If the test does not
> succeed
> > > (i.e. it fails for every retry), it will be indicated as having failed
> > > whether your tool understands this format or not.
> > > > When mergeReruns is disabled (the default), each execution of a test
> > will
> > > be listed as a separate test case.
> > >
> > > It would not resolve all the flaky tests for sure but it would at least
> > > reduce the noise. I see this as a means to get to green builds faster.
> I
> > > played a bit with this setting and I discovered [2]. I was hoping that
> > [3]
> > > could help to resolve it but I need to confirm.
> > >
> > > 2) I suppose that we would still have flaky tests preventing us from
> > > getting a green build even with the setting in place. For those, I
> think
> > > that we need to review them one by one and decide whether we want to
> fix
> > or
> > > disable them. This is a short term effort to help us get to green
> builds.
> > >
> > > 3) When we get to a point where we can get green builds consistently,
> we
> > > could enforce it.
> > >
> > > 4) Flaky tests won't disappear with this. They are just hidden.
> > Therefore,
> > > we also need a process to review the flaky tests and address them.
> Here,
> > I
> > > think that we could leverage the dashboard shared by Ismael. One
> > > possibility would be to review it regularly and decide for each test
> > > whether it should be fixed, disabled or even removed.
> > >
> > > Please let me know what you think.
> > >
> > > Another angle that we could consider is improving the CI infrastructure
> > as
> > > well. I think that many of those flaky tests are due to overloaded
> > Jenkins
> > > workers. We should perhaps discuss with the infra team to see whether
> we
> > > could do something there.
> > >
> > > Best,
> > > David
> > >
> > > [1]
> > >
> https://docs.gradle.org/current/userguide/java_testing.html#mergereruns
> > > [2] https://github.com/gradle/gradle/issues/23324
> > > [3] https://github.com/apache/kafka/pull/14687
> > >
> > >
> > > On Wed, Nov 22, 2023 at 4:10 AM Ismael Juma  wrote:
> > >
> > > > Hi,
> > > >
> > > > We have a dashboard already:
> > > >
> > > > [image: image.png]
> > > >
> > > >
> > > >
> > >
> >
> https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY
> > > >
> > > > On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков  >
> > > > wrote:
> > > >
> > > >> Hello guys.
> > > >>
> > > >> I want to tell you about one more approach to deal with flaky tests.
> > > >> We adopt this approach in Apache Ignite community, so may be it can
> be
> > > >> helpful for Kafka, also.
> > > >>
> > > >> TL;DR: Apache Ignite community have a tool that provide a statistic
> of
> > > >> tests and can 

[jira] [Created] (KAFKA-15875) Snapshot class is package protected but returned in public methods

2023-11-22 Thread Josep Prat (Jira)
Josep Prat created KAFKA-15875:
--

 Summary: Snapshot class is package protected but returned in 
public methods
 Key: KAFKA-15875
 URL: https://issues.apache.org/jira/browse/KAFKA-15875
 Project: Kafka
  Issue Type: Task
Affects Versions: 3.6.0
Reporter: Josep Prat
Assignee: Josep Prat


The org.apache.kafka.timeline.Snapshot class is package protected, but it is 
part of the public API of org.apache.kafka.timeline.SnapshotRegistry. This can 
cause compilation errors if we ever try to assign the object returned by these 
methods to a variable.

org.apache.kafka.controller.OffsetControlManager calls SnapshotRegistry methods 
that return a Snapshot, and OffsetControlManager is in another package.

The SnapshotRegistry class does not seem to be public API, so I don't think 
this needs a KIP.
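
For illustration, a minimal sketch of the visibility problem (the calling class
below is invented for this example, and SnapshotRegistry#getOrCreateSnapshot is
assumed to be one of the affected methods):

    package example;  // any package other than org.apache.kafka.timeline

    import org.apache.kafka.timeline.SnapshotRegistry;

    public class SnapshotVisibilityExample {

        void demo(SnapshotRegistry registry) {
            // Calling the method and discarding the result compiles fine:
            registry.getOrCreateSnapshot(42L);

            // Assigning the result does not, because Snapshot is not visible
            // outside org.apache.kafka.timeline:
            // Snapshot snapshot = registry.getOrCreateSnapshot(42L);  // compile error
        }
    }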



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-22 Thread David Jacot
Hi Ismael,

No, I was not aware of KAFKA-12216. My understanding is that we could still
do it without the JUnitFlakyTestDataPublisher plugin and we could use
gradle enterprise for this. Or do you think that reporting the flaky tests
in the build results is required?

David

On Wed, Nov 22, 2023 at 9:35 AM Ismael Juma  wrote:

> Hi David,
>
> Did you take a look at https://issues.apache.org/jira/browse/KAFKA-12216?
> I
> looked into this option already (yes, there isn't much that we haven't
> considered in this space).
>
> Ismael
>
> On Wed, Nov 22, 2023 at 12:24 AM David Jacot 
> wrote:
>
> > Hi all,
> >
> > Thanks for the good discussion and all the comments. Overall, it seems
> that
> > we all agree on the bad state of our CI. That's a good first step!
> >
> > I have talked to a few folks this week about it and it seems that many
> > folks (including me) are not comfortable with merging PRs at the moment
> > because the results of our builds are so bad. I had 40+ failed tests in
> one
> > of my PRs, all unrelated to my changes. It is really hard to be
> productive
> > with this.
> >
> > Personally, I really want to move towards requiring a green build to
> merge
> > to trunk because this is a clear and binary signal. I agree that we need
> to
> > stabilize the builds before we could even require this so here is my
> > proposal.
> >
> > 1) We could leverage the `reports.junitXml.mergeReruns` option in gradle.
> > From the doc [1]:
> >
> > > When mergeReruns is enabled, if a test fails but is then retried and
> > succeeds, its failures will be recorded as <flakyFailure> instead of
> > <failure>, within one <testcase>. This is effectively the reporting
> > produced by the surefire plugin of Apache Maven™ when enabling reruns. If
> > your CI server understands this format, it will indicate that the test
> was
> > flaky. If it does not, it will indicate that the test succeeded as it
> > will ignore the <flakyFailure> information. If the test does not succeed
> > (i.e. it fails for every retry), it will be indicated as having failed
> > whether your tool understands this format or not.
> > > When mergeReruns is disabled (the default), each execution of a test
> will
> > be listed as a separate test case.
> >
> > It would not resolve all the flaky tests for sure but it would at least
> > reduce the noise. I see this as a means to get to green builds faster. I
> > played a bit with this setting and I discovered [2]. I was hoping that
> [3]
> > could help to resolve it but I need to confirm.
> >
> > 2) I suppose that we would still have flaky tests preventing us from
> > getting a green build even with the setting in place. For those, I think
> > that we need to review them one by one and decide whether we want to fix
> or
> > disable them. This is a short term effort to help us get to green builds.
> >
> > 3) When we get to a point where we can get green builds consistently, we
> > could enforce it.
> >
> > 4) Flaky tests won't disappear with this. They are just hidden.
> Therefore,
> > we also need a process to review the flaky tests and address them. Here,
> I
> > think that we could leverage the dashboard shared by Ismael. One
> > possibility would be to review it regularly and decide for each test
> > whether it should be fixed, disabled or even removed.
> >
> > Please let me know what you think.
> >
> > Another angle that we could consider is improving the CI infrastructure
> as
> > well. I think that many of those flaky tests are due to overloaded
> Jenkins
> > workers. We should perhaps discuss with the infra team to see whether we
> > could do something there.
> >
> > Best,
> > David
> >
> > [1]
> > https://docs.gradle.org/current/userguide/java_testing.html#mergereruns
> > [2] https://github.com/gradle/gradle/issues/23324
> > [3] https://github.com/apache/kafka/pull/14687
> >
> >
> > On Wed, Nov 22, 2023 at 4:10 AM Ismael Juma  wrote:
> >
> > > Hi,
> > >
> > > We have a dashboard already:
> > >
> > > [image: image.png]
> > >
> > >
> > >
> >
> https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY
> > >
> > > On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков 
> > > wrote:
> > >
> > >> Hello guys.
> > >>
> > >> I want to tell you about one more approach to deal with flaky tests.
> > >> We adopt this approach in Apache Ignite community, so may be it can be
> > >> helpful for Kafka, also.
> > >>
> > >> TL;DR: Apache Ignite community have a tool that provide a statistic of
> > >> tests and can tell if PR introduces new failures.
> > >>
> > >> Apache Ignite has a many tests.
> > >> Latest «Run All» contains around 75k.
> > >> Most of test has integration style therefore count of flacky are
> > >> significant.
> > >>
> > >> We build a tool - Team City Bot [1]
> > >> That provides a combined statistic of flaky tests [2]
> > >>
> > >> This tool can compare results of Run All for PR and master.
> > >> If all OK one 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-22 Thread Ismael Juma
Hi David,

Did you take a look at https://issues.apache.org/jira/browse/KAFKA-12216? I
looked into this option already (yes, there isn't much that we haven't
considered in this space).

Ismael

On Wed, Nov 22, 2023 at 12:24 AM David Jacot 
wrote:

> Hi all,
>
> Thanks for the good discussion and all the comments. Overall, it seems that
> we all agree on the bad state of our CI. That's a good first step!
>
> I have talked to a few folks this week about it and it seems that many
> folks (including me) are not comfortable with merging PRs at the moment
> because the results of our builds are so bad. I had 40+ failed tests in one
> of my PRs, all unrelated to my changes. It is really hard to be productive
> with this.
>
> Personally, I really want to move towards requiring a green build to merge
> to trunk because this is a clear and binary signal. I agree that we need to
> stabilize the builds before we could even require this so here is my
> proposal.
>
> 1) We could leverage the `reports.junitXml.mergeReruns` option in gradle.
> From the doc [1]:
>
> > When mergeReruns is enabled, if a test fails but is then retried and
> succeeds, its failures will be recorded as <flakyFailure> instead of
> <failure>, within one <testcase>. This is effectively the reporting
> produced by the surefire plugin of Apache Maven™ when enabling reruns. If
> your CI server understands this format, it will indicate that the test was
> flaky. If it does not, it will indicate that the test succeeded as it
> will ignore the <flakyFailure> information. If the test does not succeed
> (i.e. it fails for every retry), it will be indicated as having failed
> whether your tool understands this format or not.
> > When mergeReruns is disabled (the default), each execution of a test will
> be listed as a separate test case.
>
> It would not resolve all the flaky tests for sure but it would at least
> reduce the noise. I see this as a means to get to green builds faster. I
> played a bit with this setting and I discovered [2]. I was hoping that [3]
> could help to resolve it but I need to confirm.
>
> 2) I suppose that we would still have flaky tests preventing us from
> getting a green build even with the setting in place. For those, I think
> that we need to review them one by one and decide whether we want to fix or
> disable them. This is a short term effort to help us get to green builds.
>
> 3) When we get to a point where we can get green builds consistently, we
> could enforce it.
>
> 4) Flaky tests won't disappear with this. They are just hidden. Therefore,
> we also need a process to review the flaky tests and address them. Here, I
> think that we could leverage the dashboard shared by Ismael. One
> possibility would be to review it regularly and decide for each test
> whether it should be fixed, disabled or even removed.
>
> Please let me know what you think.
>
> Another angle that we could consider is improving the CI infrastructure as
> well. I think that many of those flaky tests are due to overloaded Jenkins
> workers. We should perhaps discuss with the infra team to see whether we
> could do something there.
>
> Best,
> David
>
> [1]
> https://docs.gradle.org/current/userguide/java_testing.html#mergereruns
> [2] https://github.com/gradle/gradle/issues/23324
> [3] https://github.com/apache/kafka/pull/14687
>
>
> On Wed, Nov 22, 2023 at 4:10 AM Ismael Juma  wrote:
>
> > Hi,
> >
> > We have a dashboard already:
> >
> > [image: image.png]
> >
> >
> >
> https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY
> >
> > On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков 
> > wrote:
> >
> >> Hello guys.
> >>
> >> I want to tell you about one more approach to deal with flaky tests.
> >> We adopt this approach in Apache Ignite community, so may be it can be
> >> helpful for Kafka, also.
> >>
> >> TL;DR: Apache Ignite community have a tool that provide a statistic of
> >> tests and can tell if PR introduces new failures.
> >>
> >> Apache Ignite has a many tests.
> >> Latest «Run All» contains around 75k.
> >> Most of test has integration style therefore count of flacky are
> >> significant.
> >>
> >> We build a tool - Team City Bot [1]
> >> That provides a combined statistic of flaky tests [2]
> >>
> >> This tool can compare results of Run All for PR and master.
> >> If all OK one can comment jira ticket with a visa from bot [3]
> >>
> >> Visa is a quality proof of PR for Ignite committers.
> >> And we can sort out most flaky tests and prioritize fixes with the bot
> >> statistic [2]
> >>
> >> TC bot integrated with the Team City only, for now.
> >> But, if Kafka community interested we can try to integrate it with
> >> Jenkins.
> >>
> >> [1] https://github.com/apache/ignite-teamcity-bot
> >> [2]
> https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
> >> [3]
> >>
> https://issues.apache.org/jira/browse/IGNITE-19950?focusedComment

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-22 Thread David Jacot
Hi all,

Thanks for the good discussion and all the comments. Overall, it seems that
we all agree on the bad state of our CI. That's a good first step!

I have talked to a few folks this week about it and it seems that many
folks (including me) are not comfortable with merging PRs at the moment
because the results of our builds are so bad. I had 40+ failed tests in one
of my PRs, all unrelated to my changes. It is really hard to be productive
with this.

Personally, I really want to move towards requiring a green build to merge
to trunk because this is a clear and binary signal. I agree that we need to
stabilize the builds before we could even require this so here is my
proposal.

1) We could leverage the `reports.junitXml.mergeReruns` option in gradle.
From the doc [1]:

> When mergeReruns is enabled, if a test fails but is then retried and
> succeeds, its failures will be recorded as <flakyFailure> instead of
> <failure>, within one <testcase>. This is effectively the reporting
> produced by the surefire plugin of Apache Maven™ when enabling reruns. If
> your CI server understands this format, it will indicate that the test was
> flaky. If it does not, it will indicate that the test succeeded as it
> will ignore the <flakyFailure> information. If the test does not succeed
> (i.e. it fails for every retry), it will be indicated as having failed
> whether your tool understands this format or not.
> When mergeReruns is disabled (the default), each execution of a test will
> be listed as a separate test case.

It would not resolve all the flaky tests for sure, but it would at least
reduce the noise. I see this as a means to get to green builds faster. I
played a bit with this setting and I discovered [2]. I was hoping that [3]
could help to resolve it, but I need to confirm. (A sketch of the relevant
build configuration follows after point 4 below.)

2) I suppose that we would still have flaky tests preventing us from
getting a green build even with the setting in place. For those, I think
that we need to review them one by one and decide whether we want to fix or
disable them. This is a short term effort to help us get to green builds.

3) When we get to a point where we can get green builds consistently, we
could enforce it.

4) Flaky tests won't disappear with this. They are just hidden. Therefore,
we also need a process to review the flaky tests and address them. Here, I
think that we could leverage the dashboard shared by Ismael. One
possibility would be to review it regularly and decide for each test
whether it should be fixed, disabled or even removed.
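
Coming back to point 1: a minimal sketch of what enabling mergeReruns might
look like in our Gradle build, adapted from the Gradle docs in [1] (untested
against our actual build.gradle, so treat it as an assumption rather than a
ready-made change):

    // Hypothetical build.gradle fragment; the real task wiring in our build
    // would need to be checked before adopting this.
    tasks.withType(Test).configureEach {
        // Merge retried executions of a test into a single <testcase>,
        // recording intermediate failures as <flakyFailure> elements.
        reports.junitXml.mergeReruns = true
    }

Note that this only changes how reruns are reported; the retries themselves
still have to come from somewhere, e.g. a test-retry plugin.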

Please let me know what you think.

Another angle that we could consider is improving the CI infrastructure as
well. I think that many of those flaky tests are due to overloaded Jenkins
workers. We should perhaps discuss with the infra team to see whether we
could do something there.

Best,
David

[1] https://docs.gradle.org/current/userguide/java_testing.html#mergereruns
[2] https://github.com/gradle/gradle/issues/23324
[3] https://github.com/apache/kafka/pull/14687


On Wed, Nov 22, 2023 at 4:10 AM Ismael Juma  wrote:

> Hi,
>
> We have a dashboard already:
>
> [image: image.png]
>
>
> https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY
>
> On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков 
> wrote:
>
>> Hello guys.
>>
>> I want to tell you about one more approach to dealing with flaky tests.
>> We adopted this approach in the Apache Ignite community, so maybe it can
>> be helpful for Kafka as well.
>>
>> TL;DR: the Apache Ignite community has a tool that provides test
>> statistics and can tell whether a PR introduces new failures.
>>
>> Apache Ignite has many tests.
>> The latest «Run All» contains around 75k.
>> Most of them are integration-style tests, so the number of flaky ones is
>> significant.
>>
>> We built a tool - Team City Bot [1] -
>> that provides combined statistics on flaky tests [2].
>>
>> This tool can compare the results of Run All for a PR and for master.
>> If all is OK, one can comment on the Jira ticket with a visa from the
>> bot [3].
>>
>> A visa is a proof of PR quality for Ignite committers.
>> And we can sort out the most flaky tests and prioritize fixes with the
>> bot statistics [2].
>>
>> The TC bot is integrated with TeamCity only, for now.
>> But if the Kafka community is interested, we can try to integrate it
>> with Jenkins.
>>
>> [1] https://github.com/apache/ignite-teamcity-bot
>> [2] https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
>> [3]
>> https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394
>>
>>
>>
>> > 15 нояб. 2023 г., в 09:18, Ismael Juma  написал(а):
>> >
>> > To use the pain analogy, people seem to have really good painkillers and
>> > hence they somehow don't feel the pain already. ;)
>> >
>> > The reality is that important and high quality tests will get fixed.
>> Poor
>> > quality tests (low signal to noise ratio) might not get fixed a