Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Николай Ижиков
Hello guys.

I want to tell you about one more approach to dealing with flaky tests.
We adopted this approach in the Apache Ignite community, so maybe it can be helpful 
for Kafka as well.

TL;DR: The Apache Ignite community has a tool that provides statistics on tests 
and can tell whether a PR introduces new failures.

Apache Ignite has many tests.
The latest «Run All» contains around 75k.
Most of them are integration-style tests, so the number of flaky ones is significant.

We built a tool - the TeamCity Bot [1] - 
that provides combined statistics of flaky tests [2].

This tool can compare the Run All results for a PR against master.
If everything is OK, one can comment on the Jira ticket with a "visa" from the bot [3].

A visa is a proof of PR quality for Ignite committers.
With the bot's statistics [2] we can also sort out the most flaky tests and 
prioritize fixes.

The TC bot is integrated with TeamCity only, for now.
But if the Kafka community is interested, we can try to integrate it with Jenkins.

[1] https://github.com/apache/ignite-teamcity-bot
[2] https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master=10
[3] 
https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394



> On 15 Nov 2023, at 09:18, Ismael Juma  wrote:
> 
> To use the pain analogy, people seem to have really good painkillers and
> hence they somehow don't feel the pain already. ;)
> 
> The reality is that important and high quality tests will get fixed. Poor
> quality tests (low signal to noise ratio) might not get fixed and that's ok.
> 
> I'm not opposed to marking the tests as release blockers as a starting
> point, but I'm saying it's fine if people triage them and decide they are
> not blockers. In fact, that has already happened in the past.
> 
> Ismael
> 
> On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax  wrote:
> 
>> I agree on the test gap argument. However, my worry is, if we don't
>> "force the pain", it won't get fixed at all. -- I also know, that we try
>> to find an working approach for many years...
>> 
>> My take is that if we disable a test and file a non-blocking Jira, it's
>> basically the same as just deleting the test all together and never talk
>> about it again. -- I believe, this is not want we aim for, but we aim
>> for good test coverage and a way to get these test fixed?
>> 
>> Thus IMHO we need some forcing function (either keep the tests and feel
>> the pain on every PR), or disable the test and file a blocker JIRA so
>> the pain surfaces on a release forcing us to do something about it.
>> 
>> If there is no forcing function, it basically means we are willing to
>> accept test gaps forever.
>> 
>> 
>> -Matthias
>> 
>> On 11/14/23 9:09 PM, Ismael Juma wrote:
>>> Matthias,
>>> 
>>> Flaky tests are worse than useless. I know engineers find it hard to
>>> disable them because of the supposed test gap (I find it hard too), but
>> the
>>> truth is that the test gap is already there! No-one blocks merging PRs on
>>> flaky tests, but they do get used to ignoring build failures.
>>> 
>>> The current approach has been attempted for nearly a decade and it has
>>> never worked. I think we should try something different.
>>> 
>>> When it comes to marking flaky tests as release blockers, I don't think
>>> this should be done as a general rule. We should instead assess on a case
>>> by case basis, same way we do it for bugs.
>>> 
>>> Ismael
>>> 
>>> On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax 
>> wrote:
>>> 
 Thanks for starting this discussion David! I totally agree to "no"!
 
 I think there is no excuse whatsoever for merging PRs with compilation
 errors (except an honest mistake for conflicting PRs that got merged
 interleaved). -- Every committer must(!) check the Jenkins status before
 merging to avoid such an issue.
 
 Similar for actual permanently broken tests. If there is no green build,
 and the same test failed across multiple Jenkins runs, a committer
 should detect this and cannot merge a PR.
 
 Given the current state of the CI pipeline, it seems possible to get
 green runs, and thus I support the policy (that we actually always had)
 to only merge if there is at least one green build. If committers got
 sloppy about this, we need to call it out and put a hold on this
>> practice.
 
 (The only exception from the above policy would be a very unstable
 status for which getting a green build is not possible at all, due to
 too many flaky tests -- for this case, I would accept to merge even
 there is no green build, but committer need to manually ensure that
 every test did pass in at least one test run. -- We had this in the
 past, but we I don't think we are in such a bad situation right now).
 
 About disabling tests: I was never a fan of this, because in my
 experience those tests are not fixed any time soon. Especially, because
 we do not consider such tickets as 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Ismael Juma
To use the pain analogy, people seem to have really good painkillers and
hence they somehow don't feel the pain already. ;)

The reality is that important and high quality tests will get fixed. Poor
quality tests (low signal to noise ratio) might not get fixed and that's ok.

I'm not opposed to marking the tests as release blockers as a starting
point, but I'm saying it's fine if people triage them and decide they are
not blockers. In fact, that has already happened in the past.

Ismael

On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax  wrote:

> I agree on the test gap argument. However, my worry is, if we don't
> "force the pain", it won't get fixed at all. -- I also know, that we try
> to find an working approach for many years...
>
> My take is that if we disable a test and file a non-blocking Jira, it's
> basically the same as just deleting the test all together and never talk
> about it again. -- I believe, this is not want we aim for, but we aim
> for good test coverage and a way to get these test fixed?
>
> Thus IMHO we need some forcing function (either keep the tests and feel
> the pain on every PR), or disable the test and file a blocker JIRA so
> the pain surfaces on a release forcing us to do something about it.
>
> If there is no forcing function, it basically means we are willing to
> accept test gaps forever.
>
>
> -Matthias
>
> On 11/14/23 9:09 PM, Ismael Juma wrote:
> > Matthias,
> >
> > Flaky tests are worse than useless. I know engineers find it hard to
> > disable them because of the supposed test gap (I find it hard too), but
> the
> > truth is that the test gap is already there! No-one blocks merging PRs on
> > flaky tests, but they do get used to ignoring build failures.
> >
> > The current approach has been attempted for nearly a decade and it has
> > never worked. I think we should try something different.
> >
> > When it comes to marking flaky tests as release blockers, I don't think
> > this should be done as a general rule. We should instead assess on a case
> > by case basis, same way we do it for bugs.
> >
> > Ismael
> >
> > On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax 
> wrote:
> >
> >> Thanks for starting this discussion David! I totally agree to "no"!
> >>
> >> I think there is no excuse whatsoever for merging PRs with compilation
> >> errors (except an honest mistake for conflicting PRs that got merged
> >> interleaved). -- Every committer must(!) check the Jenkins status before
> >> merging to avoid such an issue.
> >>
> >> Similar for actual permanently broken tests. If there is no green build,
> >> and the same test failed across multiple Jenkins runs, a committer
> >> should detect this and cannot merge a PR.
> >>
> >> Given the current state of the CI pipeline, it seems possible to get
> >> green runs, and thus I support the policy (that we actually always had)
> >> to only merge if there is at least one green build. If committers got
> >> sloppy about this, we need to call it out and put a hold on this
> practice.
> >>
> >> (The only exception from the above policy would be a very unstable
> >> status for which getting a green build is not possible at all, due to
> >> too many flaky tests -- for this case, I would accept to merge even
> >> there is no green build, but committer need to manually ensure that
> >> every test did pass in at least one test run. -- We had this in the
> >> past, but we I don't think we are in such a bad situation right now).
> >>
> >> About disabling tests: I was never a fan of this, because in my
> >> experience those tests are not fixed any time soon. Especially, because
> >> we do not consider such tickets as release blockers. To me, seeing tests
> >> fails on PR build is actually a good forcing function for people to feel
> >> the pain, and thus get motivated to make time to fix the tests.
> >>
> >> I have to admit that I was a little bit sloppy paying attention to flaky
> >> tests recently, and I highly appreciate this effort. Also thanks to
> >> everyone how actually filed a ticket! IMHO, we should file a ticket for
> >> every flaky test, and also keep adding comments each time we see a test
> >> fail to be able to track the frequency at which a tests fails, so we can
> >> fix the most pressing ones first.
> >>
> >> Te me, the best forcing function to get test stabilized is to file
> >> tickets and consider them release blockers. Disabling tests does not
> >> really help much IMHO to tackle the problem (we can of course still
> >> disable them to get noise out of the system, but it would only introduce
> >> testing gaps for the time being and also does not help to figure out how
> >> often a test fails, so it's not a solution to the problem IMHO)
> >>
> >>
> >> -Matthias
> >>
> >> On 11/13/23 11:40 PM, Sagar wrote:
> >>> Hi Divij,
> >>>
> >>> I think this proposal overall makes sense. My only nit sort of a
> >> suggestion
> >>> is that let's also consider a label called newbie++[1] for flaky tests
> if
> >>> we are considering 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Chris Egerton
One potential forcing function could be to allow an unconditional revert of
any commit (including testing and non-testing changes) that can be shown to
have introduced test flakiness. This could at least prevent us from
accruing more flaky tests, though it would obviously not be viable for
existing tests that touch on already-released features.

I also liked the earlier suggestion that we immediately begin gating merges
on the successful completion of all parts of the build except tests. This
would prevent the kinds of compilation/Checkstyle errors that kicked off
this discussion.

On Wed, Nov 15, 2023, 15:02 Matthias J. Sax  wrote:

> I agree on the test gap argument. However, my worry is, if we don't
> "force the pain", it won't get fixed at all. -- I also know, that we try
> to find an working approach for many years...
>
> My take is that if we disable a test and file a non-blocking Jira, it's
> basically the same as just deleting the test all together and never talk
> about it again. -- I believe, this is not want we aim for, but we aim
> for good test coverage and a way to get these test fixed?
>
> Thus IMHO we need some forcing function (either keep the tests and feel
> the pain on every PR), or disable the test and file a blocker JIRA so
> the pain surfaces on a release forcing us to do something about it.
>
> If there is no forcing function, it basically means we are willing to
> accept test gaps forever.
>
>
> -Matthias
>
> On 11/14/23 9:09 PM, Ismael Juma wrote:
> > Matthias,
> >
> > Flaky tests are worse than useless. I know engineers find it hard to
> > disable them because of the supposed test gap (I find it hard too), but
> the
> > truth is that the test gap is already there! No-one blocks merging PRs on
> > flaky tests, but they do get used to ignoring build failures.
> >
> > The current approach has been attempted for nearly a decade and it has
> > never worked. I think we should try something different.
> >
> > When it comes to marking flaky tests as release blockers, I don't think
> > this should be done as a general rule. We should instead assess on a case
> > by case basis, same way we do it for bugs.
> >
> > Ismael
> >
> > On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax 
> wrote:
> >
> >> Thanks for starting this discussion David! I totally agree to "no"!
> >>
> >> I think there is no excuse whatsoever for merging PRs with compilation
> >> errors (except an honest mistake for conflicting PRs that got merged
> >> interleaved). -- Every committer must(!) check the Jenkins status before
> >> merging to avoid such an issue.
> >>
> >> Similar for actual permanently broken tests. If there is no green build,
> >> and the same test failed across multiple Jenkins runs, a committer
> >> should detect this and cannot merge a PR.
> >>
> >> Given the current state of the CI pipeline, it seems possible to get
> >> green runs, and thus I support the policy (that we actually always had)
> >> to only merge if there is at least one green build. If committers got
> >> sloppy about this, we need to call it out and put a hold on this
> practice.
> >>
> >> (The only exception from the above policy would be a very unstable
> >> status for which getting a green build is not possible at all, due to
> >> too many flaky tests -- for this case, I would accept to merge even
> >> there is no green build, but committer need to manually ensure that
> >> every test did pass in at least one test run. -- We had this in the
> >> past, but we I don't think we are in such a bad situation right now).
> >>
> >> About disabling tests: I was never a fan of this, because in my
> >> experience those tests are not fixed any time soon. Especially, because
> >> we do not consider such tickets as release blockers. To me, seeing tests
> >> fails on PR build is actually a good forcing function for people to feel
> >> the pain, and thus get motivated to make time to fix the tests.
> >>
> >> I have to admit that I was a little bit sloppy paying attention to flaky
> >> tests recently, and I highly appreciate this effort. Also thanks to
> >> everyone how actually filed a ticket! IMHO, we should file a ticket for
> >> every flaky test, and also keep adding comments each time we see a test
> >> fail to be able to track the frequency at which a tests fails, so we can
> >> fix the most pressing ones first.
> >>
> >> Te me, the best forcing function to get test stabilized is to file
> >> tickets and consider them release blockers. Disabling tests does not
> >> really help much IMHO to tackle the problem (we can of course still
> >> disable them to get noise out of the system, but it would only introduce
> >> testing gaps for the time being and also does not help to figure out how
> >> often a test fails, so it's not a solution to the problem IMHO)
> >>
> >>
> >> -Matthias
> >>
> >> On 11/13/23 11:40 PM, Sagar wrote:
> >>> Hi Divij,
> >>>
> >>> I think this proposal overall makes sense. My only nit sort of a
> >> suggestion
> >>> is that 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Matthias J. Sax
I agree on the test gap argument. However, my worry is that if we don't 
"force the pain", it won't get fixed at all. -- I also know that we have tried 
to find a working approach for many years...


My take is that if we disable a test and file a non-blocking Jira, it's 
basically the same as deleting the test altogether and never talking 
about it again. -- I believe this is not what we aim for; we aim 
for good test coverage and a way to get these tests fixed.


Thus IMHO we need some forcing function: either keep the tests and feel 
the pain on every PR, or disable the test and file a blocker JIRA so 
the pain surfaces on a release, forcing us to do something about it.


If there is no forcing function, it basically means we are willing to 
accept test gaps forever.



-Matthias

On 11/14/23 9:09 PM, Ismael Juma wrote:

Matthias,

Flaky tests are worse than useless. I know engineers find it hard to
disable them because of the supposed test gap (I find it hard too), but the
truth is that the test gap is already there! No-one blocks merging PRs on
flaky tests, but they do get used to ignoring build failures.

The current approach has been attempted for nearly a decade and it has
never worked. I think we should try something different.

When it comes to marking flaky tests as release blockers, I don't think
this should be done as a general rule. We should instead assess on a case
by case basis, same way we do it for bugs.

Ismael

On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax  wrote:


Thanks for starting this discussion David! I totally agree to "no"!

I think there is no excuse whatsoever for merging PRs with compilation
errors (except an honest mistake for conflicting PRs that got merged
interleaved). -- Every committer must(!) check the Jenkins status before
merging to avoid such an issue.

Similar for actual permanently broken tests. If there is no green build,
and the same test failed across multiple Jenkins runs, a committer
should detect this and cannot merge a PR.

Given the current state of the CI pipeline, it seems possible to get
green runs, and thus I support the policy (that we actually always had)
to only merge if there is at least one green build. If committers got
sloppy about this, we need to call it out and put a hold on this practice.

(The only exception from the above policy would be a very unstable
status for which getting a green build is not possible at all, due to
too many flaky tests -- for this case, I would accept to merge even
there is no green build, but committer need to manually ensure that
every test did pass in at least one test run. -- We had this in the
past, but we I don't think we are in such a bad situation right now).

About disabling tests: I was never a fan of this, because in my
experience those tests are not fixed any time soon. Especially, because
we do not consider such tickets as release blockers. To me, seeing tests
fails on PR build is actually a good forcing function for people to feel
the pain, and thus get motivated to make time to fix the tests.

I have to admit that I was a little bit sloppy paying attention to flaky
tests recently, and I highly appreciate this effort. Also thanks to
everyone how actually filed a ticket! IMHO, we should file a ticket for
every flaky test, and also keep adding comments each time we see a test
fail to be able to track the frequency at which a tests fails, so we can
fix the most pressing ones first.

Te me, the best forcing function to get test stabilized is to file
tickets and consider them release blockers. Disabling tests does not
really help much IMHO to tackle the problem (we can of course still
disable them to get noise out of the system, but it would only introduce
testing gaps for the time being and also does not help to figure out how
often a test fails, so it's not a solution to the problem IMHO)


-Matthias

On 11/13/23 11:40 PM, Sagar wrote:

Hi Divij,

I think this proposal overall makes sense. My only nit sort of a

suggestion

is that let's also consider a label called newbie++[1] for flaky tests if
we are considering adding newbie as a label. I think some of the flaky
tests need familiarity with the codebase or the test setup so as a first
time contributor, it might be difficult. newbie++ IMO covers that aspect.

[1]


https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22


Let me know what you think.

Thanks!
Sagar.

On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya 
wrote:


   Please, do it.

We can use specific labels to effectively filter those tickets.

We already have a label and a way to discover flaky tests. They are

tagged

with the label "flaky-test" [1]. There is also a label "newbie" [2]

meant

for folks who are new to Apache Kafka code base.
My suggestion is to send a broader email to the community (since many

will

miss details in this thread) and call for action for committers to
volunteer as "shepherds" for these tickets. I 

Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-14 Thread Matthias J. Sax

Just catching up on this one.


50) I am also in favor of setting `validTo` in VersionedRecord for 
single-key single-ts lookups; it seems better to return the proper 
timestamp. The timestamp is already in the store and it's cheap to 
extract it and add to the result, and it might be valuable information 
for the user. I am not sure, though, whether we should deprecate the 
existing constructor, because for "latest" it's convenient to have?
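
For illustration, here is a rough sketch of what carrying `validTo` on the 
result could look like. This is a simplified stand-in, not the actual 
VersionedRecord API; the field and constructor shapes are assumptions.

import java.util.Optional;

// Simplified stand-in for a versioned query result; not the real Kafka Streams class.
public class VersionedResult<V> {
    private final V value;
    private final long validFrom;              // record timestamp
    private final Optional<Long> validTo;      // empty for the "latest" version

    public VersionedResult(final V value, final long validFrom) {
        this(value, validFrom, Optional.empty());      // convenient for "latest" lookups
    }

    public VersionedResult(final V value, final long validFrom, final long validTo) {
        this(value, validFrom, Optional.of(validTo));
    }

    private VersionedResult(final V value, final long validFrom, final Optional<Long> validTo) {
        this.value = value;
        this.validFrom = validFrom;
        this.validTo = validTo;
    }

    public V value() { return value; }
    public long timestamp() { return validFrom; }
    public Optional<Long> validTo() { return validTo; }
}

With this shape, callers doing a "latest" lookup keep the two-argument 
constructor, while point-in-time lookups can carry the upper bound of the 
validity interval.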



60) Yes, I meant `VersionedRecord`. Sorry for the mixup.


80) We did discuss this question on KIP-985 (maybe you missed it Bruno). 
It's kinda tricky.


Historically, it seems that IQv1, ie, the `ReadOnlyXxx` interfaces 
provide a clear contract that `range()` is ascending and 
`reverseRange()` is descending.


For `RangeQuery`, the question is whether we implicitly inherited this 
contract. Our conclusion in the KIP-985 discussion was that we did. If 
this holds true, changing the contract would be a breaking change (which 
might still be acceptable, given that the interface is annotated as 
unstable and IQv2 is not widely adopted yet). I am happy to go with the 
3-option contract, but I just want to ensure we all agree it's the right 
way to go and that we are willing to pay the price of backward 
incompatibility.
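
To make the "3-option contract" concrete, a hypothetical shape could look like 
the sketch below; the names are invented for illustration and do not correspond 
to the actual RangeQuery/IQv2 API.

// Hypothetical illustration of a three-way ordering contract for an IQv2 query type.
enum ResultOrder {
    ASCENDING,   // results returned in ascending order
    DESCENDING,  // results returned in descending order
    ANY          // no ordering guarantee; the store may use whatever is cheapest
}

final class OrderedQuery<K> {
    private final K lower;
    private final K upper;
    private final ResultOrder order;

    private OrderedQuery(final K lower, final K upper, final ResultOrder order) {
        this.lower = lower;
        this.upper = upper;
        this.order = order;
    }

    static <K> OrderedQuery<K> withBounds(final K lower, final K upper) {
        return new OrderedQuery<>(lower, upper, ResultOrder.ANY);   // default left open here
    }

    OrderedQuery<K> withAscendingOrder()  { return new OrderedQuery<>(lower, upper, ResultOrder.ASCENDING); }
    OrderedQuery<K> withDescendingOrder() { return new OrderedQuery<>(lower, upper, ResultOrder.DESCENDING); }

    ResultOrder order() { return order; }
}

Whether the default should be ANY or ASCENDING is exactly the 
backward-compatibility question discussed above.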




Do we need a snapshot semantic or can we specify a weaker but still useful semantic? 


I don't think we necessarily need it, but as pointed out by Lucas, all 
existing queries provide it. Overall, my main point is really about not 
implementing something "random", but defining a proper binding contract 
that allows users to reason about it.


In general, I agree that weaker semantics might be sufficient, but I am 
not sure we can implement anything weaker in a reasonable way. Happy 
to be convinced otherwise. (I have some examples that I will omit for 
now, as I hope we can actually go with snapshot semantics.)


The RocksDB Snapshot idea from Lucas sounds very promising. I was not 
aware that we could do this with RocksDB (otherwise I might have 
suggested it on the PR right away). I guess the only open question 
remaining would be whether we can provide the same guarantees for a 
future in-memory implementation of VersionedStores. It sounds possible 
to do, but we should have some level of confidence about it?



-Matthias

On 11/14/23 6:33 AM, Lucas Brutschy wrote:

Hi Alieh,

I agree with Bruno that a weaker guarantee could be useful already,
and it's anyway possible to strengthen the guarantees later on. On the
other hand, it would be nice to provide a consistent view across all
segments if it doesn't come with major downsides, because until now IQ
does provide a consistent view (via iterators), and this would be the
first feature that diverges from that guarantee.

I think a consistent view could be achieved relatively easily by creating a
snapshot (https://github.com/facebook/rocksdb/wiki/Snapshot) that will
ensure a consistent view on a single DB and using
`ReadOptions.setSnapshot` for the gets. Since versioned state stores
segments seem to be backed by a single RocksDB instance (that was
unclear to me during our earlier discussion), a single snapshot should
be enough to guarantee a consistent view - we just need to make sure
to clean up the snapshots after use, similar to iterators. If we
instead need a consistent view across multiple RocksDB instances, we
may have to acquire a write lock on all of those (could use the
current object monitors of our `RocksDB` interface) for the duration
of creating snapshots across all instances - which I think would also
be permissible performance-wise. Snapshots are just a sequence number
and should be pretty lightweight to create (they have, however,
downside when it comes to compaction just like iterators).
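
For reference, a minimal sketch of the RocksJava calls described above, assuming 
a single RocksDB instance; this is illustrative code, not Kafka Streams internals:

import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Snapshot;

final class SnapshotReads {
    // Performs a point lookup against a consistent point-in-time view of the DB.
    static byte[] getAtSnapshot(final RocksDB db, final byte[] key) throws RocksDBException {
        final Snapshot snapshot = db.getSnapshot();   // cheap: essentially a sequence number
        try (final ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot)) {
            return db.get(readOptions, key);          // sees only writes made before the snapshot
        } finally {
            db.releaseSnapshot(snapshot);             // must be released, similar to closing an iterator
        }
    }
}

For a multi-timestamp lookup, the same ReadOptions could be shared by all the 
reads over the latest-value store and the segments, and released once the 
result iterator is closed.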

With that in mind, I would be in favor of at least exploring the
option of using snapshots for a consistent view here, before dropping
this useful guarantee.

Cheers,
Lucas

On Tue, Nov 14, 2023 at 2:20 PM Bruno Cadonna  wrote:


Hi Alieh,

Regarding the semantics/guarantees of the query type:

Do we need a snapshot semantic or can we specify a weaker but still
useful semantic?

An option could be to guarantee that:
1. the query returns the latest version when the query arrived at the
state store
2. the query returns a valid history, i.e., versions with adjacent and
non-overlapping validity intervals.

I think guarantee 1 is quite obvious. Guarantee 2 maybe needs some
explanation. If we do not avoid writes to the state store during the
processing of interactive queries, it might for example happen that the
latest version in the state store moves to data structures that are
responsible for older versions. In our RocksDB implementation that are
the segments. Thus, it could be that during query processing Kafka
Streams reads the latest value x and encounters again x in a segment
because a processor put a newer version of x in the versioned state
store. A similar scenario might also happen to 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Ismael Juma
Matthias,

Flaky tests are worse than useless. I know engineers find it hard to
disable them because of the supposed test gap (I find it hard too), but the
truth is that the test gap is already there! No-one blocks merging PRs on
flaky tests, but they do get used to ignoring build failures.

The current approach has been attempted for nearly a decade and it has
never worked. I think we should try something different.

When it comes to marking flaky tests as release blockers, I don't think
this should be done as a general rule. We should instead assess on a case
by case basis, same way we do it for bugs.

Ismael

On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax  wrote:

> Thanks for starting this discussion David! I totally agree to "no"!
>
> I think there is no excuse whatsoever for merging PRs with compilation
> errors (except an honest mistake for conflicting PRs that got merged
> interleaved). -- Every committer must(!) check the Jenkins status before
> merging to avoid such an issue.
>
> Similar for actual permanently broken tests. If there is no green build,
> and the same test failed across multiple Jenkins runs, a committer
> should detect this and cannot merge a PR.
>
> Given the current state of the CI pipeline, it seems possible to get
> green runs, and thus I support the policy (that we actually always had)
> to only merge if there is at least one green build. If committers got
> sloppy about this, we need to call it out and put a hold on this practice.
>
> (The only exception from the above policy would be a very unstable
> status for which getting a green build is not possible at all, due to
> too many flaky tests -- for this case, I would accept to merge even
> there is no green build, but committer need to manually ensure that
> every test did pass in at least one test run. -- We had this in the
> past, but we I don't think we are in such a bad situation right now).
>
> About disabling tests: I was never a fan of this, because in my
> experience those tests are not fixed any time soon. Especially, because
> we do not consider such tickets as release blockers. To me, seeing tests
> fails on PR build is actually a good forcing function for people to feel
> the pain, and thus get motivated to make time to fix the tests.
>
> I have to admit that I was a little bit sloppy paying attention to flaky
> tests recently, and I highly appreciate this effort. Also thanks to
> everyone how actually filed a ticket! IMHO, we should file a ticket for
> every flaky test, and also keep adding comments each time we see a test
> fail to be able to track the frequency at which a tests fails, so we can
> fix the most pressing ones first.
>
> Te me, the best forcing function to get test stabilized is to file
> tickets and consider them release blockers. Disabling tests does not
> really help much IMHO to tackle the problem (we can of course still
> disable them to get noise out of the system, but it would only introduce
> testing gaps for the time being and also does not help to figure out how
> often a test fails, so it's not a solution to the problem IMHO)
>
>
> -Matthias
>
> On 11/13/23 11:40 PM, Sagar wrote:
> > Hi Divij,
> >
> > I think this proposal overall makes sense. My only nit sort of a
> suggestion
> > is that let's also consider a label called newbie++[1] for flaky tests if
> > we are considering adding newbie as a label. I think some of the flaky
> > tests need familiarity with the codebase or the test setup so as a first
> > time contributor, it might be difficult. newbie++ IMO covers that aspect.
> >
> > [1]
> >
> https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22
> >
> > Let me know what you think.
> >
> > Thanks!
> > Sagar.
> >
> > On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya 
> > wrote:
> >
> >>>   Please, do it.
> >> We can use specific labels to effectively filter those tickets.
> >>
> >> We already have a label and a way to discover flaky tests. They are
> tagged
> >> with the label "flaky-test" [1]. There is also a label "newbie" [2]
> meant
> >> for folks who are new to Apache Kafka code base.
> >> My suggestion is to send a broader email to the community (since many
> will
> >> miss details in this thread) and call for action for committers to
> >> volunteer as "shepherds" for these tickets. I can send one out once we
> have
> >> some consensus wrt next steps in this thread.
> >>
> >>
> >> [1]
> >>
> >>
> https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >>
> >>
> >> [2] https://kafka.apache.org/contributing -> Finding a project to work
> on
> >>
> >>
> >> Divij Vaidya
> >>
> >>
> >>
> >> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков 
> >> wrote:
> >>
> >>>
>  To kickstart 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2382

2023-11-14 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-15829) How to build Kafka 2.7 with maven instead of gradle?

2023-11-14 Thread rain.liang (Jira)
rain.liang created KAFKA-15829:
--

 Summary: How to build Kafka 2.7 with maven instead of gradle?
 Key: KAFKA-15829
 URL: https://issues.apache.org/jira/browse/KAFKA-15829
 Project: Kafka
  Issue Type: Wish
Affects Versions: 2.7.2
Reporter: rain.liang


It's difficult to upgrade the version of Gradle when building Kafka. Is there a 
solution to build Kafka 2.7 with Maven?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Justine Olshan
Thanks Matthias -- I agree that there are lots of flaky test JIRAs but not
a forcing function to fix them.
One option is to mark them as blockers as you alluded to.

One other thing to keep in mind is that we have the gradle enterprise tool
to track flaky tests!
https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=trunk

I think there are some false positives that may come up if we look at PR
builds too, but we may be able to use tags to filter things out.

Justine

On Tue, Nov 14, 2023 at 5:01 PM Matthias J. Sax  wrote:

> Thanks for starting this discussion David! I totally agree to "no"!
>
> I think there is no excuse whatsoever for merging PRs with compilation
> errors (except an honest mistake for conflicting PRs that got merged
> interleaved). -- Every committer must(!) check the Jenkins status before
> merging to avoid such an issue.
>
> Similar for actual permanently broken tests. If there is no green build,
> and the same test failed across multiple Jenkins runs, a committer
> should detect this and cannot merge a PR.
>
> Given the current state of the CI pipeline, it seems possible to get
> green runs, and thus I support the policy (that we actually always had)
> to only merge if there is at least one green build. If committers got
> sloppy about this, we need to call it out and put a hold on this practice.
>
> (The only exception from the above policy would be a very unstable
> status for which getting a green build is not possible at all, due to
> too many flaky tests -- for this case, I would accept to merge even
> there is no green build, but committer need to manually ensure that
> every test did pass in at least one test run. -- We had this in the
> past, but we I don't think we are in such a bad situation right now).
>
> About disabling tests: I was never a fan of this, because in my
> experience those tests are not fixed any time soon. Especially, because
> we do not consider such tickets as release blockers. To me, seeing tests
> fails on PR build is actually a good forcing function for people to feel
> the pain, and thus get motivated to make time to fix the tests.
>
> I have to admit that I was a little bit sloppy paying attention to flaky
> tests recently, and I highly appreciate this effort. Also thanks to
> everyone how actually filed a ticket! IMHO, we should file a ticket for
> every flaky test, and also keep adding comments each time we see a test
> fail to be able to track the frequency at which a tests fails, so we can
> fix the most pressing ones first.
>
> Te me, the best forcing function to get test stabilized is to file
> tickets and consider them release blockers. Disabling tests does not
> really help much IMHO to tackle the problem (we can of course still
> disable them to get noise out of the system, but it would only introduce
> testing gaps for the time being and also does not help to figure out how
> often a test fails, so it's not a solution to the problem IMHO)
>
>
> -Matthias
>
> On 11/13/23 11:40 PM, Sagar wrote:
> > Hi Divij,
> >
> > I think this proposal overall makes sense. My only nit sort of a
> suggestion
> > is that let's also consider a label called newbie++[1] for flaky tests if
> > we are considering adding newbie as a label. I think some of the flaky
> > tests need familiarity with the codebase or the test setup so as a first
> > time contributor, it might be difficult. newbie++ IMO covers that aspect.
> >
> > [1]
> >
> https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22
> >
> > Let me know what you think.
> >
> > Thanks!
> > Sagar.
> >
> > On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya 
> > wrote:
> >
> >>>   Please, do it.
> >> We can use specific labels to effectively filter those tickets.
> >>
> >> We already have a label and a way to discover flaky tests. They are
> tagged
> >> with the label "flaky-test" [1]. There is also a label "newbie" [2]
> meant
> >> for folks who are new to Apache Kafka code base.
> >> My suggestion is to send a broader email to the community (since many
> will
> >> miss details in this thread) and call for action for committers to
> >> volunteer as "shepherds" for these tickets. I can send one out once we
> have
> >> some consensus wrt next steps in this thread.
> >>
> >>
> >> [1]
> >>
> >>
> https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >>
> >>
> >> [2] https://kafka.apache.org/contributing -> Finding a project to work
> on
> >>
> >>
> >> Divij Vaidya
> >>
> >>
> >>
> >> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков 
> >> wrote:
> >>
> >>>
>  To kickstart this effort, we can publish a list of such tickets in the
> >>> community and assign one or more committers the role of a «shepherd"

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-14 Thread Matthias J. Sax

Thanks for starting this discussion David! I totally agree to "no"!

I think there is no excuse whatsoever for merging PRs with compilation 
errors (except an honest mistake for conflicting PRs that got merged 
interleaved). -- Every committer must(!) check the Jenkins status before 
merging to avoid such an issue.


The same goes for actual permanently broken tests: if there is no green build, 
and the same test failed across multiple Jenkins runs, a committer 
should detect this and must not merge the PR.


Given the current state of the CI pipeline, it seems possible to get 
green runs, and thus I support the policy (that we actually always had) 
to only merge if there is at least one green build. If committers got 
sloppy about this, we need to call it out and put a hold on this practice.


(The only exception to the above policy would be a very unstable 
status for which getting a green build is not possible at all, due to 
too many flaky tests -- in that case, I would accept merging even if 
there is no green build, but committers need to manually ensure that 
every test did pass in at least one test run. -- We had this in the 
past, but I don't think we are in such a bad situation right now.)


About disabling tests: I was never a fan of this, because in my 
experience those tests are not fixed any time soon, especially because 
we do not consider such tickets release blockers. To me, seeing tests 
fail on PR builds is actually a good forcing function for people to feel 
the pain, and thus get motivated to make time to fix the tests.


I have to admit that I was a little bit sloppy about paying attention to flaky 
tests recently, and I highly appreciate this effort. Also thanks to 
everyone who actually filed a ticket! IMHO, we should file a ticket for 
every flaky test, and also keep adding comments each time we see a test 
fail, so we can track the frequency at which a test fails and 
fix the most pressing ones first.


To me, the best forcing function to get tests stabilized is to file 
tickets and consider them release blockers. Disabling tests does not 
really help much IMHO to tackle the problem (we can of course still 
disable them to get the noise out of the system, but that would only introduce 
testing gaps for the time being and also does not help to figure out how 
often a test fails, so it's not a solution to the problem IMHO).



-Matthias

On 11/13/23 11:40 PM, Sagar wrote:

Hi Divij,

I think this proposal overall makes sense. My only nit sort of a suggestion
is that let's also consider a label called newbie++[1] for flaky tests if
we are considering adding newbie as a label. I think some of the flaky
tests need familiarity with the codebase or the test setup so as a first
time contributor, it might be difficult. newbie++ IMO covers that aspect.

[1]
https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22

Let me know what you think.

Thanks!
Sagar.

On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya 
wrote:


  Please, do it.

We can use specific labels to effectively filter those tickets.

We already have a label and a way to discover flaky tests. They are tagged
with the label "flaky-test" [1]. There is also a label "newbie" [2] meant
for folks who are new to Apache Kafka code base.
My suggestion is to send a broader email to the community (since many will
miss details in this thread) and call for action for committers to
volunteer as "shepherds" for these tickets. I can send one out once we have
some consensus wrt next steps in this thread.


[1]

https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC


[2] https://kafka.apache.org/contributing -> Finding a project to work on


Divij Vaidya



On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков 
wrote:




To kickstart this effort, we can publish a list of such tickets in the

community and assign one or more committers the role of a «shepherd" for
each ticket.

Please, do it.
We can use specific label to effectively filter those tickets.


13 нояб. 2023 г., в 15:16, Divij Vaidya 

написал(а):


Thanks for bringing this up David.

My primary concern revolves around the possibility that the currently
disabled tests may remain inactive indefinitely. We currently have
unresolved JIRA tickets for flaky tests that have been pending for an
extended period. I am inclined to support the idea of disabling these

tests

temporarily and merging changes only when the build is successful,

provided

there is a clear plan for re-enabling them in the future.

To address this issue, I propose the following measures:

1\ Foster a supportive environment for new contributors within the
community, encouraging them to take on tickets associated with flaky

tests.


[jira] [Created] (KAFKA-15828) Protect clients from broker hostname reuse

2023-11-14 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-15828:
---

 Summary: Protect clients from broker hostname reuse
 Key: KAFKA-15828
 URL: https://issues.apache.org/jira/browse/KAFKA-15828
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


In some environments such as k8s, brokers may be assigned to nodes dynamically 
from an available pool. When a cluster is rolling, it is possible for the 
client to see the same node advertised for different broker IDs in a short 
period of time. For example, kafka-1 might be initially assigned to node1. 
Before the client is able to establish a connection, it could be that kafka-3 
is now on node1 instead. Currently there is no protection in the client or in 
the protocol for this scenario. If the connection succeeds, the client will 
assume it has a good connection to kafka-1. Until something disrupts the 
connection, it will continue under this assumption even if the hostname for 
kafka-1 changes.

We have observed this scenario in practice. The client connected to the wrong 
broker through stale hostname information. It was unable to produce data 
because of persistent NOT_LEADER errors. The only way to recover in the end was 
by restarting the client to force a reconnection.

We have discussed a couple potential solutions to this problem:
 # Make the client smarter about managing the connection/hostname mapping. When it 
detects that a hostname has changed, it should force a disconnect to ensure it 
connects to the right node (a rough sketch follows this list).
 # We can modify the protocol to verify that the client has connected to the 
intended broker. For example, we can add a field to ApiVersions to indicate the 
intended broker ID. The broker receiving the request can return an error if its 
ID does not match that in the request.
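
To make option 1 a bit more concrete, here is a purely illustrative sketch (not 
Kafka client code) of tracking the advertised endpoint per broker ID so the 
client can drop a connection whose endpoint has moved:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: remember the last advertised endpoint per broker id and report changes,
// so the caller can force a disconnect and reconnect to the correct node.
final class BrokerEndpointTracker {
    private final Map<Integer, String> lastKnownEndpoint = new HashMap<>();

    /** Returns true if the advertised endpoint for this broker id differs from the last one seen. */
    synchronized boolean endpointChanged(final int brokerId, final String host, final int port) {
        final String endpoint = host + ":" + port;
        final String previous = lastKnownEndpoint.put(brokerId, endpoint);
        return previous != null && !previous.equals(endpoint);
    }
}
{code}

On every metadata update, the client could run each advertised broker through 
endpointChanged() and close any live connection whose endpoint has moved, 
instead of continuing to trust the stale socket.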

Are there alternatives? 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15827) KafkaBasedLog.withExistingClients leaks clients if start is not called

2023-11-14 Thread Greg Harris (Jira)
Greg Harris created KAFKA-15827:
---

 Summary: KafkaBasedLog.withExistingClients leaks clients if start 
is not called
 Key: KAFKA-15827
 URL: https://issues.apache.org/jira/browse/KAFKA-15827
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Affects Versions: 3.6.0
Reporter: Greg Harris
Assignee: Greg Harris


The KafkaBasedLog base implementation creates consumers and producers, and 
closes them after they are instantiated. There are subclasses of the 
KafkaBasedLog which accept pre-created consumers and producers, and have the 
responsibility for closing the clients when the KafkaBasedLog is stopped.

It appears that the KafkaBasedLog subclasses do not close the clients when 
start() is skipped and stop() is called directly. This happens in a few tests, 
and causes the passed-in clients to be leaked.
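
As a rough illustration of the shape of a fix (the field names and structure 
below are simplified assumptions, not the actual KafkaBasedLog internals):

{code}
import java.util.Optional;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.utils.Utils;

// Simplified stand-in for a log that wraps externally created clients.
final class ExistingClientsLog<K, V> {
    private final Optional<Producer<K, V>> producer;
    private final Consumer<K, V> consumer;

    ExistingClientsLog(final Optional<Producer<K, V>> producer, final Consumer<K, V> consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }

    void start() {
        // ... seed the log by reading the backing topic ...
    }

    void stop() {
        // Close the pre-created clients unconditionally, so they are not leaked
        // even when stop() is called without a preceding start().
        producer.ifPresent(p -> Utils.closeQuietly(p, "producer"));
        Utils.closeQuietly(consumer, "consumer");
    }
}
{code}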



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] KIP-1001; CurrentControllerId Metric

2023-11-14 Thread Colin McCabe
Hi all,

I'd like to call a vote for KIP-1001: Add CurrentControllerId metric.

Take a look here:
https://cwiki.apache.org/confluence/x/egyZE

best,
Colin


Re: [DISCUSS] Road to Kafka 4.0

2023-11-14 Thread Colin McCabe
On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
> Hi Colin,
>
> Thank you for your thoughtful and comprehensive response.
>
>> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
>> several KIPs that happened this year and last year. The most notable was
>> probably KIP-866, which was approved in May 2022.
>
> I understand this is the case, I'm raising my concern because I was
> foreseeing some major pain points as a consequence of this decision. Just
> to make it clear though: I am not asking for anyone to do work for me, and
> I understand the limitations of resources available to implement features.
> What I was asking is rather to consider the implications of _removing_
> features before there exists a replacement for them.
>
> I understand that the timeframe for 3.7 isn't feasible, and because of that
> I think what I was asking is rather: can we make sure that there are more
> 3.x releases until controller quorum online resizing is implemented?
>
> From your response, I gather that your stance is that it's important to
> drop ZK support sooner rather than later and that the necessary pieces for
> doing so are already in place.

Hi Anton,

Yes. I'm basically just repeating what we agreed upon in 2022 as part of 
KIP-833.

>
> ---
>
> I want to make sure I've understood your suggested sequence for controller
> node replacement. I hope the mentions of Kubernetes are rather for examples
> of how to carry things out, rather than saying "this is only supported on
> Kubernetes"?

Apache Kafka is supported in lots of environments, including non-k8s ones. I 
was just pointing out that using k8s means that you control your own DNS 
resolution, which simplifies matters. If you don't control DNS there are some 
extra steps for changing the quorum voters.

>
> Given we have three existing nodes as such:
>
> - a.local -> 192.168.0.100
> - b.local -> 192.168.0.101
> - c.local -> 192.168.0.102
>
> As well as a candidate node 192.168.0.103 that we want to replace for the
> role of c.local.
>
> 1. Shut down controller process on node .102 (to make sure we don't "go
> back in time").
> 2. rsync state from leader to .103.
> 3. Start controller process on .103.
> 4. Point the c.local entry at .103.
>
> I have a few questions about this sequence:
>
> 1. Would this sequence be safe against leadership changes?
>

If the leader changes, the new leader should have all of the committed entries 
that the old leader had.

> 2. Does it work

Probably the biggest issue is dealing with "torn writes" that happen because 
you're copying the current log segment while it's being written to. The system 
should be robust against this. However, we don't regularly do this, so there 
hasn't been a lot of testing.

I think Jose had a PR for improving the handling of this which we might want to 
dig up. We'd want the system to auto-truncate the partial record at the end of 
the log, if there is one.

> 3. By "state", do we mean `metadata.log.dir`? Something else?

Yes, the state of the metadata.log.dir. Keep in mind you will need to change 
the node ID in meta.properties after copying, of course.
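
As a small illustration of that last step, one could patch the copied file with 
something like the snippet below. The "node.id" key name is an assumption; 
check it against the actual meta.properties on your controllers.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Rewrites the node id in a copied meta.properties file.
// Usage: java PatchMetaProperties /path/to/metadata.log.dir/meta.properties 3
public final class PatchMetaProperties {
    public static void main(final String[] args) throws IOException {
        final Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            props.load(in);
        }
        props.setProperty("node.id", args[1]);   // assumed key name for the controller's id
        try (FileOutputStream out = new FileOutputStream(args[0])) {
            props.store(out, "node id patched after copying controller state");
        }
    }
}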

> 4. What are the effects on cluster availability? (I think this is the same
> as asking what happens if a or b crashes during the process, or if network
> partitions occur).

Cluster metadata state tends to be pretty small, typically a hundred megabytes 
or so. Therefore, I do not think it will take more than a second or two to copy 
from one node to another. However, if you do experience a crash when one node 
out of three is down, then you will be unavailable until you can bring up a 
second node to regain a majority.

>
> ---
>
> If this is considered the official way of handling controller node
> replacements, does it make sense to improve documentation in this area? Is
> there already a plan for this documentation layed out in some KIPs? This is
> something I'd be happy to contribute to.
>

Yes, I think we should have official documentation about this. We'd be happy to 
review anything in that area.

>> To circle back to KIP-853, I think it stands a good chance of making it
>> into AK 4.0.
>
> This sounds good, but the point I was making was if we could have a release
> with both KRaft and ZK supporting this feature to ease the migration out of
> ZK.
>

The problem is that supporting multiple controller implementations is a huge 
burden, so we don't want to extend the 3.x line past the point that's needed 
to complete all the must-dos (SCRAM, delegation tokens, JBOD).

best,
Colin


> BR,
> Anton
>
> Den tors 9 nov. 2023 kl 23:04 skrev Colin McCabe :
>
>> Hi Anton,
>>
>> It rarely makes sense to scale up and down the number of controller nodes
>> in the cluster. Only one controller node will be active at any given time.
>> The main reason to use 5 nodes would be to be able to tolerate 2 failures
>> instead of 1.
>>
>> At Confluent, we generally run KRaft with 3 controllers. We have not seen
>> problems with this setup, 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #113

2023-11-14 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 308234 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@127373e2, 
org.apache.kafka.test.MockInternalProcessorContext@2b0001a8 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@127373e2, 
org.apache.kafka.test.MockInternalProcessorContext@2b0001a8 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@689092fb, 
org.apache.kafka.test.MockInternalProcessorContext@39cd0c0a STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@689092fb, 
org.apache.kafka.test.MockInternalProcessorContext@39cd0c0a PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@7e49af91, 
org.apache.kafka.test.MockInternalProcessorContext@71f1f69b STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@7e49af91, 
org.apache.kafka.test.MockInternalProcessorContext@71f1f69b PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@7846d3ff, 
org.apache.kafka.test.MockInternalProcessorContext@24cbff2a STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@7846d3ff, 
org.apache.kafka.test.MockInternalProcessorContext@24cbff2a PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfDbToAddWasAlreadyAddedForOtherSegment() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfDbToAddWasAlreadyAddedForOtherSegment() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldAddItselfToRecordingTriggerWhenFirstValueProvidersAreAddedAfterLastValueProvidersWereRemoved()
 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldAddItselfToRecordingTriggerWhenFirstValueProvidersAreAddedAfterLastValueProvidersWereRemoved()
 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > shouldThrowIfValueProvidersToRemoveNotFound() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > shouldThrowIfValueProvidersToRemoveNotFound() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsSameAsOnlyOneOfMultipleCaches() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsSameAsOnlyOneOfMultipleCaches() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldNotSetStatsLevelToExceptDetailedTimersWhenValueProvidersWithoutStatisticsAreAdded()
 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldNotSetStatsLevelToExceptDetailedTimersWhenValueProvidersWithoutStatisticsAreAdded()
 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldNotRemoveItselfFromRecordingTriggerWhenAtLeastOneValueProviderIsPresent() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldNotRemoveItselfFromRecordingTriggerWhenAtLeastOneValueProviderIsPresent() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 
shouldRemoveItselfFromRecordingTriggerWhenAllValueProvidersAreRemoved() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
RocksDBMetricsRecorderTest > 

[jira] [Created] (KAFKA-15826) WorkerSinkTask leaks Consumer if plugin start or stop blocks indefinitely

2023-11-14 Thread Greg Harris (Jira)
Greg Harris created KAFKA-15826:
---

 Summary: WorkerSinkTask leaks Consumer if plugin start or stop 
blocks indefinitely
 Key: KAFKA-15826
 URL: https://issues.apache.org/jira/browse/KAFKA-15826
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Affects Versions: 3.6.0
Reporter: Greg Harris
Assignee: Greg Harris


The WorkerSourceTask cancel() method closes the Producer, releasing its 
resources. The WorkerSinkTask does not do the same for the Consumer, as it does 
not override the cancel() method.

WorkerSinkTask should close the consumer if the task is cancelled, as progress 
for a cancelled task will be discarded anyway. 
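
A rough sketch of the shape of such a change (heavily simplified; the real 
WorkerSinkTask also has to respect the consumer's single-threaded access rules, 
which this sketch glosses over):

{code}
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.utils.Utils;

// Simplified stand-in for a sink task that owns a consumer.
final class SinkTaskSketch {
    private final Consumer<byte[], byte[]> consumer;

    SinkTaskSketch(final Consumer<byte[], byte[]> consumer) {
        this.consumer = consumer;
    }

    // Mirrors what WorkerSourceTask does for its producer: once the task is
    // cancelled its progress is discarded, so the consumer can be released.
    public void cancel() {
        consumer.wakeup();   // thread-safe; unblocks a poll() that may be in progress
        // In a real worker the close must be coordinated with the thread that owns
        // the consumer; here we close directly for brevity.
        Utils.closeQuietly(consumer, "sink task consumer");
    }
}
{code}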



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15825) KRaft controller writes empty state to ZK after migration

2023-11-14 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15825.
--
Resolution: Fixed

This bug was fixed as part of KAFKA-15605

> KRaft controller writes empty state to ZK after migration
> -
>
> Key: KAFKA-15825
> URL: https://issues.apache.org/jira/browse/KAFKA-15825
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 3.6.0
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> Immediately following the ZK migration, there is a race condition where the 
> KRaftMigrationDriver can use an empty MetadataImage when performing the full 
> "SYNC_KRAFT_TO_ZK" reconciliation. 
> After the next controller failover, or when the controller loads a metadata 
> snapshot, the correct state will be written to ZK. 
> The symptom of this bug is that we see the migration complete, and then all 
> the metadata removed from ZK. For example, 
> {code}
> [KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
> to KRaft. 573 records were generated in 2204 ms across 51 batches. The record 
> types were {TOPIC_RECORD=41, PARTITION_RECORD=410, CONFIG_RECORD=121, 
> PRODUCER_IDS_RECORD=1}. The current metadata offset is now 503794 with an 
> epoch of 21. Saw 6 brokers in the migrated metadata [0, 1, 2, 3, 4, 5].
> {code}
> immediately followed by:
> {code}
> [KRaftMigrationDriver id=9990] Made the following ZK writes when reconciling 
> with KRaft state: {DeleteBrokerConfig=7, DeleteTopic=41, UpdateTopicConfig=41}
> {code}
> If affected by this, a quick workaround is to cause the controller to 
> failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15825) KRaft controller writes empty state to ZK after migration

2023-11-14 Thread David Arthur (Jira)
David Arthur created KAFKA-15825:


 Summary: KRaft controller writes empty state to ZK after migration
 Key: KAFKA-15825
 URL: https://issues.apache.org/jira/browse/KAFKA-15825
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 3.6.0
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.7.0, 3.6.1


Immediately following the ZK migration, there is a race condition where the 
KRaftMigrationDriver can use an empty MetadataImage when performing the full 
"SYNC_KRAFT_TO_ZK" reconciliation. 

After the next controller failover, or when the controller loads a metadata 
snapshot, the correct state will be written to ZK. 

The symptom of this bug is that we see the migration complete, and then all the 
metadata removed from ZK. For example, 

{code}
[KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
to KRaft. 573 records were generated in 2204 ms across 51 batches. The record 
types were {TOPIC_RECORD=41, PARTITION_RECORD=410, CONFIG_RECORD=121, 
PRODUCER_IDS_RECORD=1}. The current metadata offset is now 503794 with an epoch 
of 21. Saw 6 brokers in the migrated metadata [0, 1, 2, 3, 4, 5].
{code}

immediately followed by:

{code}
[KRaftMigrationDriver id=9990] Made the following ZK writes when reconciling 
with KRaft state: {DeleteBrokerConfig=7, DeleteTopic=41, UpdateTopicConfig=41}
{code}

If affected by this, a quick workaround is to cause the controller to failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15605) Topics marked for deletion in ZK are incorrectly migrated to KRaft

2023-11-14 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15605.
--
Fix Version/s: 3.7.0
   Resolution: Fixed

> Topics marked for deletion in ZK are incorrectly migrated to KRaft
> --
>
> Key: KAFKA-15605
> URL: https://issues.apache.org/jira/browse/KAFKA-15605
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, kraft
>Affects Versions: 3.6.0
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> When migrating topics from ZooKeeper, the KRaft controller reads all the 
> topic and partition metadata from ZK directly. This includes topics which 
> have been marked for deletion by the ZK controller. After being migrated to 
> KRaft, the pending topic deletions are never completed, so it is as if the 
> delete topic request never happened.
> Since the client request to delete these topics has already been returned as 
> successful, it would be confusing to the client that the topic still existed. 
> An operator or application would need to issue another topic deletion to 
> remove these topics once the controller had moved to KRaft. If they tried to 
> create a new topic with the same name, they would receive a 
> TOPIC_ALREADY_EXISTS error.
> The migration logic should carry over pending topic deletions and resolve 
> them either as part of the migration or shortly after.
> *Note to operators:*
> To determine if a migration was affected by this, an operator can check the 
> contents of {{/admin/delete_topics}} after the KRaft controller has migrated 
> the metadata. If any topics are listed under this ZNode, they were not 
> deleted and will still be present in KRaft. At this point the operator can 
> make a determination if the topics should be re-deleted (using 
> "kafka-topics.sh --delete") or left in place. In either case, the topics 
> should be removed from {{/admin/delete_topics}} to prevent unexpected topic 
> deletion in the event of a fallback to ZK.
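
For illustration, a minimal sketch of listing anything still parked under 
/admin/delete_topics with the plain ZooKeeper Java client; the connection 
string, session timeout, and class name are placeholders and not part of the 
ticket:

{code}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class PendingDeleteTopicsCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and session timeout; adjust for the real ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        try {
            // Topics listed here were marked for deletion in ZK but never completed.
            List<String> pending = zk.getChildren("/admin/delete_topics", false);
            if (pending.isEmpty()) {
                System.out.println("No pending topic deletions were carried into KRaft.");
            } else {
                System.out.println("Topics still marked for deletion: " + pending);
            }
        } finally {
            zk.close();
        }
    }
}
{code}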



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-979 Allow independently stop KRaft processes

2023-11-14 Thread Colin McCabe
Thanks, Hailey.

+1 (binding)

Colin

On Mon, Nov 13, 2023, at 15:13, Hailey Ni wrote:
> Hi Colin,
>
> Thank you for your review. I removed the "absolute path need to be
> provided" line from the KIP, and will modify the code to get the absolute
> path to the config files using some bash in the kafka-server-start file.
> For your second question, I've added a line in the KIP: "If both parameters
> are provided, the value for the node-id parameter will take precedence, i.e.,
> the process with the specified node id will be killed, regardless of the
> process role provided."
>
> What do you think?
>
> Thanks,
> Hailey
>
> On Thu, Nov 9, 2023 at 4:03 PM Colin McCabe  wrote:
>
>> Hi Hailey,
>>
>> Thanks for the KIP.
>>
>> It feels clunky to have to pass an absolute path to the configuration file
>> when starting the broker or controller. I think we should consider one of
>> two alternate options:
>>
>> 1. Use JMXtool to examine the running kafka.Kafka processes.
>> I believe ID is available from kafka.server, type=app-info,id=1 (replace 1
>> with the actual ID)
>>
>> Role can be deduced by the presence or absence of
>> kafka.server,type=KafkaServer,name=BrokerState for brokers, or
>> kafka.server,type=ControllerServer,name=ClusterId for controllers.
>>
>> 2. Alternately, we could inject the ID and role into the command line in
>> kafka-server-start.sh. Basically add -Dkafka.node.id=1,
>> -Dkafka.node.roles=broker. This would be helpful to people just examining
>> the output of ps.
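
For reference, a rough sketch of option 1, reading the node id and deducing the
role over JMX. Only the MBean names come from the description above; the JMX
service URL, port 9999, and class name are assumptions for illustration:

{code}
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NodeIdAndRoleProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the process was started with remote JMX enabled on port 9999.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // The node id is encoded in the app-info MBean name, e.g. kafka.server:type=app-info,id=1
            Set<ObjectName> appInfo = mbs.queryNames(new ObjectName("kafka.server:type=app-info,id=*"), null);
            appInfo.forEach(name -> System.out.println("node id = " + name.getKeyProperty("id")));

            // Role is deduced from which of these MBeans is registered.
            boolean isBroker = !mbs.queryNames(
                new ObjectName("kafka.server:type=KafkaServer,name=BrokerState"), null).isEmpty();
            boolean isController = !mbs.queryNames(
                new ObjectName("kafka.server:type=ControllerServer,name=ClusterId"), null).isEmpty();
            System.out.println("roles: " + (isBroker ? "broker " : "") + (isController ? "controller" : ""));
        } finally {
            connector.close();
        }
    }
}
{code}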
>>
>> Finally, you state that either command-line option can be given. What
>> happens if both are given?
>>
>> best,
>> Colin
>>
>>
>> On Mon, Oct 23, 2023, at 12:20, Hailey Ni wrote:
>> > Hi Ron,
>> >
>> > I've added the "Rejected Alternatives" section in the KIP. Thanks for the
>> > comments and +1 vote!
>> >
>> > Thanks,
>> > Hailey
>> >
>> > On Mon, Oct 23, 2023 at 6:33 AM Ron Dagostino  wrote:
>> >
>> >> Hi Hailey.  I'm +1 (binding), but could you add a "Rejected
>> >> Alternatives" section to the KIP and mention the "--required-config "
>> >> option that we decided against and the reason why we made the decision
>> >> to reject it?  There were some other small things (dash instead of dot
>> >> in the parameter names, --node-id instead of --broker-id), but
>> >> cosmetic things like this don't warrant a mention, so I think there's
>> >> just the one thing to document.
>> >>
>> >> Thanks for the KIP, and thanks for adjusting it along the way as the
>> >> discussion moved forward.
>> >>
>> >> Ron
>> >>
>> >>
>> >> Ron
>> >>
>> >> On Mon, Oct 23, 2023 at 4:00 AM Federico Valeri 
>> >> wrote:
>> >> >
>> >> > +1 (non binding)
>> >> >
>> >> > Thanks.
>> >> >
>> >> > On Mon, Oct 23, 2023 at 9:48 AM Kamal Chandraprakash
>> >> >  wrote:
>> >> > >
>> >> > > +1 (non-binding). Thanks for the KIP!
>> >> > >
>> >> > > On Mon, Oct 23, 2023, 12:55 Hailey Ni 
>> >> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > I'd like to call a vote on KIP-979 that will allow users to
>> >> independently
>> >> > > > stop KRaft processes.
>> >> > > >
>> >> > > >
>> >> > > >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-979%3A+Allow+independently+stop+KRaft+processes
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Hailey
>> >> > > >
>> >>
>>


Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

2023-11-14 Thread Christo Lolov
Heya everyone,

Apologies for the delay in my response and thank you very much for all your
comments! I will start answering in reverse:

*re: Satish*

101. I am happy to scope down this KIP and start off by emitting those
metrics on a topic level. I had a preference for emitting them on a partition
level because I have run into situations where data wasn't evenly spread
across partitions and not having that granularity made the problem harder to
discover.

102. Fair enough, others have expressed the same preference. I will scope
down the KIP to only bytes-based and segment-based metrics.

103. I agree that we could do this, but I personally prefer this to be a
metric. At the very least a newcomer might not know to look for the log
line, while most metric-collection systems allow you to explore the whole
namespace. For example, I really dislike that while log loading happens
Kafka emits log lines of "X/Y logs loaded" rather than just show me the
progress via a metric. If you are strongly against this, however, I am
happy to scope down on this as well.

104. Ideally we have only one metadata entry in remote storage for every segment
of the correct lineage. Due to leadership changes, however, this is not
always the case. I envisioned that exposing such a metric would show whether
there are problems with too many metadata records that are not part of the
correct lineage of a log.

*re: Luke*

1. I am a bit conflicted on this one. As discussed earlier with Jorge, in
my head such metrics are better left to plugin implementations. If you and
Kamal feel strongly about this being included I will add it to the KIP.

2. After running tiered storage in production for a while I ran into
problems where a partition-level metric would have allowed me to zero in on
the problem sooner. I tried to balance this against not exposing everything on
a partition level so as not to explode the cardinality too much (point 101
from Satish). I haven't run into a situation where knowing the
RemoteLogSizeComputationTime on a partition level helped me, but this
doesn't mean there isn't one.

3. I was thinking that the metric can be emitted while reading of those
records is happening i.e. if it takes a long time then it will just
gradually increase as we read. What do you think?

*re: Jorge*

3.5. Sure, I will aim to add my thoughts to the KIP

4. Let me check and come back to you on this one. I have a vague memory
this wasn't as easy to calculate, but if it is, I will include
RemoteDeleteBytesPerSec as well.

7. Yes, I believe this is a better explanation than the one I have in the
KIP, so I will aim to update it with yours. Thank you! LocalDeleteLag
makes sense to me as well; I will aim to include it.

*re: Kamal*

1. I can add this to the KIP, but similar to what I have mentioned earlier,
I believe these are better left to plugin implementations, no?

2. Yeah, this makes sense.

Best,
Christo

On Fri, 10 Nov 2023 at 09:33, Satish Duggana 
wrote:

> Thanks Christo for the KIP and the interesting discussion.
>
> 101. Adding metrics at partition level may increase the cardinality of
> these metrics. We should be cautious of that and see whether they are
> really needed. RLM-related operations are generally not affected by
> specific partition(s); problems are mostly caused by remote storage or
> broker-level issues.
>
> 102. I am not sure whether the records metric is very useful when we
> have other bytes- and segments-related metrics available. If needed,
> record-level information can be derived once we have segments/bytes
> metrics.
>
> 103. Regarding RemoteLogSizeComputationTime, we can add logs for
> debugging purposes to print the required duration while computing size
> instead of generating a metric. If there is any degradation in remote
> log size computation, it will have an effect on RLM task leading to
> remote log copy and delete lags.
>
> RLMM and RSM implementations can always add more metrics for
> observability based on the respective implementations.
>
> 104. What is the purpose of RemoteLogMetadataCount as a metric?
>
> Thanks,
> Satish.
>
> On Fri, 10 Nov 2023 at 04:10, Jorge Esteban Quilcate Otoya
>  wrote:
> >
> > Hi Christo,
> >
> > I'd like to add another suggestion:
> >
> > 7. Adding on to the TS lag formulas, my understanding is that per partition:
> > - RemoteCopyLag: difference between: latest local segment candidate for
> > upload - latest remote segment
> >   - Represents how Remote Log Manager task is handling backlog of
> segments.
> >   - Ideally, this lag is zero -- grows when upload is slower than the
> > increase on candidate segments to upload
> >
> > - RemoteDeleteLag: difference between: latest remote candidate segment to
> > keep based on retention - oldest remote segment
> >   - Represents how many segments Remote Log Manager task is missing to
> > delete at a given point in time
> >   - Ideally, this lag is zero -- grows when retention condition changes
> but
> > RLM task is not able to schedule deletion yet.
> >
> > Is my understanding of 

[jira] [Created] (KAFKA-15824) SubscriptionState's maybeValidatePositionForCurrentLeader should handle partition which isn't subscribed yet

2023-11-14 Thread Mayank Shekhar Narula (Jira)
Mayank Shekhar Narula created KAFKA-15824:
-

 Summary: SubscriptionState's maybeValidatePositionForCurrentLeader 
should handle partition which isn't subscribed yet
 Key: KAFKA-15824
 URL: https://issues.apache.org/jira/browse/KAFKA-15824
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: Mayank Shekhar Narula
Assignee: Mayank Shekhar Narula
 Fix For: 3.7.0, 3.6.1


Right now in the Java client, producer batches back off up to retry.backoff.ms 
(100ms by default). This Jira proposes that the backoff should be skipped if the 
client knows of a newer leader for the partition in a subsequent retry (typically 
through a refresh of partition metadata via the Metadata RPC). This would help 
improve the latency of the produce request around when partition leadership changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-14 Thread Lucas Brutschy
Hi Alieh,

I agree with Bruno that a weaker guarantee could be useful already,
and it's anyway possible to strengthen the guarantees later on. On the
other hand, it would be nice to provide a consistent view across all
segments if it doesn't come with major downsides, because until now IQ
does provide a consistent view (via iterators), and this would be the
first feature that diverges from that guarantee.

I think a consistent view could be achieved relatively easily by creating a
snapshot (https://github.com/facebook/rocksdb/wiki/Snapshot) that will
ensure a consistent view on a single DB and using
`ReadOptions.setSnapshot` for the gets. Since a versioned state store's
segments seem to be backed by a single RocksDB instance (that was
unclear to me during our earlier discussion), a single snapshot should
be enough to guarantee a consistent view - we just need to make sure
to clean up the snapshots after use, similar to iterators. If we
instead need a consistent view across multiple RocksDB instances, we
may have to acquire a write lock on all of those (could use the
current object monitors of our `RocksDB` interface) for the duration
of creating snapshots across all instances - which I think would also
be permissible performance-wise. Snapshots are just a sequence number
and should be pretty lightweight to create (they do, however, have a
downside when it comes to compaction, just like iterators).

With that in mind, I would be in favor of at least exploring the
option of using snapshots for a consistent view here, before dropping
this useful guarantee.
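
For reference, a minimal, self-contained sketch of the snapshot-read pattern
with the RocksDB Java API; the DB path and keys are placeholders and this is
not the Streams store code:

{code}
import org.rocksdb.Options;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Snapshot;

public class SnapshotReadExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/snapshot-demo")) {
            db.put("k".getBytes(), "v1".getBytes());

            // Pin a consistent view: all reads through these ReadOptions see the DB as of this point.
            Snapshot snapshot = db.getSnapshot();
            try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot)) {
                db.put("k".getBytes(), "v2".getBytes()); // write that happens after the snapshot
                byte[] value = db.get(readOptions, "k".getBytes());
                System.out.println(new String(value)); // prints "v1"
            } finally {
                db.releaseSnapshot(snapshot); // snapshots must be released, just like iterators
            }
        }
    }
}
{code}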

Cheers,
Lucas

On Tue, Nov 14, 2023 at 2:20 PM Bruno Cadonna  wrote:
>
> Hi Alieh,
>
> Regarding the semantics/guarantees of the query type:
>
> Do we need a snapshot semantic or can we specify a weaker but still
> useful semantic?
>
> An option could be to guarantee that:
> 1. the query returns the latest version when the query arrived at the
> state store
> 2. the query returns a valid history, i.e., versions with adjacent and
> non-overlapping validity intervals.
>
> I think guarantee 1 is quite obvious. Guarantee 2 maybe needs some
> explanation. If we do not avoid writes to the state store during the
> processing of interactive queries, it might for example happen that the
> latest version in the state store moves to data structures that are
> responsible for older versions. In our RocksDB implementation, those are
> the segments. Thus, it could be that during query processing Kafka
> Streams reads the latest value x and encounters again x in a segment
> because a processor put a newer version of x in the versioned state
> store. A similar scenario might also happen to earlier versions. If
> Streams does not account for such cases it could return invalid histories.
>
> Maybe such weaker guarantees are enough for most use cases.
>
> You could consider implementing weaker guarantees as I described and if
> there is demand propose stricter guarantees in a follow-up KIP.
>
> Maybe there are also other simpler guarantees that make sense.
>
> Best,
> Bruno
>
>
> On 11/9/23 12:30 PM, Bruno Cadonna wrote:
> > Hi,
> >
> > Thanks for the updates!
> >
> > First my take on previous comments:
> >
> >
> > 50)
> > I am in favor of deprecating the constructor that does not take the
> > validTo parameter. That implies that I am in favor of modifying get(key,
> > asOf) to set the correct validTo.
> >
> >
> > 60)
> > I am in favor of renaming ValueIterator to VersionedRecordIterator and
> > define it as:
> >
> > public interface VersionedRecordIterator<V> extends
> > Iterator<VersionedRecord<V>>
> >
> > (Matthias, you mixed up ValueAndTimestamp with VersionedRecord in your
> > last e-mail, didn't you? Just double-checking if I understood what you
> > are proposing.)
> >
> >
> > 70)
> > I agree with Matthias that adding a new method on the
> > VersionedKeyValueStore interface defeats the purpose of one of the goals
> > of IQv2, i.e., not to need to extend the state store interface for IQ. I
> > would say if we do not need the method in normal processing, we should
> > not extend the public state store interface. BTW, nobody forces you to
> > StoreQueryUtils. I think that internal utils class was introduced for
> > convenience to leverage existing methods on the state store interface.
> >
> >
> > 80)
> > Why do we limit ourselves by specifying a default order on the result?
> > Different state store implementation might have different strategies to
> > store the records which affects the order in which the records are
> > returned if they are not sorted before they are returned to the user.
> > Some users might not be interested in an order of the result and so
> > there is no reason those users pay the cost for sorting. I propose to
> > not specify a default order but sort the results (if needed) when
> > withDescendingX() and withAscendingX() is specified on the query object.
> >
> >
> > Regarding the snapshot guarantees for the iterators, I need to think a
> > bit more about it. I will come back to this 

[jira] [Created] (KAFKA-15823) NodeToControllerChannelManager: authentication error prevents controller update

2023-11-14 Thread Gaurav Narula (Jira)
Gaurav Narula created KAFKA-15823:
-

 Summary: NodeToControllerChannelManager: authentication error 
prevents controller update
 Key: KAFKA-15823
 URL: https://issues.apache.org/jira/browse/KAFKA-15823
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.5.1, 3.6.0
Reporter: Gaurav Narula


NodeToControllerChannelManager caches the activeController address in an 
AtomicReference which is updated when:
 # activeController [has not been 
set|https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala#L422]
 # networkClient [disconnnects from the 
controller|https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala#L395C7-L395C7]
 # A node replies with 
`[Errors.NOT_CONTROLLER|https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala#L408]`,
 and
 # When a controller changes from [Zk mode to Kraft 
mode|https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala#L325]

 

When running multiple Kafka clusters in a dynamic environment, there is a 
chance that a controller's IP may get reassigned to another cluster's broker 
when the controller is bounced. In this scenario, the requests from Node to the 
Controller may fail with an AuthenticationException and are then retried 
indefinitely. This causes the node to get stuck as the new controller's 
information is never set.

 

A potential fix would be to disconnect the network client and invoke 
`updateControllerAddress(null)` as we do in the `Errors.NOT_CONTROLLER` case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15822) Exception when using DumpLogSegments to dump log

2023-11-14 Thread jiawen (Jira)
jiawen created KAFKA-15822:
--

 Summary: Exception when using DumpLogSegments to dump log
 Key: KAFKA-15822
 URL: https://issues.apache.org/jira/browse/KAFKA-15822
 Project: Kafka
  Issue Type: Bug
Reporter: jiawen
 Attachments: image-20231114214551634.png

Exception in thread "main" 
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'version': java.nio.BufferUnderflowException
    at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:72)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerProtocol.deserializeAssignment(ConsumerProtocol.java:106)
    at 
kafka.tools.DumpLogSegments$OffsetsMessageParser$$anonfun$4.apply(DumpLogSegments.scala:269)
    at 
kafka.tools.DumpLogSegments$OffsetsMessageParser$$anonfun$4.apply(DumpLogSegments.scala:267)
    at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at 
kafka.tools.DumpLogSegments$OffsetsMessageParser.parseGroupMetadata(DumpLogSegments.scala:267)
    at 
kafka.tools.DumpLogSegments$OffsetsMessageParser.parse(DumpLogSegments.scala:295)
    at 
kafka.tools.DumpLogSegments$$anonfun$kafka$tools$DumpLogSegments$$dumpLog$1$$anonfun$apply$3.apply(DumpLogSegments.scala:336)
    at 
kafka.tools.DumpLogSegments$$anonfun$kafka$tools$DumpLogSegments$$dumpLog$1$$anonfun$apply$3.apply(DumpLogSegments.scala:316)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at 
kafka.tools.DumpLogSegments$$anonfun$kafka$tools$DumpLogSegments$$dumpLog$1.apply(DumpLogSegments.scala:316)
    at 
kafka.tools.DumpLogSegments$$anonfun$kafka$tools$DumpLogSegments$$dumpLog$1.apply(DumpLogSegments.scala:314)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at 
kafka.tools.DumpLogSegments$.kafka$tools$DumpLogSegments$$dumpLog(DumpLogSegments.scala:314)
    at 
kafka.tools.DumpLogSegments$$anonfun$main$1.apply(DumpLogSegments.scala:97)
    at 
kafka.tools.DumpLogSegments$$anonfun$main$1.apply(DumpLogSegments.scala:93)
    at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at kafka.tools.DumpLogSegments$.main(DumpLogSegments.scala:93)
    at kafka.tools.DumpLogSegments.main(DumpLogSegments.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-14 Thread Bruno Cadonna

Hi Alieh,

Regarding the semantics/guarantees of the query type:

Do we need a snapshot semantic or can we specify a weaker but still 
useful semantic?


An option could be to guarantee that:
1. the query returns the latest version when the query arrived at the 
state store
2. the query returns a valid history, i.e., versions with adjacent and 
non-overlapping validity intervals.


I think guarantee 1 is quite obvious. Guarantee 2 maybe needs some 
explanation. If we do not avoid writes to the state store during the 
processing of interactive queries, it might for example happen that the 
latest version in the state store moves to data structures that are 
responsible for older versions. In our RocksDB implementation, those are 
the segments. Thus, it could be that during query processing Kafka 
Streams reads the latest value x and encounters again x in a segment 
because a processor put a newer version of x in the versioned state 
store. A similar scenario might also happen to earlier versions. If 
Streams does not account for such cases it could return invalid histories.


Maybe such weaker guarantees are enough for most use cases.

You could consider implementing weaker guarantees as I described and if 
there is demand propose stricter guarantees in a follow-up KIP.


Maybe there are also other simpler guarantees that make sense.

Best,
Bruno


On 11/9/23 12:30 PM, Bruno Cadonna wrote:

Hi,

Thanks for the updates!

First my take on previous comments:


50)
I am in favor of deprecating the constructor that does not take the 
validTo parameter. That implies that I am in favor of modifying get(key, 
asOf) to set the correct validTo.



60)
I am in favor of renaming ValueIterator to VersionedRecordIterator and 
define it as:


public interface VersionedRecordIterator<V> extends 
Iterator<VersionedRecord<V>>


(Matthias, you mixed up ValueAndTimestamp with VersionedRecord in your 
last e-mail, didn't you? Just double-checking if I understood what you 
are proposing.)



70)
I agree with Matthias that adding a new method on the 
VersionedKeyValueStore interface defeats the purpose of one of the goals 
of IQv2, i.e., not to need to extend the state store interface for IQ. I 
would say if we do not need the method in normal processing, we should 
not extend the public state store interface. BTW, nobody forces you to 
StoreQueryUtils. I think that internal utils class was introduced for 
convenience to leverage existing methods on the state store interface.



80)
Why do we limit ourselves by specifying a default order on the result? 
Different state store implementation might have different strategies to 
store the records which affects the order in which the records are 
returned if they are not sorted before they are returned to the user. 
Some users might not be interested in an order of the result and so 
there is no reason those users pay the cost for sorting. I propose to 
not specify a default order but sort the results (if needed) when 
withDescendingX() and withAscendingX() is specified on the query object.



Regarding the snapshot guarantees for the iterators, I need to think a 
bit more about it. I will come back to this thread in the next days.



Best,
Bruno


On 11/8/23 5:30 PM, Alieh Saeedi wrote:

Thank you, Bruno and Matthias, for keeping the discussion going and for
reviewing the PR.

Here are the KIP updates:

    - I removed the `peek()` from the `ValueIterator` interface since we do
      not need it.
    - Yes, Bruno, the `validTo` field in the `VersionedRecord` class is
      exclusive. I updated the javadocs for that.


Here are the critical open questions, listed in descending order of priority.

    - I implemented the `get(key, fromtime, totime)` method here:
      the problem is that this implementation does not guarantee consistency
      because processing might continue interleaved (no snapshot semantics is
      implemented). Moreover, it materializes all results in memory.
       - Solution 1: Use a lock and release it after retrieving all desired
         records from all segments.
          - positive point: snapshot semantics is implemented
          - negative points: 1) It is expensive since iterating over all
            segments may take a long time. 2) It still requires materializing
            results in memory.
       - Solution 2: use `RocksDbIterator`.
          - positive points: 1) It guarantees snapshot semantics. 2) It does
            not require materializing results in memory.
          - negative points: it is expensive because, anyway, we need to
            iterate over all (many) segments.

      Do you have any thoughts on this issue? (ref: Matthias's comment)

    - I added the field `validTo` in 

Re: [DISCUSS] Road to Kafka 4.0

2023-11-14 Thread Anton Agestam
Hi Colin,

Thank you for your thoughtful and comprehensive response.

> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
several KIPs that happened this year and last year. The most notable was
probably KIP-866, which was approved in May 2022.

I understand this is the case, I'm raising my concern because I was
foreseeing some major pain points as a consequence of this decision. Just
to make it clear though: I am not asking for anyone to do work for me, and
I understand the limitations of resources available to implement features.
What I was asking is rather to consider the implications of _removing_
features before there exists a replacement for them.

I understand that the timeframe for 3.7 isn't feasible, and because of that
I think what I was asking is rather: can we make sure that there are more
3.x releases until controller quorum online resizing is implemented?

From your response, I gather that your stance is that it's important to
drop ZK support sooner rather than later and that the necessary pieces for
doing so are already in place.

---

I want to make sure I've understood your suggested sequence for controller
node replacement. I hope the mentions of Kubernetes are examples of how to
carry things out, rather than saying "this is only supported on
Kubernetes"?

Given we have three existing nodes as such:

- a.local -> 192.168.0.100
- b.local -> 192.168.0.101
- c.local -> 192.168.0.102

As well as a candidate node 192.168.0.103 that we want to take over the
role of c.local.

1. Shut down controller process on node .102 (to make sure we don't "go
back in time").
2. rsync state from leader to .103.
3. Start controller process on .103.
4. Point the c.local entry at .103.

I have a few questions about this sequence:

1. Would this sequence be safe against leadership changes?
2. Does it work
3. By "state", do we mean `metadata.log.dir`? Something else?
4. What are the effects on cluster availability? (I think this is the same
as asking what happens if a or b crashes during the process, or if network
partitions occur).

---

If this is considered the official way of handling controller node
replacements, does it make sense to improve documentation in this area? Is
there already a plan for this documentation layed out in some KIPs? This is
something I'd be happy to contribute to.

> To circle back to KIP-853, I think it stands a good chance of making it
into AK 4.0.

This sounds good, but the point I was making was whether we could have a
release with both KRaft and ZK supporting this feature, to ease the
migration out of ZK.

BR,
Anton

Den tors 9 nov. 2023 kl 23:04 skrev Colin McCabe :

> Hi Anton,
>
> It rarely makes sense to scale up and down the number of controller nodes
> in the cluster. Only one controller node will be active at any given time.
> The main reason to use 5 nodes would be to be able to tolerate 2 failures
> instead of 1.
>
> At Confluent, we generally run KRaft with 3 controllers. We have not seen
> problems with this setup, even with thousands of clusters. We have
> discussed using 5 node controller clusters on certain very big clusters,
> but we haven't done that yet. This is all very similar to ZK, where most
> deployments were 3 nodes as well.
>
> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
> several KIPs that happened this year and last year. The most notable was
> probably KIP-866, which was approved in May 2022.
>
> Many users these days run in a Kubernetes environment where Kubernetes
> actually controls the DNS. This makes changing the set of voters less
> important than it was historically.
>
> For example, in a world with static DNS, you might have to change the
> controller.quorum.voters setting from:
>
> 100@a.local:9073,101@b.local:9073,102@c.local:9073
>
> to:
>
> 100@a.local:9073,101@b.local:9073,102@d.local:9073
>
> In a world with k8s controlling the DNS, you simply remap c.local to point
> ot the IP address of your new pod for controller 102, and you're done. No
> need to update controller.quorum.voters.
>
> Another question is whether you re-create the pod data from scratch every
> time you add a new node. If you store the controller data on an EBS volume
> (or cloud-specific equivalent), you really only have to detach it from the
> previous pod and re-attach it to the new pod. k8s also handles this
> automatically, of course.
>
> If you want to reconstruct the full controller pod state each time you
> create a new pod (for example, so that you can use only instance storage),
> you should be able to rsync that state from the leader. In general, the
> invariant that we want to maintain is that the state should not "go back in
> time" -- if controller 102 promised to hold all log data up to offset X, it
> should come back with committed data at at least that offset.
>
> There are lots of new features we'd like to implement for KRaft, and Kafka
> in general. If you have some you really would like to see, I 

[jira] [Resolved] (KAFKA-15658) Zookeeper.jar | CVE-2023-44981

2023-11-14 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya resolved KAFKA-15658.
--
Resolution: Fixed

> Zookeeper.jar | CVE-2023-44981 
> ---
>
> Key: KAFKA-15658
> URL: https://issues.apache.org/jira/browse/KAFKA-15658
> Project: Kafka
>  Issue Type: Bug
>Reporter: masood
>Priority: Critical
> Fix For: 3.7.0, 3.6.1
>
>
> The 
> [CVE-2023-44981|https://www.mend.io/vulnerability-database/CVE-2023-44981]  
> vulnerability has been reported in the zookeeper.jar. 
> It's worth noting that the latest version of Kafka has a dependency on 
> version 3.8.2 of Zookeeper, which is also impacted by this vulnerability. 
> [https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.8.2|https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.8.2.]
> Could you please verify its impact on Kafka.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Important Security Notice for Apache Kafka Users

2023-11-14 Thread Divij Vaidya
Dear Apache Kafka Users,

We want to bring to your attention a security vulnerability affecting all
released versions of Apache Kafka that have a dependency on Zookeeper. The
vulnerability, identified as CVE-2023-44981 [1], specifically impacts users
utilizing SASL Quorum Peer authentication in Zookeeper.

Vulnerability Details:
- Affected Versions: All released versions of Apache Kafka with Zookeeper
dependency.
- CVE Identifier: CVE-2023-44981 [1]
- Impact: Limited to users employing SASL Quorum Peer authentication in
Zookeeper (quorum.auth.enableSasl=true)

Action Required:
Upcoming Apache Kafka versions, 3.6.1 (release date - tentative Dec '23)
and 3.7.0 (release date - Jan '24 [3]), will depend on Zookeeper versions
containing fixes for the vulnerability. In the interim, we highly advise
taking proactive steps to safeguard Zookeeper ensemble election/quorum
communication by implementing a firewall [2].

Future Updates:
We are diligently working on addressing this vulnerability in our upcoming
releases. We will keep you updated on any changes to our recommendations
and promptly inform you of the release dates for Apache Kafka versions
3.6.1 and 3.7.0.

If you have any further questions regarding this, please don't hesitate to
reach out to us at secur...@kafka.apache.org or post a comment at
https://issues.apache.org/jira/browse/KAFKA-15658

Best Regards,

Divij Vaidya
On behalf of Apache Kafka PMC

[1] https://zookeeper.apache.org/security.html#CVE-2023-44981
[2] https://lists.apache.org/thread/wf0yrk84dg1942z1o74kd8nycg6pgm5b
[3] https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.7.0


--


Re: [VOTE] KIP-978: Allow dynamic reloading of certificates with different DN / SANs

2023-11-14 Thread Jakub Scholz
Hi all,

Closing this vote as it has enough votes and was open for more than two
weeks.

KIP-978 has passed the vote with:
+3 binding votes
+1 non-binding vote

Thanks everyone for your votes. I will update the KIP page and open a PR
targeting Kafka 3.7.0.

Jakub

On Tue, Oct 24, 2023 at 10:07 PM Jakub Scholz  wrote:

> Hi all,
>
> I would like to start a vote for the KIP-978: Allow dynamic reloading of
> certificates with different DN / SANs
> 
> .
>
> Thanks & Regards
> Jakub
>


[jira] [Reopened] (KAFKA-15376) Explore options of removing data earlier to the current leader's leader epoch lineage for topics enabled with tiered storage.

2023-11-14 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya reopened KAFKA-15376:
--
  Assignee: (was: Satish Duggana)

> Explore options of removing data earlier to the current leader's leader epoch 
> lineage for topics enabled with tiered storage.
> -
>
> Key: KAFKA-15376
> URL: https://issues.apache.org/jira/browse/KAFKA-15376
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
> Fix For: 3.7.0
>
>
> Followup on the discussion thread:
> [https://github.com/apache/kafka/pull/13561#discussion_r1288778006]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1002: Fetch remote segment indexes at once

2023-11-14 Thread Divij Vaidya
> Offset and Transaction indexes are probably the only ones that make sense
> to cache as they are used on every fetch.

I do not think (correct me if I am wrong) that the transaction index is
used on every fetch. It is only used when consumers want to include aborted
transactions [1] i.e. when they use "read_committed" isolation level. Also,
note that in such a situation, we retrieve the transaction index possibly
for all log segments past the fetchOffset until the end offset (or until
LSO) on every fetch [2]. Hence, fetching the transaction index for the first
segments efficiently is nice, but it is not going to make any major
difference in overall latency, since the overall latency will be dominated
by sequential calls to the RSM to fetch the trx index for the other segments.

IMO the best path forward is to implement an "intelligent index fetch from
remote" which determines what index to fetch and how much of those indices
to fetch based on signals such as fetch request args. For example, if
read_committed isolation level is required, we can fetch multiple trx
indices in parallel instead of sequentially (as is done today). We can also
choose to perform parallel fetch for time index and offset index. But this
approach assumes that RSM can support parallel fetches and they are not
expensive, which might not be true depending on the plugin. That is why I
think it's best if we leave it up to the RSM to determine how much and which
index to fetch based on heuristics.

[1]
https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L1310

[2]
https://github.com/apache/kafka/blob/832627fc78484fdc7c8d6da8a2d20e7691dbf882/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L1358
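
As a rough sketch of the parallel fetch idea above, assuming the RSM
implementation tolerates concurrent calls; the class and method names here are
hypothetical, while RemoteStorageManager#fetchIndex and the IndexType values
are the existing plugin API:

{code}
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;
import org.apache.kafka.server.log.remote.storage.RemoteStorageManager;
import org.apache.kafka.server.log.remote.storage.RemoteStorageManager.IndexType;

public final class ParallelIndexFetch {

    // Fetch the offset and transaction indexes of one remote segment concurrently,
    // instead of issuing one RSM round trip after the other.
    static Map<IndexType, InputStream> fetchOffsetAndTxnIndex(RemoteStorageManager rsm,
                                                              RemoteLogSegmentMetadata segment) {
        CompletableFuture<InputStream> offsetIndex =
                CompletableFuture.supplyAsync(() -> fetch(rsm, segment, IndexType.OFFSET));
        CompletableFuture<InputStream> txnIndex =
                CompletableFuture.supplyAsync(() -> fetch(rsm, segment, IndexType.TRANSACTION));
        // join() waits for both downloads; failures surface as CompletionException.
        return Map.of(IndexType.OFFSET, offsetIndex.join(),
                      IndexType.TRANSACTION, txnIndex.join());
    }

    private static InputStream fetch(RemoteStorageManager rsm,
                                     RemoteLogSegmentMetadata segment,
                                     IndexType type) {
        try {
            return rsm.fetchIndex(segment, type);
        } catch (RemoteStorageException e) {
            throw new RuntimeException(e);
        }
    }
}
{code}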


--
Divij Vaidya



On Tue, Nov 14, 2023 at 8:30 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Divij, thanks for your prompt feedback!
>
> 1. Agree, caching at the plugin level was my initial idea as well; though,
> keeping two caches for the same data both at the broker and at the plugin
> seems wasteful. (added this as a rejected alternative in the meantime)
>
> 2. Not necessarily. The API allows requesting a set of indexes. In the
> case of the `RemoteIndexCache`, as it's currently implemented, it would be
> using: [offset, time, transaction] index types.
>
> However, I see your point that there may be scenarios where only 1 of the 3
> indexes are used:
> - Time index used mostly once when fetching sequentially by seeking offset
> by time.
> - Offset and Transaction indexes are probably the only ones that make sense
> to cache as they are used on every fetch.
> Arguably, Transaction indexes are not as common, reducing the benefits of
> the proposed approach:
> from initially expecting to fetch 3 indexes at once, to potentially
> fetching only 2 (offset, txn), but most probably fetching 1 (offset).
>
> If there's value perceived from fetching Offset and Transaction together,
> we can keep discussing this KIP. In the meantime, I will look into the
> approach to lazily fetch indexes while waiting for additional feedback.
>
> Cheers,
> Jorge.
>
> On Mon, 13 Nov 2023 at 16:51, Divij Vaidya 
> wrote:
>
> > Hi Jorge
> >
> > 1. I don't think we need a new API here because alternatives solutions
> > exist even with the current API. As an example, when the first index is
> > fetched, the RSM plugin can choose to download all indexes and cache it
> > locally. On the next call to fetch an index from the remote tier, we will
> > hit the cache and retrieve the index from there.
> >
> > 2. The KIP assumes that all indexes are required at all times. However,
> > indexes such as transaction indexes are only required for read_committed
> > fetches and time index is only required when a fetch call wants to search
> > offset by timestamp. As a future step in Tiered Storage, I would actually
> > prefer to move towards a direction where we are lazily fetching indexes
> > on-demand instead of fetching them together as proposed in the KIP.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Fri, Nov 10, 2023 at 4:00 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Hello everyone,
> > >
> > > I would like to start the discussion on a KIP for Tiered Storage. It's
> > > about improving cross-segment latencies by reducing calls to fetch
> > indexes
> > > individually.
> > > Have a look:
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1002%3A+Fetch+remote+segment+indexes+at+once
> > >
> > > Cheers,
> > > Jorge
> > >
> >
>


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2381

2023-11-14 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 430965 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > 
shouldRespectPunctuationDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldNotFlushOnException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldNotFlushOnException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > 
shouldRespectProcessingDisabledByTaskExecutionMetadata() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > 
shouldRespectProcessingDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldLockAnEmptySetOfTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldLockAnEmptySetOfTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnAdding() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnAdding() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldShutdownTaskExecutors() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldShutdownTaskExecutors() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > 
shouldNotAssignTasksForPunctuationIfPunctuationDisabled() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > 
shouldNotAssignTasksForPunctuationIfPunctuationDisabled() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnSignalProcessableTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnSignalProcessableTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldAddTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldAddTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnUnassignment() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldReturnFromAwaitOnUnassignment() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotAssignAnyLockedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotAssignAnyLockedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldRemoveTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldRemoveTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotRemoveAssignedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskManagerTest > shouldNotRemoveAssignedTask() PASSED

Gradle Test Run :streams:test > Gradle 

[jira] [Resolved] (KAFKA-15376) Explore options of removing data earlier to the current leader's leader epoch lineage for topics enabled with tiered storage.

2023-11-14 Thread Kamal Chandraprakash (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Chandraprakash resolved KAFKA-15376.
--
Resolution: Fixed

This task was already addressed in the code, so closing the ticket:

https://sourcegraph.com/github.com/apache/kafka@3.6/-/blob/core/src/main/java/kafka/log/remote/RemoteLogManager.java?L1043-1061

> Explore options of removing data earlier to the current leader's leader epoch 
> lineage for topics enabled with tiered storage.
> -
>
> Key: KAFKA-15376
> URL: https://issues.apache.org/jira/browse/KAFKA-15376
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Kamal Chandraprakash
>Priority: Major
> Fix For: 3.7.0
>
>
> Followup on the discussion thread:
> [https://github.com/apache/kafka/pull/13561#discussion_r1288778006]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-892: Transactional StateStores

2023-11-14 Thread Bruno Cadonna

Hi Nick!

Thanks a lot for the KIP!

Looking forward to the implementation!

+1 (binding)

Best,
Bruno

On 11/14/23 2:23 AM, Sophie Blee-Goldman wrote:

+1 (binding)

Thanks a lot for this KIP!

On Mon, Nov 13, 2023 at 8:39 AM Lucas Brutschy
 wrote:


Hi Nick,

really happy with the final KIP. Thanks a lot for the hard work!

+1 (binding)

Lucas

On Mon, Nov 13, 2023 at 4:20 PM Colt McNealy  wrote:


+1 (non-binding).

Thank you, Nick, for making all of the changes (especially around the
`default.state.isolation.level` config).

Colt McNealy

*Founder, LittleHorse.dev*


On Mon, Nov 13, 2023 at 7:15 AM Nick Telford 

wrote:



Hi everyone,

I'd like to call a vote for KIP-892: Transactional StateStores[1],

which

makes Kafka Streams StateStores transactional under EOS.

Regards,

Nick

1:



https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores








[Question] About Kafka producer design decision making

2023-11-14 Thread Sean Sin
Dear Apache Kafka Developers,

I'm a software engineer with four years of experience in South Korea.
I have some questions about the Kafka Producer API.

*Why Use "Future" and Not "CompletableFuture"?*

In the case of "Future", blocking occurs when calling "*get()*", so I
thought "Computable Future" would be better when doing more asynchronous
operations.

I looked at the Java API document

based on the latest version, version 3.6.x.

If you look at that version, you can see that the Future object provides
the "toCompletionStage() "method, which can convert "KafkaFuture" to
"ComputableFuture".

In response to this, I think that in the initial design decision process,
we considered compatibility issues under JDK 1.8 and the level of knowledge
of the learning curve or developer when introducing ComputableFuture, but I
wonder if this is correct.

In addition, I wonder whether using the "toCompletionStage()" method is the
recommended way to get non-blocking behavior when JDK 1.8 or higher can be assumed.
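
For reference, a small sketch of how the callback-based send() can already be
bridged to a CompletableFuture without blocking on get(); the broker address,
topic name, and class name are placeholders:

{code}
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Complete a CompletableFuture from the producer callback instead of calling get().
            CompletableFuture<RecordMetadata> result = new CompletableFuture<>();
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    result.completeExceptionally(exception);
                } else {
                    result.complete(metadata);
                }
            });
            result.thenAccept(m -> System.out.println("Appended at offset " + m.offset()));
            producer.flush(); // ensure the callback has a chance to fire before close
        }
    }
}
{code}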

Thanks.
Su-Ung Shin.