[jira] [Resolved] (KAFKA-16168) Implement GroupCoordinator.onPartitionsDeleted

2024-02-01 Thread David Jacot (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Jacot resolved KAFKA-16168.
---------------------------------
Fix Version/s: 3.8.0
   Resolution: Fixed

> Implement GroupCoordinator.onPartitionsDeleted
> ----------------------------------------------
>
> Key: KAFKA-16168
> URL: https://issues.apache.org/jira/browse/KAFKA-16168
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: David Jacot
>Assignee: David Jacot
>Priority: Major
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16189) Extend admin to support ConsumerGroupDescribe API

2024-02-01 Thread David Jacot (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Jacot resolved KAFKA-16189.
---------------------------------
Fix Version/s: 3.8.0
   Resolution: Fixed

> Extend admin to support ConsumerGroupDescribe API
> -------------------------------------------------
>
> Key: KAFKA-16189
> URL: https://issues.apache.org/jira/browse/KAFKA-16189
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: David Jacot
>Assignee: David Jacot
>Priority: Major
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2609

2024-02-01 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16215) Consumer does not rejoin after fenced on delayed revocation

2024-02-01 Thread Lianet Magrans (Jira)
Lianet Magrans created KAFKA-16215:
-----------------------------------

 Summary: Consumer does not rejoin after fenced on delayed 
revocation
 Key: KAFKA-16215
 URL: https://issues.apache.org/jira/browse/KAFKA-16215
 Project: Kafka
  Issue Type: Sub-task
  Components: clients, consumer
Reporter: Lianet Magrans
Assignee: Lianet Magrans


A consumer has partitions assigned for topic T1, and T1 gets deleted. The
consumer gets stuck attempting to commit offsets, so the reconciliation does
not complete. The consumer then gets fenced, but it does not attempt to rejoin
as it should.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-956: Tiered Storage Quotas

2024-02-01 Thread Francois Visconte
Hi,

I see that the ticket has been left untouched for a while now.
Should it be included in tiered storage v1?
We've observed that the lack of a way to throttle uploads to tiered storage
has a major impact on producers and consumers when tiered storage access
recovers (starving disk IOPS/throughput or CPU).
For this reason, I think this is an important feature and possibly worth
including in v1.
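
To make the discussion concrete: a token-bucket limiter is the kind of
mechanism such a write quota could use. Below is a minimal sketch (the class
name and its placement are invented for illustration; the KIP does not
prescribe this design):

```java
// Minimal token-bucket sketch for capping remote-segment upload throughput.
// Illustrative only: all names here are assumptions, not the KIP's design.
public final class UploadRateLimiter {
    private final long bytesPerSecond; // e.g. the proposed write quota value
    private double availableBytes;
    private long lastRefillNanos;

    public UploadRateLimiter(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Blocks until {@code bytes} fits in the budget; assumes bytes <= bytesPerSecond. */
    public synchronized void acquire(long bytes) throws InterruptedException {
        while (true) {
            long now = System.nanoTime();
            // Refill proportionally to elapsed time, capped at one second of burst.
            availableBytes = Math.min(bytesPerSecond,
                    availableBytes + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
            lastRefillNanos = now;
            if (availableBytes >= bytes) {
                availableBytes -= bytes;
                return;
            }
            long waitMs = (long) Math.ceil((bytes - availableBytes) * 1000.0 / bytesPerSecond);
            wait(Math.max(1L, waitMs));
        }
    }
}
```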

Regards,


On Tue, Dec 5, 2023 at 8:43 PM Jun Rao  wrote:

> Hi, Abhijeet,
>
> Thanks for the KIP. A few comments.
>
> 10. remote.log.manager.write.quota.default:
> 10.1 For other configs, we
> use replica.alter.log.dirs.io.max.bytes.per.second. To be consistent,
> perhaps this can be something like
> remote.log.manager.write.max.bytes.per.second.
> 10.2 Could we list the new metrics associated with the new quota?
> 10.3 Is this dynamically configurable? If so, could we document the impact
> to tools like kafka-configs.sh and AdminClient?
>
> Jun
>
> On Tue, Nov 28, 2023 at 2:19 AM Luke Chen  wrote:
>
> > Hi Abhijeet,
> >
> > Thanks for the KIP!
> > This is an important feature for tiered storage.
> >
> > Some comments:
> > 1. Will we introduce new metrics for these tiered storage quotas?
> > This is important because it lets the admin check the throttling status
> > from metrics while remote writes/reads are slow, e.g. the upload/read
> > byte rate and the throttled time for uploads/reads, etc.
> >
> > 2. Could you give some examples for the throttling algorithm in the KIP
> to
> > explain it? That will make it much clearer.
> >
> > 3. To solve this problem, we can break down the RLMTask into two smaller
> > tasks - one for segment upload and the other for handling expired
> segments.
> > How do we handle the situation when a segment is still waiting for
> > offloading while this segment is expired and eligible to be deleted?
> > Maybe it'll be easier not to block the RLMTask when the quota is
> > exceeded, and just check it each time the RLMTask runs?
> >
> > Thank you.
> > Luke
> >
> > On Wed, Nov 22, 2023 at 6:27 PM Abhijeet Kumar <
> abhijeet.cse@gmail.com
> > >
> > wrote:
> >
> > > Hi All,
> > >
> > > I have created KIP-956 for defining read and write quota for tiered
> > > storage.
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas
> > >
> > > Feedback and suggestions are welcome.
> > >
> > > Regards,
> > > Abhijeet.
> > >
> >
>


Re: Kafka-Streams-Scala for Scala 3

2024-02-01 Thread Lucas Brutschy
Hi Matthiases,

I know Scala 2 fairly well, so I'd be happy to review changes that add
Scala 3 support. However, as Matthias S. said, it has to be driven by
people who use Scala day-to-day, since I believe most Kafka Streams
committers are working with Java.

Rewriting the tests to not use EmbeddedKafkaCluster seems like a large
undertaking, so option 1 is the first thing we should explore.

I don't have any experience with Scala 3 migration topics, but on the
Scala website it says
> The first piece of good news is that the Scala 3 compiler is able to read the 
> Scala 2.13 Pickle format and thus it can type check code that depends on 
> modules or libraries compiled with Scala 2.13.
> One notable example is the Scala 2.13 library. We have indeed decided that 
> the Scala 2.13 library is the official standard library for Scala 3.
So wouldn't that mean that we are safe in terms of standard library
upgrades if we use core_2.13 in the tests?

Cheers,
Lucas


On Wed, Jan 31, 2024 at 9:20 PM Matthias J. Sax  wrote:
>
> Thanks for raising this. The `kafka-streams-scala` module seems to be an
> important feature for Kafka Streams and I am generally in favor of your
> proposal to add Scala 3 support. However, I am personally no Scala
> person and it sounds like quite some overhead.
>
> If you are willing to drive and own this initiative, I'm happy to support
> you to the extent I can.
>
> About the concrete proposal: my understanding is that :core will move
> off Scala long-term (not 100% sure what the timeline is, but new modules
> are written in Java only). Thus, down the road the compatibility issue
> would go away naturally, but it's unclear when.
>
> Thus, if we can test kafka-streams-scala_3 with core_2.13, it seems we
> could add support for Scala 3 now, taking the risk that it might break in
> the future, assuming that the migration off Scala in core is not fast enough.
>
> For proposal (2), I don't think that it would be easily possible for
> unit/integration tests. We could fall back to system tests though, but
> they would be much more heavy weight of course.
>
> Might be good to hear from others. We might actually also want to do a
> KIP for this?
>
>
> -Matthias
>
> On 1/20/24 10:34 AM, Matthias Berndt wrote:
> > Hey there,
> >
> > I'd like to discuss a Scala 3 port of the kafka-streams-scala library.
> > Currently, the build system is set up such that kafka-streams-scala
> > and core (i. e. kafka itself) are compiled with the same Scala
> > compiler versions. This is not an optimal situation because it means
> > that a Scala 3 release of kafka-streams-scala cannot happen
> > independently of kafka itself. I think this should be changed.
> >
> > The production codebase of kafka-streams-scala actually compiles just
> > fine on Scala 3.3.1 with two lines of trivial syntax changes. The
> > problem is with the tests. These use the `EmbeddedKafkaCluster` class,
> > which means that kafka is pulled into the classpath, potentially
> > leading to binary compatibility issues.
> > I can see several approaches to fixing this:
> >
> > 1. Run the kafka-streams-scala tests using the compatible version of
> > :core if one is available. Currently, this means that everything can
> > be tested (test kafka-streams-scala_2.12 using core_2.12,
> > kafka-streams-scala_2.13 using core_2.13 and kafka-streams-scala_3
> > using core_2.13, as these should be compatible), but when a new
> > scala-library version is released that is no longer compatible with
> > 2.13, we won't be able to test that.
> > 2. Rewrite the tests to run without EmbeddedKafkaCluster, instead
> > running the test cluster in a separate JVM or perhaps even a
> > container.
> >
> > I'd be willing to get my hands dirty working on this, but before I
> > start I'd like to get some feedback from the Kafka team regarding the
> > approaches outlined above.
> >
> > All the best
> > Matthias Berndt


Re: ZK vs KRaft benchmarking - latency differences?

2024-02-01 Thread Doğuşcan Namal
Hi Paul,

I did some benchmarking as well and couldn't find a meaningful difference
between KRaft and ZooKeeper in end-to-end latency from producers to
consumers. I tested it on Kafka version 3.5.1 and used openmessaging's
benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .

What I noticed was that if you run the tests long enough (60 mins), the
throughput eventually converges to the same value. I also noticed some
differences in p99+ latencies between ZooKeeper and KRaft clusters, but the
results were not consistent across repeated runs.

Which version did you run the tests on, and what are your findings?

On Wed, 31 Jan 2024 at 22:57, Brebner, Paul 
wrote:

> Hi all,
>
> We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> found no difference in throughput (which we believed is also what theory
> predicted, as ZK/KRaft are only involved in Kafka metadata operations, not
> data workloads).
>
> BUT – the latest tests reveal improved producer and consumer latency for
> KRaft compared to ZooKeeper. So I just wanted to check whether KRaft is
> actually involved in any aspect of write/read workloads? For example, some
> documentation (possibly old) suggests that consumer offsets are stored in
> metadata? In which case this could explain the better KRaft latencies. But
> if not, then I’m curious to understand the difference (and whether it’s
> documented anywhere?)
>
> Also if anyone has noticed the same regarding latency in benchmarks.
>
> Regards, Paul Brebner
>


Re: [DISCUSS] KIP-853: KRaft Controller Membership Changes

2024-02-01 Thread Jack Vanlightly
Hi Jose,

I have a question about how voters and observers, which are far behind the
leader, catch-up when there are multiple reconfiguration commands in the
log between their position and the end of the log.

Here are some example situations that need clarification:

Example 1
Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline for a
while. In the meantime, r1 dies and gets replaced with r4, and r2 dies
getting replaced with r5. Now the cluster is formed of r3, r4, r5. When r3
comes back online, it tries to fetch from dead nodes and finally starts
unending leader elections - stuck because it doesn't realise it's in a
stale configuration whose members are all dead except for itself.

Example 2
Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline then
comes back and discovers the leader is r1. Again, there are many
reconfiguration commands between its LEO and the end of the leader's log.
It starts fetching, changing configurations as it goes until it reaches a
stale configuration (r3, r4, r5) where it is a member but none of its peers
are actually alive anymore. It continues to fetch from r1, but then for
some reason the connection to r1 is interrupted. r3 starts leader elections
which get no responses.

Example 3
Imagine a cluster of three voters: r1, r2, r3. Over time, many
reconfigurations have happened and now the voters are (r4, r5, r6). The
observer o1 starts fetching from the nodes in
'controller.quorum.bootstrap.servers' which includes r4. r4 responds with a
NotLeader and that r5 is the leader. o1 starts fetching and goes through
the motions of switching to each configuration as it learns of it in the
log. The connection to r5 gets interrupted while it is in the configuration
(r7, r8, r9). It attempts to fetch from these voters, but none respond, as
they are all long dead given this is a stale configuration. Does the
observer fall back to 'controller.quorum.bootstrap.servers' for its list of
voters it can fetch from?

After thinking it through, it occurs to me that in examples 1 and 2, the
leader (of the latest configuration) should be sending BeginQuorumEpoch
requests to r3 after a certain timeout? r3 can start elections (based on
its stale configuration) which will go nowhere, until it eventually
receives a BeginQuorumEpoch from the leader, at which point it will learn of
the leader and resume fetching.

In the case of an observer, I suppose it must fall back to
'controller.quorum.voters' or 'controller.quorum.bootstrap.servers' to
learn of the leader?
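
For reference, the two discovery mechanisms being contrasted here look
roughly like this in a controller config (IDs and hostnames are illustrative):

```
# Static voter set (pre-KIP-853)
controller.quorum.voters=4@r4:9093,5@r5:9093,6@r6:9093

# Dynamic bootstrap endpoints proposed by KIP-853
controller.quorum.bootstrap.servers=r4:9093,r5:9093,r6:9093
```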

Thanks
Jack



On Fri, Jan 26, 2024 at 1:36 AM José Armando García Sancio
 wrote:

> Hi all,
>
> I have updated the KIP to include information on how KRaft controller
> automatic joining will work.
>
> Thanks,
> --
> -José
>


Re: Kafka-Streams-Scala for Scala 3

2024-02-01 Thread Josep Prat
Hi,

For reference, prior work on this:
https://github.com/apache/kafka/pull/11350
https://github.com/apache/kafka/pull/11432

Best,

On Thu, Feb 1, 2024, 15:55 Lucas Brutschy 
wrote:

> Hi Matthiases,
>
> I know Scala 2 fairly well, so I'd be happy to review changes that add
> Scala 3 support. However, as Matthias S. said, it has to be driven by
> people who use Scala day-to-day, since I believe most Kafka Streams
> committers are working with Java.
>
> Rewriting the tests to not use EmbeddedKafkaCluster seems like a large
> undertaking, so option 1 is the first thing we should explore.
>
> I don't have any experience with Scala 3 migration topics, but on the
> Scala website it says
> > The first piece of good news is that the Scala 3 compiler is able to
> read the Scala 2.13 Pickle format and thus it can type check code that
> depends on modules or libraries compiled with Scala 2.13.
> > One notable example is the Scala 2.13 library. We have indeed decided
> that the Scala 2.13 library is the official standard library for Scala 3.
> So wouldn't that mean that we are safe in terms of standard library
> upgrades if we use core_2.13 in the tests?
>
> Cheers,
> Lucas
>
>
> On Wed, Jan 31, 2024 at 9:20 PM Matthias J. Sax  wrote:
> >
> > Thanks for raising this. The `kafka-streams-scala` module seems to be an
> > important feature for Kafka Streams and I am generally in favor of your
> > proposal to add Scala 3 support. However, I am personally no Scala
> > person and it sounds like quite some overhead.
> >
> > If you are willing to drive and own this initiative, I'm happy to
> > support you to the extent I can.
> >
> > About the concrete proposal: my understanding is that :core will move
> > off Scala long-term (not 100% sure what the timeline is, but new modules
> > are written in Java only). Thus, down the road the compatibility issue
> > would go away naturally, but it's unclear when.
> >
> > Thus, if we can test kafka-streams-scala_3 with core_2.13, it seems we
> > could add support for Scala 3 now, taking the risk that it might break in
> > the future, assuming that the migration off Scala in core is not fast
> > enough.
> >
> > For proposal (2), I don't think that it would be easily possible for
> > unit/integration tests. We could fall back to system tests though, but
> > they would be much more heavy weight of course.
> >
> > Might be good to hear from others. We might actually also want to do a
> > KIP for this?
> >
> >
> > -Matthias
> >
> > On 1/20/24 10:34 AM, Matthias Berndt wrote:
> > > Hey there,
> > >
> > > I'd like to discuss a Scala 3 port of the kafka-streams-scala library.
> > > Currently, the build system is set up such that kafka-streams-scala
> > > and core (i. e. kafka itself) are compiled with the same Scala
> > > compiler versions. This is not an optimal situation because it means
> > > that a Scala 3 release of kafka-streams-scala cannot happen
> > > independently of kafka itself. I think this should be changed.
> > >
> > > The production codebase of kafka-streams-scala actually compiles just
> > > fine on Scala 3.3.1 with two lines of trivial syntax changes. The
> > > problem is with the tests. These use the `EmbeddedKafkaCluster` class,
> > > which means that kafka is pulled into the classpath, potentially
> > > leading to binary compatibility issues.
> > > I can see several approaches to fixing this:
> > >
> > > 1. Run the kafka-streams-scala tests using the compatible version of
> > > :core if one is available. Currently, this means that everything can
> > > be tested (test kafka-streams-scala_2.12 using core_2.12,
> > > kafka-streams-scala_2.13 using core_2.13 and kafka-streams-scala_3
> > > using core_2.13, as these should be compatible), but when a new
> > > scala-library version is released that is no longer compatible with
> > > 2.13, we won't be able to test that.
> > > 2. Rewrite the tests to run without EmbeddedKafkaCluster, instead
> > > running the test cluster in a separate JVM or perhaps even a
> > > container.
> > >
> > > I'd be willing to get my hands dirty working on this, but before I
> > > start I'd like to get some feedback from the Kafka team regarding the
> > > approaches outlined above.
> > >
> > > All the best
> > > Matthias Berndt
>
Josep Prat
Open Source Engineering Director, Aiven
josep.p...@aiven.io | +491715557497 | aiven.io
Aiven Deutschland GmbH
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


Re: [DISCUSS] KIP-1018: Introduce max remote fetch timeout config

2024-02-01 Thread Kamal Chandraprakash
Hi Jorge,

Thanks for the review! Added your suggestions to the KIP. PTAL.

The `fetch.max.wait.ms` config will also be applicable to topics enabled
with remote storage.
Updated the description to:

```
The maximum amount of time the server will block before answering the fetch
request when it is reading near the tail of the partition (high-watermark)
and there isn't sufficient data to immediately satisfy the requirement
given by fetch.min.bytes.
```
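
For illustration, here is how that pairs with the standard consumer configs
(a sketch; the values are arbitrary, and the new remote-fetch timeout itself
is a broker-side dynamic config per the KIP):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
// The broker may hold a fetch for up to 500 ms near the high-watermark while
// waiting for fetch.min.bytes of data; remote reads are bounded separately
// by the broker-side timeout this KIP introduces.
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576");
KafkaConsumer<String, String> consumer =
        new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
```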

--
Kamal

On Thu, Feb 1, 2024 at 12:12 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Hi Kamal,
>
> Thanks for this KIP! It should help to solve one of the main issues with
> tiered storage at the moment that is dealing with individual consumer
> configurations to avoid flooding logs with interrupted exceptions.
>
> One of the topics discussed in [1][2] was the semantics of
> `fetch.max.wait.ms` and how it's affected by remote storage. Should we
> consider within this KIP updating the `fetch.max.wait.ms` docs to clarify
> that it only applies to local storage?
>
> Otherwise, LGTM -- looking forward to see this KIP adopted.
>
> [1] https://issues.apache.org/jira/browse/KAFKA-15776
> [2] https://github.com/apache/kafka/pull/14778#issuecomment-1820588080
>
> On Tue, 30 Jan 2024 at 01:01, Kamal Chandraprakash <
> kamal.chandraprak...@gmail.com> wrote:
>
> > Hi all,
> >
> > I have opened a KIP-1018
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1018%3A+Introduce+max+remote+fetch+timeout+config+for+DelayedRemoteFetch+requests
> > >
> > to introduce dynamic max-remote-fetch-timeout broker config to give more
> > control to the operator.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1018%3A+Introduce+max+remote+fetch+timeout+config+for+DelayedRemoteFetch+requests
> >
> > Let me know if you have any feedback or suggestions.
> >
> > --
> > Kamal
> >
>


[jira] [Resolved] (KAFKA-15575) Prevent Connectors from exceeding tasks.max configuration

2024-02-01 Thread Chris Egerton (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Egerton resolved KAFKA-15575.
-----------------------------------
Fix Version/s: 3.8.0
   Resolution: Fixed

> Prevent Connectors from exceeding tasks.max configuration
> ----------------------------------------------------------
>
> Key: KAFKA-15575
> URL: https://issues.apache.org/jira/browse/KAFKA-15575
> Project: Kafka
>  Issue Type: Task
>  Components: connect
>Reporter: Greg Harris
>Assignee: Chris Egerton
>Priority: Minor
>  Labels: kip
> Fix For: 3.8.0
>
>
> The Connector::taskConfigs(int maxTasks) function is used by Connectors to 
> enumerate task configurations. It takes an argument which comes from the 
> tasks.max connector config. This is the Javadoc for that method:
> {noformat}
> /**
>  * Returns a set of configurations for Tasks based on the current 
> configuration,
>  * producing at most {@code maxTasks} configurations.
>  *
>  * @param maxTasks maximum number of configurations to generate
>  * @return configurations for Tasks
>  */
> public abstract List<Map<String, String>> taskConfigs(int maxTasks);
> {noformat}
> This includes the constraint that the number of tasks is at most maxTasks, 
> but this constraint is not enforced by the framework.
>  
> To enforce this constraint, we could begin dropping configs that exceed the 
> limit, and log a warning. For sink connectors this should harmlessly 
> rebalance the consumer subscriptions onto the remaining tasks. For source 
> connectors that distribute their work via task configs, this may result in an 
> interruption in data transfer.
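>
> A sketch of the enforcement described above (a hypothetical helper; the
> name, placement, and logger are invented for illustration and this is not
> the merged change):
> {code:java}
> // Hypothetical framework-side guard, applied where task configs are
> // received from the connector. Assumes an slf4j Logger named `log`.
> private static List<Map<String, String>> enforceTasksMax(
>         List<Map<String, String>> taskConfigs, int maxTasks, String connector) {
>     if (taskConfigs.size() <= maxTasks) {
>         return taskConfigs;
>     }
>     log.warn("Connector {} generated {} task configs, but tasks.max is {}; "
>             + "dropping the excess configs", connector, taskConfigs.size(), maxTasks);
>     return new ArrayList<>(taskConfigs.subList(0, maxTasks));
> }
> {code}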



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2610

2024-02-01 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-15691) Add new system tests to use new consumer

2024-02-01 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True resolved KAFKA-15691.
-------------------------------
Resolution: Duplicate

> Add new system tests to use new consumer
> ----------------------------------------
>
> Key: KAFKA-15691
> URL: https://issues.apache.org/jira/browse/KAFKA-15691
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Major
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2611

2024-02-01 Thread Apache Jenkins Server
See 




Re: ZK vs KRaft benchmarking - latency differences?

2024-02-01 Thread Michael K. Edwards
The interesting numbers are the recovery times after 1) the Kafka broker
currently acting as the "active" controller (or the sole controller in a
ZooKeeper-based deployment) goes away; 2) the Kafka broker currently acting
as the consumer group coordinator for a consumer group with many partitions
and a high commit rate goes away.  Here "goes away" means as ugly a loss
mode as can realistically be simulated in your test environment; I suggest
forcing the to-be-impaired broker into heavy paging by running it inside a
cgroups container and progressively shrinking the memory cgroup.  It's also
fun to force high packet loss using iptables.

If you're serious about testing KRaft's survivability under load, then I
suggest you compare against a ZooKeeper deployment that's relatively
non-broken.  That means setting up a ZooKeeper observer
https://zookeeper.apache.org/doc/current/zookeeperObservers.html local to
each broker.  Personally I'd want to test with a large number of partitions
(840 or 2520 per topic, tens of thousands overall), especially in the
coordinator-failure scenario.  I haven't been following the horizontal
scaling work closely, but I suspect that still means porting forward the
Dropwizard-based metrics patch I wrote years ago.  If I were doing that,
I'd bring the shared dependencies of zookeeper and kafka up to current and
do a custom zookeeper build off of the 3.9.x branch (compare
https://github.com/mkedwards/zookeeper/commit/e608be61a3851c128088d9c9c54871f56aa05012
and consider backporting
https://github.com/apache/zookeeper/commit/5894dc88cce1f4675809fb347cc60d3e0ebf08d4).
Then I'd do https://github.com/mkedwards/kafka/tree/bitpusher-2.3 all over
again, starting from the kafka 3.6.x branch and synchronizing the shared
dependencies.

If you'd like to outsource that work, I'm available on a consulting basis
:D  Seriously, ZooKeeper itself has in my opinion never been the problem,
at least since it got revived after the sad 3.4.1x / 3.5.x-alpha days.
Inadequately resourced and improperly deployed ZooKeeper clusters have been
a problem, as has the use of JMX to do the job of a modern metrics
library.  The KRaft ship has sailed as far as upstream development is
concerned; but if you're not committed to that in your production
environment, there are other ways to scale up and out while retaining
ZooKeeper as your reliable configuration/metadata store.  (It's also
cost-effective and latency-feasible to run a cross-AZ ZooKeeper cluster,
which I would not attempt with Kafka brokers in any kind of large-scale
production setting.)
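
For completeness, the per-broker observer mentioned above uses standard
ZooKeeper configuration along these lines (hostnames are illustrative):

```
# zoo.cfg on the observer node
peerType=observer

# server list shared by all nodes; the observer carries the :observer suffix
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=broker-local-observer:2888:3888:observer
```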

Cheers,
- Michael

On Thu, Feb 1, 2024 at 7:02 AM Doğuşcan Namal 
wrote:

> Hi Paul,
>
> I did some benchmarking as well and couldn't find a meaningful difference
> between KRaft and ZooKeeper in end-to-end latency from producers to
> consumers. I tested it on Kafka version 3.5.1 and used openmessaging's
> benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .
>
> What I noticed was that if you run the tests long enough (60 mins), the
> throughput eventually converges to the same value. I also noticed some
> differences in p99+ latencies between ZooKeeper and KRaft clusters, but
> the results were not consistent across repeated runs.
>
> Which version did you run the tests on, and what are your findings?
>
> On Wed, 31 Jan 2024 at 22:57, Brebner, Paul wrote:
>
> > Hi all,
> >
> > We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> > found no difference in throughput (which we believed is also what theory
> > predicted, as ZK/KRaft are only involved in Kafka metadata operations,
> > not data workloads).
> >
> > BUT – the latest tests reveal improved producer and consumer latency for
> > KRaft compared to ZooKeeper. So I just wanted to check whether KRaft is
> > actually involved in any aspect of write/read workloads? For example,
> > some documentation (possibly old) suggests that consumer offsets are
> > stored in metadata? In which case this could explain the better KRaft
> > latencies. But if not, then I’m curious to understand the difference
> > (and whether it’s documented anywhere?)
> >
> > Also if anyone has noticed the same regarding latency in benchmarks.
> >
> > Regards, Paul Brebner
> >
>


[jira] [Created] (KAFKA-16216) Reduce batch size for initial metadata load during ZK migration

2024-02-01 Thread Colin McCabe (Jira)
Colin McCabe created KAFKA-16216:


 Summary: Reduce batch size for initial metadata load during ZK 
migration
 Key: KAFKA-16216
 URL: https://issues.apache.org/jira/browse/KAFKA-16216
 Project: Kafka
  Issue Type: Bug
Reporter: Colin McCabe
Assignee: David Arthur






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-02-01 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16217:
-------------------------------

 Summary: Transactional producer stuck in IllegalStateException 
during close
 Key: KAFKA-16217
 URL: https://issues.apache.org/jira/browse/KAFKA-16217
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Calvin Liu


The producer gets stuck during close. It keeps retrying to abort the
transaction, but it never succeeds.
{code:java}
[ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] org.apache.kafka.clients.producer.internals.Sender run - [Producer clientId=producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg, transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] Error in kafka producer I/O thread while aborting transaction:
java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` because the previous call to `commitTransaction` timed out and must be retried
    at org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
    at org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
    at java.base/java.lang.Thread.run(Thread.java:1583)
    at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
{code}
With additional logging, I found the root cause. If the producer is already in
a bad transaction state before close is called (in my case,
TransactionManager.pendingTransition was set to commitTransaction and did not
get cleaned up), then the abort attempted during close gets stuck. It is
related to the fix [https://github.com/apache/kafka/pull/13591].
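
A minimal sketch of the client-side sequence that reaches this state (the
bootstrap/topic/id values are illustrative, not taken from the benchmark
setup):
{code:java}
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-id");
KafkaProducer<String, String> producer =
        new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

producer.initTransactions();
producer.beginTransaction();
producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
try {
    producer.commitTransaction();          // times out; pendingTransition stays set
} catch (TimeoutException e) {
    // The contract requires retrying commitTransaction() here. Calling close()
    // instead makes the Sender thread attempt an abort, which throws the
    // IllegalStateException above and leaves close() stuck retrying.
    producer.close(Duration.ofSeconds(5));
}
{code}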
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16216) Reduce batch size for initial metadata load during ZK migration

2024-02-01 Thread Colin McCabe (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin McCabe resolved KAFKA-16216.
----------------------------------
Fix Version/s: 3.7.0
 Reviewer: Colin McCabe
 Assignee: David Arthur  (was: Colin McCabe)
   Resolution: Fixed

> Reduce batch size for initial metadata load during ZK migration
> ---------------------------------------------------------------
>
> Key: KAFKA-16216
> URL: https://issues.apache.org/jira/browse/KAFKA-16216
> Project: Kafka
>  Issue Type: Bug
>Reporter: Colin McCabe
>Assignee: David Arthur
>Priority: Major
> Fix For: 3.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #83

2024-02-01 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2612

2024-02-01 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16218) Partition reassignment can't complete if any target replica is out-of-sync

2024-02-01 Thread Drawxy (Jira)
Drawxy created KAFKA-16218:
---------------------------

 Summary: Partition reassignment can't complete if any target 
replica is out-of-sync
 Key: KAFKA-16218
 URL: https://issues.apache.org/jira/browse/KAFKA-16218
 Project: Kafka
  Issue Type: Bug
Reporter: Drawxy


Assume there are 4 brokers (1001, 2001, 3001, 4001) and a topic partition
_foo-0_ (replicas=[1001,2001,3001], isr=[1001,3001]). Replica 2001 can't catch
up and becomes out-of-sync due to some issue.

If we launch a partition reassignment for _foo-0_ (target replica list
[1001,2001,4001]), the reassignment can't complete even after the added
replica 4001 has caught up. At that point, the partition state would be
replicas=[1001,2001,4001,3001], isr=[1001,3001,4001].

The out-of-sync replica 2001 shouldn't block the partition reassignment from
completing.
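
For reference, the reassignment described above corresponds to a JSON file in
the standard kafka-reassign-partitions.sh format (shown here for
illustration):
{code:json}
{
  "version": 1,
  "partitions": [
    {"topic": "foo", "partition": 0, "replicas": [1001, 2001, 4001]}
  ]
}
{code}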



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #84

2024-02-01 Thread Apache Jenkins Server
See