Hey Michael, thanks for your comments. The first improvement you mentioned,
the faster controller failover, is already known to me. But could you expand
a bit on the second one, the faster consumer group failover? Why do you
think it will be better on KRaft?

As you mentioned, these are improvements to recovery times, so from your
mail I understand you wouldn't expect an improvement in latencies as well.

On Thu, 1 Feb 2024 at 22:53, Michael K. Edwards <m.k.edwa...@gmail.com>
wrote:

> The interesting numbers are the recovery times after 1) the Kafka broker
> currently acting as the "active" controller (or the sole controller in a
> ZooKeeper-based deployment) goes away; 2) the Kafka broker currently acting
> as the consumer group coordinator for a consumer group with many partitions
> and a high commit rate goes away.  Here "goes away" means as ugly a loss
> mode as can realistically be simulated in your test environment; I suggest
> forcing the to-be-impaired broker into heavy paging by running it inside a
> cgroups container and progressively shrinking the memory cgroup.  It's also
> fun to force high packet loss using iptables.
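>
> As a rough sketch of that memory squeeze (assuming cgroup v2 and an
> illustrative slice name; run as root alongside the broker):
>
> import time
>
> # Hypothetical cgroup v2 path for the broker's memory limit
> CGROUP_MAX = "/sys/fs/cgroup/kafka.slice/memory.max"
>
> limit = 8 * 1024**3          # start at 8 GiB
> floor = 512 * 1024**2        # stop squeezing at 512 MiB
> step = 256 * 1024**2         # shrink by 256 MiB per step
>
> while limit > floor:
>     with open(CGROUP_MAX, "w") as f:
>         f.write(str(limit))  # kernel starts reclaiming/paging as this drops
>     limit -= step
>     time.sleep(30)           # give the broker time to degrade visibly
>
> Packet loss can be layered on separately, e.g. with iptables' statistic
> match (--mode random --probability 0.2 -j DROP) or tc netem.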
>
> If you're serious about testing KRaft's survivability under load, then I
> suggest you compare against a ZooKeeper deployment that's relatively
> non-broken.  That means setting up a ZooKeeper observer
> https://zookeeper.apache.org/doc/current/zookeeperObservers.html local to
> each broker.  Personally I'd want to test with a large number of partitions
> (840 or 2520 per topic, tens of thousands overall), especially in the
> coordinator-failure scenario.  I haven't been following the horizontal
> scaling work closely, but I suspect that still means porting forward the
> Dropwizard-based metrics patch I wrote years ago.  If I were doing that,
> I'd bring the shared dependencies of zookeeper and kafka up to current and
> do a custom zookeeper build off of the 3.9.x branch (compare
>
> https://github.com/mkedwards/zookeeper/commit/e608be61a3851c128088d9c9c54871f56aa05012
> and consider backporting
>
> https://github.com/apache/zookeeper/commit/5894dc88cce1f4675809fb347cc60d3e0ebf08d4
> ).
> Then I'd do https://github.com/mkedwards/kafka/tree/bitpusher-2.3 all over
> again, starting from the kafka 3.6.x branch and synchronizing the shared
> dependencies.
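>
> For the partition counts above, a minimal sketch of the topic setup
> (confluent-kafka Python client; broker address and topic name are
> illustrative):
>
> from confluent_kafka.admin import AdminClient, NewTopic
>
> admin = AdminClient({"bootstrap.servers": "broker1:9092"})
>
> # One topic with 2520 partitions; repeat across topics to get into the
> # tens of thousands of partitions overall.
> futures = admin.create_topics(
>     [NewTopic("loadtest-0", num_partitions=2520, replication_factor=3)]
> )
> for topic, fut in futures.items():
>     fut.result()   # raises if creation failed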
>
> If you'd like to outsource that work, I'm available on a consulting basis
> :D  Seriously, ZooKeeper itself has in my opinion never been the problem,
> at least since it got revived after the sad 3.4.1x / 3.5.x-alpha days.
> Inadequately resourced and improperly deployed ZooKeeper clusters have been
> a problem, as has the use of JMX to do the job of a modern metrics
> library.  The KRaft ship has sailed as far as upstream development is
> concerned; but if you're not committed to that in your production
> environment, there are other ways to scale up and out while retaining
> ZooKeeper as your reliable configuration/metadata store.  (It's also
> cost-effective and latency-feasible to run a cross-AZ ZooKeeper cluster,
> which I would not attempt with Kafka brokers in any kind of large-scale
> production setting.)
>
> Cheers,
> - Michael
>
> On Thu, Feb 1, 2024 at 7:02 AM Doğuşcan Namal <namal.dogus...@gmail.com>
> wrote:
>
> > Hi Paul,
> >
> > I did some benchmarking as well and couldn't find a meaningful difference
> > between KRaft and ZooKeeper in end-to-end latency from producers to
> > consumers. I tested on Kafka version 3.5.1 and used the OpenMessaging
> > benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .
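> >
> > As a rough illustration of what I mean by end-to-end latency (a sketch
> > with the confluent-kafka Python client; broker address and topic name
> > are made up), the probe timestamps each record at produce time and
> > measures the delta when it is consumed:
> >
> > import time
> > from confluent_kafka import Producer, Consumer
> >
> > conf = {"bootstrap.servers": "broker1:9092"}
> > producer = Producer(conf)
> > consumer = Consumer({**conf, "group.id": "latency-probe",
> >                      "auto.offset.reset": "latest"})
> > consumer.subscribe(["latency-test"])
> > consumer.poll(5.0)   # warm-up poll so partitions get assigned
> >
> > latencies_ms = []
> > for _ in range(10_000):
> >     producer.produce("latency-test", value=str(time.time_ns()))
> >     producer.flush()
> >     msg = consumer.poll(timeout=5.0)
> >     if msg is not None and not msg.error():
> >         latencies_ms.append((time.time_ns() - int(msg.value())) / 1e6)
> >
> > latencies_ms.sort()
> > print("p99 (ms):", latencies_ms[int(len(latencies_ms) * 0.99)])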
> >
> > What I noticed was that if you run the tests long enough (60 mins), the
> > throughput eventually converges to the same value. I also noticed some
> > differences in p99+ latencies between ZooKeeper and KRaft clusters, but
> > the results were not consistent across repeated runs.
> >
> > Which version did you run the tests on, and what were your findings?
> >
> > On Wed, 31 Jan 2024 at 22:57, Brebner, Paul <paul.breb...@netapp.com
> > .invalid>
> > wrote:
> >
> > > Hi all,
> > >
> > > We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> > > found no difference in throughput (which we believed is also what theory
> > > predicted, as ZK/KRaft are only involved in Kafka metadata operations,
> > > not data workloads).
> > >
> > > BUT – our latest tests reveal improved producer and consumer latency for
> > > KRaft compared with ZooKeeper.  So I just wanted to check whether KRaft
> > > is actually involved in any aspect of write/read workloads? For example,
> > > some documentation (possibly old) suggests that consumer offsets are
> > > stored in metadata? In which case this could explain the better KRaft
> > > latencies. But if not, then I’m curious to understand the difference
> > > (and whether it’s documented anywhere?)
> > >
> > > Also curious whether anyone else has noticed the same latency difference
> > > in their benchmarks.
> > >
> > > Regards, Paul Brebner
> > >
> >
>
