Re: ZK vs KRaft benchmarking - latency differences?

2024-02-02 Thread Michael K. Edwards
Take everything I say with a grain of salt; I haven't set up a new Kafka
cluster from scratch in several years.  And realistically, most users'
needs are better met with a simple deployment model with adequate
performance rather than a heavily tuned system.  My attention was just
drawn to the benchmarking discussion.

I don't have any firsthand data to support the contention that either
controller failover or large consumer group coordinator failover would be
faster under KRaft.  They're just the things that I know to place heavy
burst load on ZooKeeper.  I've had long-term satisfactory results with a
Kafka/ZooKeeper setup that has finely partitioned topics, and to get there,
I used ZK observers.  (Didn't need them to be on the same machine as the
Kafka brokers, but I think next time I'd probably set them up that way —
with the observer's data on its own ext4fs filesystem.)

Consumer group offsets are stored in a compacted Kafka topic
(__consumer_offsets).  There are knobs that need turning to keep that
compaction from lagging.  You want fairly small segments for compacted
topics, and you want space for them preallocated on the filesystem.  I
would recommend XFS for the Kafka segment store rather than any of the
ext*fs family.  And I'd recommend an external journal partition and a
careful set of XFS tunings to get good distribution of write load across
an LVM linear (not striped!) volume.
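
To make that concrete, here's a minimal sketch using the Java AdminClient;
the bootstrap address, the 64 MiB segment size, and the dirty ratio are
illustrative assumptions rather than recommendations, so tune them against
your own commit rate:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class OffsetsTopicTuning {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: adjust to your cluster

        try (Admin admin = Admin.create(props)) {
            ConfigResource offsetsTopic =
                new ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets");

            // Illustrative values: smaller segments roll sooner, so the log cleaner
            // can pick them up sooner; a lower dirty ratio makes cleaning kick in earlier.
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("segment.bytes",
                        String.valueOf(64 * 1024 * 1024)), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("min.cleanable.dirty.ratio", "0.1"),
                        AlterConfigOp.OpType.SET));

            admin.incrementalAlterConfigs(Map.of(offsetsTopic, ops)).all().get();
        }
    }
}

Smaller segments roll sooner and become eligible for cleaning sooner, which
is what keeps compaction of the offsets topic from falling behind under a
heavy commit load.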

Not all of the details are in my head at the moment.  For full details I'd
probably want to refer back to notes that I kept while I was at BitPusher,
setting up the Kafka rig at Nexia.  All I'm really trying to say here is
that, if one wants to know whether KRaft is a win and by how much, one
should take care not to sandbag the alternative setup.  Data and metadata
have different access patterns, and if you're not provisioning accordingly,
you're leaving throughput on the floor.

Cheers,
- Michael

On Fri, Feb 2, 2024, 3:33 AM Doğuşcan Namal 
wrote:

> Hey Michael, thanks for your comments. I think the first of the
> improvements you mentioned, the faster controller failover is a known
> improvement to me. But the second one you suggest is a faster consumer
> group failover, could you open that up a bit for me why do you think it
> will be better on KRaft?
>
> As you mentioned these are improvements on the recovery times, so from your
> mail I understand you wouldn't expect an improvement on latencies as well.
>
> On Thu, 1 Feb 2024 at 22:53, Michael K. Edwards 
> wrote:
>
> > The interesting numbers are the recovery times after 1) the Kafka broker
> > currently acting as the "active" controller (or the sole controller in a
> > ZooKeeper-based deployment) goes away; 2) the Kafka broker currently
> acting
> > as the consumer group coordinator for a consumer group with many
> partitions
> > and a high commit rate goes away.  Here "goes away" means as ugly a loss
> > mode as can realistically be simulated in your test environment; I
> suggest
> > forcing the to-be-impaired broker into heavy paging by running it inside
> a
> > cgroups container and progressively shrinking the memory cgroup.  It's
> also
> > fun to force high packet loss using iptables.
> >
> > If you're serious about testing KRaft's survivability under load, then I
> > suggest you compare against a ZooKeeper deployment that's relatively
> > non-broken.  That means setting up a ZooKeeper observer
> > https://zookeeper.apache.org/doc/current/zookeeperObservers.html local
> to
> > each broker.  Personally I'd want to test with a large number of
> partitions
> > (840 or 2520 per topic, tens of thousands overall), especially in the
> > coordinator-failure scenario.  I haven't been following the horizontal
> > scaling work closely, but I suspect that still means porting forward the
> > Dropwizard-based metrics patch I wrote years ago.  If I were doing that,
> > I'd bring the shared dependencies of zookeeper and kafka up to current
> and
> > do a custom zookeeper build off of the 3.9.x branch (compare
> >
> >
> https://github.com/mkedwards/zookeeper/commit/e608be61a3851c128088d9c9c54871f56aa05012
> > and consider backporting
> >
> >
> https://github.com/apache/zookeeper/commit/5894dc88cce1f4675809fb347cc60d3e0ebf08d4
> > ).
> > Then I'd do https://github.com/mkedwards/kafka/tree/bitpusher-2.3 all
> over
> > again, starting from the kafka 3.6.x branch and synchronizing the shared
> > dependencies.
> >
> > If you'd like to outsource that work, I'm available on a consulting basis
> > :D  Seriously, ZooKeeper itself has in my opinion never been the problem,
> > at least since it got revived after the sad 3.14.1x / 3.5.x-alpha days.
> > Inadequately resourced and improperly deployed ZooKeeper clusters have
> been
> > a problem, as has the use of JMX to do the job of a modern metrics
> > library.  The KRaft ship has sailed as far as upstream development is
> > concerned; but if you're not committed to that in your production
> > environment, there are other ways to scale up and out while retaining
> > ZooKeeper as your reliable configuration/metadata store.

Re: ZK vs KRaft benchmarking - latency differences?

2024-02-02 Thread Doğuşcan Namal
Hey Michael, thanks for your comments.  The first improvement you
mentioned, faster controller failover, is already known to me.  But could
you expand a bit on the second one you suggest, faster consumer group
failover, and explain why you think it will be better under KRaft?

As you mentioned, these are improvements to recovery times, so from your
mail I understand you wouldn't expect an improvement in latencies either.

On Thu, 1 Feb 2024 at 22:53, Michael K. Edwards 
wrote:

> The interesting numbers are the recovery times after 1) the Kafka broker
> currently acting as the "active" controller (or the sole controller in a
> ZooKeeper-based deployment) goes away; 2) the Kafka broker currently acting
> as the consumer group coordinator for a consumer group with many partitions
> and a high commit rate goes away.  Here "goes away" means as ugly a loss
> mode as can realistically be simulated in your test environment; I suggest
> forcing the to-be-impaired broker into heavy paging by running it inside a
> cgroups container and progressively shrinking the memory cgroup.  It's also
> fun to force high packet loss using iptables.
>
> If you're serious about testing KRaft's survivability under load, then I
> suggest you compare against a ZooKeeper deployment that's relatively
> non-broken.  That means setting up a ZooKeeper observer
> https://zookeeper.apache.org/doc/current/zookeeperObservers.html local to
> each broker.  Personally I'd want to test with a large number of partitions
> (840 or 2520 per topic, tens of thousands overall), especially in the
> coordinator-failure scenario.  I haven't been following the horizontal
> scaling work closely, but I suspect that still means porting forward the
> Dropwizard-based metrics patch I wrote years ago.  If I were doing that,
> I'd bring the shared dependencies of zookeeper and kafka up to current and
> do a custom zookeeper build off of the 3.9.x branch (compare
>
> https://github.com/mkedwards/zookeeper/commit/e608be61a3851c128088d9c9c54871f56aa05012
> and consider backporting
>
> https://github.com/apache/zookeeper/commit/5894dc88cce1f4675809fb347cc60d3e0ebf08d4
> ).
> Then I'd do https://github.com/mkedwards/kafka/tree/bitpusher-2.3 all over
> again, starting from the kafka 3.6.x branch and synchronizing the shared
> dependencies.
>
> If you'd like to outsource that work, I'm available on a consulting basis
> :D  Seriously, ZooKeeper itself has in my opinion never been the problem,
> at least since it got revived after the sad 3.14.1x / 3.5.x-alpha days.
> Inadequately resourced and improperly deployed ZooKeeper clusters have been
> a problem, as has the use of JMX to do the job of a modern metrics
> library.  The KRaft ship has sailed as far as upstream development is
> concerned; but if you're not committed to that in your production
> environment, there are other ways to scale up and out while retaining
> ZooKeeper as your reliable configuration/metadata store.  (It's also
> cost-effective and latency-feasible to run a cross-AZ ZooKeeper cluster,
> which I would not attempt with Kafka brokers in any kind of large-scale
> production setting.)
>
> Cheers,
> - Michael
>
> On Thu, Feb 1, 2024 at 7:02 AM Doğuşcan Namal 
> wrote:
>
> > Hi Paul,
> >
> > I did some benchmarking as well and couldn't find a marginal difference
> > between KRaft and Zookeeper on end to end latency from producers to
> > consumers. I tested it on Kafka version 3.5.1 and used openmessaging's
> > benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .
> >
> > What I noticed was if you run the tests long enough(60 mins) the
> throughput
> > converges to the same value eventually. I also noticed some difference on
> > p99+ latencies between Zookeeper and KRaft clusters but the results were
> > not consistent on repetitive runs.
> >
> > Which version did you make the tests on and what are your findings?
> >
> > On Wed, 31 Jan 2024 at 22:57, Brebner, Paul
> > wrote:
> >
> > > Hi all,
> > >
> > > We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> > > found no difference in throughput (which we believed is also what
> theory
> > > predicted, as ZK/Kraft are only involved in Kafka meta-data operations,
> > not
> > > data workloads).
> > >
> > > BUT – latest tests reveal improved producer and consumer latency for
> > Kraft
> > > c.f. ZooKeeper.  So just wanted to check if Kraft is actually involved
> in
> > > any aspect of write/read workloads? For example, some documentation
> > > (possibly old) suggests that consumer offsets are stored in meta-data?
> > In
> > > which case this could explain better Kraft latencies. But if not, then
> > I’m
> > > curious to understand the difference (and if it’s documented anywhere?)
> > >
> > > Also if anyone has noticed the same regarding latency in benchmarks.
> > >
> > > Regards, Paul Brebner
> > >
> >
>


Re: ZK vs KRaft benchmarking - latency differences?

2024-02-01 Thread Michael K. Edwards
The interesting numbers are the recovery times after 1) the Kafka broker
currently acting as the "active" controller (or the sole controller in a
ZooKeeper-based deployment) goes away; 2) the Kafka broker currently acting
as the consumer group coordinator for a consumer group with many partitions
and a high commit rate goes away.  Here "goes away" means as ugly a loss
mode as can realistically be simulated in your test environment; I suggest
forcing the to-be-impaired broker into heavy paging by running it inside a
cgroups container and progressively shrinking the memory cgroup.  It's also
fun to force high packet loss using iptables.
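
If it helps, here's a rough sketch in Java of the cgroup-squeeze part.  It
assumes cgroup v2, that the broker process has already been moved into a
cgroup I'm calling kafka-victim (a made-up name), and that you have
permission to write its memory.max; the starting size, floor, and step are
all illustrative:

import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupSqueezer {
    public static void main(String[] args) throws Exception {
        // Assumption: cgroup v2, broker already placed in /sys/fs/cgroup/kafka-victim,
        // and this process has permission to write memory.max.
        Path memMax = Path.of("/sys/fs/cgroup/kafka-victim/memory.max");
        long limit = 8L * 1024 * 1024 * 1024;   // start at 8 GiB (illustrative)
        long floor = 512L * 1024 * 1024;        // stop at 512 MiB

        while (limit > floor) {
            Files.writeString(memMax, Long.toString(limit));
            System.out.printf("memory.max -> %d MiB%n", limit / (1024 * 1024));
            Thread.sleep(60_000);               // give the broker time to start paging
            limit = limit * 3 / 4;              // shrink by 25% each step
        }
    }
}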

If you're serious about testing KRaft's survivability under load, then I
suggest you compare against a ZooKeeper deployment that's relatively
non-broken.  That means setting up a ZooKeeper observer
https://zookeeper.apache.org/doc/current/zookeeperObservers.html local to
each broker.  Personally I'd want to test with a large number of partitions
(840 or 2520 per topic, tens of thousands overall), especially in the
coordinator-failure scenario.  I haven't been following the horizontal
scaling work closely, but I suspect that still means porting forward the
Dropwizard-based metrics patch I wrote years ago.  If I were doing that,
I'd bring the shared dependencies of zookeeper and kafka up to current and
do a custom zookeeper build off of the 3.9.x branch (compare
https://github.com/mkedwards/zookeeper/commit/e608be61a3851c128088d9c9c54871f56aa05012
and consider backporting
https://github.com/apache/zookeeper/commit/5894dc88cce1f4675809fb347cc60d3e0ebf08d4).
Then I'd do https://github.com/mkedwards/kafka/tree/bitpusher-2.3 all over
again, starting from the kafka 3.6.x branch and synchronizing the shared
dependencies.
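
For the partition-count side, something along these lines would get you
there; the bootstrap address, topic names, topic count, and replication
factor are assumptions for illustration (24 topics at 840 partitions each
puts you in the tens of thousands overall):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBenchTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: adjust to your cluster

        try (Admin admin = Admin.create(props)) {
            List<NewTopic> topics = new ArrayList<>();
            for (int i = 0; i < 24; i++) {
                // 840 partitions per topic, replication factor 3 (illustrative)
                topics.add(new NewTopic("bench-" + i, 840, (short) 3));
            }
            admin.createTopics(topics).all().get();
        }
    }
}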

If you'd like to outsource that work, I'm available on a consulting basis
:D  Seriously, ZooKeeper itself has in my opinion never been the problem,
at least since it got revived after the sad 3.4.1x / 3.5.x-alpha days.
Inadequately resourced and improperly deployed ZooKeeper clusters have been
a problem, as has the use of JMX to do the job of a modern metrics
library.  The KRaft ship has sailed as far as upstream development is
concerned; but if you're not committed to that in your production
environment, there are other ways to scale up and out while retaining
ZooKeeper as your reliable configuration/metadata store.  (It's also
cost-effective and latency-feasible to run a cross-AZ ZooKeeper cluster,
which I would not attempt with Kafka brokers in any kind of large-scale
production setting.)

Cheers,
- Michael

On Thu, Feb 1, 2024 at 7:02 AM Doğuşcan Namal 
wrote:

> Hi Paul,
>
> I did some benchmarking as well and couldn't find a marginal difference
> between KRaft and Zookeeper on end to end latency from producers to
> consumers. I tested it on Kafka version 3.5.1 and used openmessaging's
> benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .
>
> What I noticed was if you run the tests long enough(60 mins) the throughput
> converges to the same value eventually. I also noticed some difference on
> p99+ latencies between Zookeeper and KRaft clusters but the results were
> not consistent on repetitive runs.
>
> Which version did you make the tests on and what are your findings?
>
> On Wed, 31 Jan 2024 at 22:57, Brebner, Paul
> wrote:
>
> > Hi all,
> >
> > We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> > found no difference in throughput (which we believed is also what theory
> > predicted, as ZK/Kraft are only involved in Kafka meta-data operations,
> not
> > data workloads).
> >
> > BUT – latest tests reveal improved producer and consumer latency for
> Kraft
> > c.f. ZooKeeper.  So just wanted to check if Kraft is actually involved in
> > any aspect of write/read workloads? For example, some documentation
> > (possibly old) suggests that consumer offsets are stored in meta-data?
> In
> > which case this could explain better Kraft latencies. But if not, then
> I’m
> > curious to understand the difference (and if it’s documented anywhere?)
> >
> > Also if anyone has noticed the same regarding latency in benchmarks.
> >
> > Regards, Paul Brebner
> >
>


Re: ZK vs KRaft benchmarking - latency differences?

2024-02-01 Thread Doğuşcan Namal
Hi Paul,

I did some benchmarking as well and couldn't find a meaningful difference
between KRaft and ZooKeeper in end-to-end latency from producers to
consumers.  I tested on Kafka 3.5.1 and used the OpenMessaging
benchmarking framework https://openmessaging.cloud/docs/benchmarks/ .
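
For anyone who wants a quick sanity check outside the framework, a
bare-bones end-to-end latency probe might look like the sketch below; the
topic name, bootstrap address, and group id are assumptions, and a
single-threaded ping-pong like this only roughly approximates what the
OpenMessaging benchmark measures:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.LongSerializer;

public class E2eLatencyProbe {
    public static void main(String[] args) throws Exception {
        String topic = "latency-probe";   // assumption: a pre-created topic

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", LongSerializer.class.getName());
        pp.put("value.serializer", LongSerializer.class.getName());

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "latency-probe");
        cp.put("auto.offset.reset", "latest");
        cp.put("key.deserializer", LongDeserializer.class.getName());
        cp.put("value.deserializer", LongDeserializer.class.getName());

        try (KafkaProducer<Long, Long> producer = new KafkaProducer<>(pp);
             KafkaConsumer<Long, Long> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of(topic));
            for (long i = 0; i < 1000; i++) {
                // The record value carries the send timestamp; the consumer computes
                // the delta on receipt.  The first few sends may be missed while the
                // consumer group is still joining.
                producer.send(new ProducerRecord<>(topic, i, System.currentTimeMillis()));
                producer.flush();
                for (ConsumerRecord<Long, Long> r : consumer.poll(Duration.ofSeconds(5))) {
                    System.out.println("e2e ms: " + (System.currentTimeMillis() - r.value()));
                }
            }
        }
    }
}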

What I noticed was that if you run the tests long enough (60 minutes), the
throughput eventually converges to the same value.  I also noticed some
differences in p99+ latencies between ZooKeeper and KRaft clusters, but the
results were not consistent across repeated runs.

Which version did you run your tests on, and what were your findings?

On Wed, 31 Jan 2024 at 22:57, Brebner, Paul 
wrote:

> Hi all,
>
> We’ve previously done some benchmarking of Kafka ZooKeeper vs KRaft and
> found no difference in throughput (which we believed is also what theory
> predicted, as ZK/Kraft are only involved in Kafka meta-data operations, not
> data workloads).
>
> BUT – latest tests reveal improved producer and consumer latency for Kraft
> c.f. ZooKeeper.  So just wanted to check if Kraft is actually involved in
> any aspect of write/read workloads? For example, some documentation
> (possibly old) suggests that consumer offsets are stored in meta-data?  In
> which case this could explain better Kraft latencies. But if not, then I’m
> curious to understand the difference (and if it’s documented anywhere?)
>
> Also if anyone has noticed the same regarding latency in benchmarks.
>
> Regards, Paul Brebner
>


ZK vs KRaft benchmarking - latency differences?

2024-01-31 Thread Brebner, Paul
Hi all,

We’ve previously done some benchmarking of Kafka with ZooKeeper vs KRaft and
found no difference in throughput (which we believed was also what theory
predicted, as ZK/KRaft is only involved in Kafka metadata operations, not
data workloads).

BUT – our latest tests reveal improved producer and consumer latency for
KRaft compared with ZooKeeper.  So I just wanted to check whether KRaft is
actually involved in any aspect of write/read workloads.  For example, some
documentation (possibly old) suggests that consumer offsets are stored in
metadata, in which case this could explain the better KRaft latencies.  But
if not, then I’m curious to understand the difference (and whether it’s
documented anywhere).

I’m also wondering whether anyone else has noticed the same latency
difference in their benchmarks.

Regards, Paul Brebner