We've started running our usual suite of performance tests against Kafka
0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
run a fairly normal mixed workload of producers and consumers (each
producer/consumer is just an instance of Kafka's built-in consumer/producer
perf test). We've found about a 33% performance drop in the producer when
TLS is used (compared to 0.9.0.1).

We've seen notable producer performance degradations between 0.9.0.1 and
0.10.0.0 RC. We're running as of the commit 9404680 right now.

Our specific test case runs Kafka on 8 EC2 machines, with enhanced
networking. Nothing is changed between the instances, and I've reproduced
this over 4 different sets of clusters now. We're seeing about a 33%
performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680. Please
note that this doesn't match up with
https://issues.apache.org/jira/browse/KAFKA-3565, because our performance
tests are with compression off, and this seems to be a TLS-only issue.

Under 0.10.0-rc4, we see an 8-node cluster with a replication factor of 3
and 13 producers max out at around 1 million 100-byte messages a second.
Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
tests were with TLS on. I've reproduced this on multiple clusters now (5 or
so of each version) to account for the inherent performance variance of
EC2. There's no notable performance difference without TLS on these runs -
it appears to be a TLS regression entirely.

A single producer with TLS under 0.10 does about 75k messages/s. Under
0.9.0.1, it does around 120k messages/s.
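
As a quick check on the figures above, the relative drop can be computed directly from the quoted throughputs. This is only an arithmetic sketch; the helper name pct_drop is hypothetical, and the inputs are the rounded numbers from this message.

```shell
#!/bin/sh
# Hypothetical helper: percentage drop from a "before" to an "after" throughput.
pct_drop() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Cluster-wide: 1.5 million msgs/s (0.9.0.1) vs 1 million msgs/s (0.10.0-rc4)
pct_drop 1500000 1000000   # prints 33.3

# Single producer: 120k msgs/s (0.9.0.1) vs 75k msgs/s (0.10)
pct_drop 120000 75000      # prints 37.5
```

So the cluster-wide numbers give the roughly 33% drop quoted above, and the single-producer numbers give a slightly larger drop of about 37.5%.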

The exact producer-perf line we're using is this:

bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" \
  --record-size "100" --throughput "100" --producer-props acks="-1" \
  bootstrap.servers=REDACTED ssl.keystore.location=client.jks \
  ssl.keystore.password=REDACTED ssl.truststore.location=server.jks \
  ssl.truststore.password=REDACTED \
  ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL

We're using the same setup, machine type, etc. for each test run.
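
For anyone wanting to repeat the TLS versus plaintext comparison, a dry-run sketch along these lines prints the two command lines side by side without running anything. The function name perf_cmd and the BOOTSTRAP variable are hypothetical; the flags are the ones from the command above, and the plaintext variant assumes the brokers also expose a PLAINTEXT listener at the given address (in practice it is usually a different port from the SSL one).

```shell
#!/bin/sh
# Dry-run sketch: prints the perf-test command for a mode instead of running it.
# perf_cmd and BOOTSTRAP are hypothetical names, not part of Kafka.
BOOTSTRAP="${BOOTSTRAP:-REDACTED}"

perf_cmd() {
  # Flags shared by both variants, copied from the command above.
  common="bin/kafka-producer-perf-test --topic bench --num-records 500000000"
  common="$common --record-size 100 --throughput 100"
  common="$common --producer-props acks=-1 bootstrap.servers=$BOOTSTRAP"
  case "$1" in
    ssl)
      echo "$common security.protocol=SSL" \
        "ssl.keystore.location=client.jks ssl.keystore.password=REDACTED" \
        "ssl.truststore.location=server.jks ssl.truststore.password=REDACTED" \
        "ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1"
      ;;
    plaintext)
      # Assumes a PLAINTEXT listener is reachable at $BOOTSTRAP.
      echo "$common security.protocol=PLAINTEXT"
      ;;
    *)
      echo "usage: perf_cmd ssl|plaintext" >&2
      return 1
      ;;
  esac
}

perf_cmd ssl
perf_cmd plaintext
```

Keeping everything except the security settings identical between the two variants is what lets a difference in throughput be attributed to TLS.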

We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS
performance impact was there for both.

I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
anything that seemed to have this kind of impact - indeed the TLS code
doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.

Any thoughts? Should I file an issue and see about reproducing a more
minimal test case?

I don't think this is related to
https://issues.apache.org/jira/browse/KAFKA-3565 - that issue is about
compression being on over plaintext, whereas this affects TLS only.
