[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu Yang updated KAFKA-7364: --------------------------- Description: while testing ssl writing to kafka, we found that kafka often run into high cpu usage due to inefficiency in jdk ssl implementation. In detail, we use a test cluster of 12 d2.8xlarge instances that uses kafka 2.0.0, jdk-10.0.2, and hosts only one topic that have ~20k producers write to through ssl channel. We observed that the network threads often get 100% cpu usage after enabling ssl writing to kafka. To improve kafka's throughput, we have "num.network.threads=32" for the broker. Even with 32 network threads, we see the broker cpu usage jump right after ssl writing is enabled. The broker's cpu usage would drop immediately when we disabled ssl writing. !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! When the broker's cpu usage is high, 'perf top' shows that kafka is busy with executing code in libsunec.so. The following is a sample stack track that we get when the broker's cpu usage was high. This seems to be related to inefficiency in jdk ssl related implementation. Switching to use https://github.com/netty/netty-tcnative to handle ssl handshake can be helpful. {code} Thread 77562: (state = IN_NATIVE) - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], byte[]) @bci=0 (Compiled frame; information may be imprecise) - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 (Compiled frame) - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 (Compiled frame) - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame) - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, java.lang.String) @bci=136, line=444 (Compiled frame) - sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) @bci=48, line=166 (Compiled frame) - sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, java.util.Collection) @bci=24, line=147 (Compiled frame) - sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, java.util.List, java.util.List) @bci=316, line=125 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) @bci=217, line=141 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame) - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame) - sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame) - sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=232, line=259 (Compiled frame) - sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=6, line=260 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, java.lang.String) @bci=10, line=324 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame) - sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg) @bci=190, line=1966 (Compiled frame) - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, line=237 (Compiled frame) - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame) - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame) - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame) - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame) - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() @bci=13, line=393 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) @bci=88, line=473 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, line=331 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.handshake() @bci=48, line=258 (Compiled frame) - org.apache.kafka.common.network.KafkaChannel.prepare() @bci=16, line=125 (Compiled frame) - org.apache.kafka.common.network.Selector.pollSelectionKeys(java.util.Set, boolean, long) @bci=251, line=494 (Compiled frame) - org.apache.kafka.common.network.Selector.poll(long) @bci=312, line=432 (Compiled frame) - kafka.network.Processor.poll() @bci=7, line=678 (Compiled frame) - kafka.network.Processor.run() @bci=31, line=583 (Compiled frame) - java.lang.Thread.run() @bci=11, line=748 (Interpreted frame) {code} was: while testing ssl writing to kafka, we found that kafka often run into high cpu usage due to inefficiency in jdk ssl implementation. In detail, we use a test cluster of 12 d2.8xlarge instances that uses kafka 2.0.0, jdk-10.0.2, and hosts only one topic that have ~20k producers write to through ssl channel. We observed that the network threads often get 100% cpu usage after enabling ssl writing to kafka. To improve kafka's throughput, we have "num.network.threads=32" for the broker. Even with 32 network threads, we see the broker cpu usage jump right after ssl writing is enabled. The broker's cpu usage would drop immediately when we disabled ssl writing. !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! When the broker's cpu usage is high, 'perf top' shows that kafka is busy with executing code in libsunec.so. The following is a sample stack track that we get when the broker's cpu usage was high. This seems to be related to inefficiency in jdk ssl related implementation. Switching to use https://github.com/netty/netty-tcnative to handle ssl handshake might be helpful. {code} Thread 77562: (state = IN_NATIVE) - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], byte[]) @bci=0 (Compiled frame; information may be imprecise) - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 (Compiled frame) - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 (Compiled frame) - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame) - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, java.lang.String) @bci=136, line=444 (Compiled frame) - sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) @bci=48, line=166 (Compiled frame) - sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, java.util.Collection) @bci=24, line=147 (Compiled frame) - sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, java.util.List, java.util.List) @bci=316, line=125 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) @bci=217, line=141 (Compiled frame) - sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame) - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame) - sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame) - sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=232, line=259 (Compiled frame) - sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=6, line=260 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, java.lang.String) @bci=10, line=324 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 (Compiled frame) - sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame) - sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg) @bci=190, line=1966 (Compiled frame) - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, line=237 (Compiled frame) - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame) - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame) - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame) - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame) - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() @bci=13, line=393 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) @bci=88, line=473 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, line=331 (Compiled frame) - org.apache.kafka.common.network.SslTransportLayer.handshake() @bci=48, line=258 (Compiled frame) - org.apache.kafka.common.network.KafkaChannel.prepare() @bci=16, line=125 (Compiled frame) - org.apache.kafka.common.network.Selector.pollSelectionKeys(java.util.Set, boolean, long) @bci=251, line=494 (Compiled frame) - org.apache.kafka.common.network.Selector.poll(long) @bci=312, line=432 (Compiled frame) - kafka.network.Processor.poll() @bci=7, line=678 (Compiled frame) - kafka.network.Processor.run() @bci=31, line=583 (Compiled frame) - java.lang.Thread.run() @bci=11, line=748 (Interpreted frame) {code} > kafka periodically run into high cpu usage with ssl writing > ----------------------------------------------------------- > > Key: KAFKA-7364 > URL: https://issues.apache.org/jira/browse/KAFKA-7364 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 2.0.0 > Reporter: Yu Yang > Priority: Major > Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png > > > while testing ssl writing to kafka, we found that kafka often run into high > cpu usage due to inefficiency in jdk ssl implementation. > In detail, we use a test cluster of 12 d2.8xlarge instances that uses kafka > 2.0.0, jdk-10.0.2, and hosts only one topic that have ~20k producers write > to through ssl channel. We observed that the network threads often get 100% > cpu usage after enabling ssl writing to kafka. To improve kafka's > throughput, we have "num.network.threads=32" for the broker. Even with 32 > network threads, we see the broker cpu usage jump right after ssl writing is > enabled. The broker's cpu usage would drop immediately when we disabled ssl > writing. > !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! > When the broker's cpu usage is high, 'perf top' shows that kafka is busy with > executing code in libsunec.so. The following is a sample stack track that > we get when the broker's cpu usage was high. This seems to be related to > inefficiency in jdk ssl related implementation. Switching to use > https://github.com/netty/netty-tcnative to handle ssl handshake can be > helpful. > {code} > Thread 77562: (state = IN_NATIVE) > - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], > byte[]) @bci=0 (Compiled frame; information may be imprecise) > - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 > (Compiled frame) > - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 > (Compiled frame) > - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame) > - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, > java.lang.String) @bci=136, line=444 (Compiled frame) > - > sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) > @bci=48, line=166 (Compiled frame) > - > sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, > java.util.Collection) @bci=24, line=147 (Compiled frame) > - > sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, > java.util.List, java.util.List) @bci=316, line=125 (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, > sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 > (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) > @bci=217, line=141 (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, > java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame) > - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, > java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame) > - > sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], > java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame) > - > sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], > java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) > @bci=232, line=259 (Compiled frame) > - > sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], > java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) > @bci=6, line=260 (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, > java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, > java.lang.String) @bci=10, line=324 (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], > java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 > (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], > java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame) > - > sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg) > @bci=190, line=1966 (Compiled frame) > - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, > line=237 (Compiled frame) > - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled > frame) > - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame) > - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame) > - > java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, > java.security.AccessControlContext) @bci=0 (Compiled frame) > - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 > (Compiled frame) > - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() > @bci=13, line=393 (Compiled frame) > - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) > @bci=88, line=473 (Compiled frame) > - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, > line=331 (Compiled frame) > - org.apache.kafka.common.network.SslTransportLayer.handshake() @bci=48, > line=258 (Compiled frame) > - org.apache.kafka.common.network.KafkaChannel.prepare() @bci=16, line=125 > (Compiled frame) > - org.apache.kafka.common.network.Selector.pollSelectionKeys(java.util.Set, > boolean, long) @bci=251, line=494 (Compiled frame) > - org.apache.kafka.common.network.Selector.poll(long) @bci=312, line=432 > (Compiled frame) > - kafka.network.Processor.poll() @bci=7, line=678 (Compiled frame) > - kafka.network.Processor.run() @bci=31, line=583 (Compiled frame) > - java.lang.Thread.run() @bci=11, line=748 (Interpreted frame) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)