[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251 ]

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:49 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, a 16g max heap size for the kafka process, and ~20k producers. With the 16gb heap size, we did not see frequent gc. But at the same time, we still hit the high cpu usage issue documented in KAFKA-7364. Did you see a high-cpu-usage issue in your case? The following is our ssl-related kafka setting:

{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
{code}

The following is the gc chart on a broker using a kafka 2.0 binary with commits up to [https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to almost 100% after we enabled TLS-based writing to the cluster.

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 for details.

{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}
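Since the broker config above sets ssl.client.auth=required, each of the producers has to present its own keystore in addition to trusting the broker's certificate. A minimal client-side sketch of the matching settings (the class name, paths, and passwords are illustrative placeholders, not our actual values):

```java
import java.util.Properties;

public class SslClientConfig {
    // Hypothetical helper: builds client-side SSL settings mirroring the
    // broker config above. All locations/passwords are placeholders.
    static Properties sslClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");   // the SSL listener port
        props.put("security.protocol", "SSL");
        // ssl.client.auth=required on the broker means the client must
        // present its own keystore, not just a truststore.
        props.put("ssl.keystore.location", "/path/to/client.keystore.jks");
        props.put("ssl.keystore.password", "keystore_password");
        props.put("ssl.key.password", "key_password");
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");
        props.put("ssl.truststore.password", "truststore_password");
        props.put("ssl.enabled.protocols", "TLSv1.2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(sslClientProps().getProperty("security.protocol"));
    }
}
```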
> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Yu Yang
> Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 PM.png
>
> We are testing secured writing to kafka through ssl. Testing at small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writing concurrently), the kafka brokers soon hit an OutOfMemory issue with a 4G memory setting. We tried increasing the heap size to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel map fields in Selector. It seems that somehow the objects are not deleted from the maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition leader changes. If a broker restart etc. in the cluster causes a partition leadership change, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for sample gc analysis.
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=
> {code}
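The leak described above amounts to channels staying in the Selector's channel maps after they should have been closed, so every leaked channel keeps pinning its SSL buffers. As a rough illustration of the kind of eviction that is missing (this is a sketch, not Kafka's actual Selector code), a bounded, access-ordered map can close out the least-recently-used entry:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a channel map that bounds its size by evicting the
// least-recently-used entry. In a real selector, the eviction hook is where
// the stale channel would be closed and its buffers released.
public class BoundedChannelMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxChannels;

    public BoundedChannelMap(int maxChannels) {
        super(16, 0.75f, true);  // access-order: eldest == least recently used
        this.maxChannels = maxChannels;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Returning true drops the eldest entry once the cap is exceeded.
        return size() > maxChannels;
    }
}
```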
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ]

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:28 AM:
-

After more experiments, we currently think the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb of heap for the kafka process, and had ~40k clients write to a test topic on this cluster. The following graphs show the jvm heap usage and gc activity over the past 24 hours or so. The cluster ran fine with low heap usage and low cpu usage. However, the heap usage and cpu usage of the brokers increased sharply when we added or terminated brokers in this cluster (for broker termination, no topic partitions were allocated on the terminated nodes).

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS03LTYtMzU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic, letting the brokers garbage collect the objects in the old gen, and then resuming the ssl writing traffic. Sometimes the cluster still could not recover fully, due to a dramatic increase in heap usage and high cpu usage when we turned the ssl writing traffic back on.
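For idle connections that linger, one relevant broker knob (shown here as a sketch; the value below is only an example, not a recommendation for this cluster) is connections.max.idle.ms, which controls how long the broker lets a connection sit idle before closing it:

{code}
# Close server-side connections that have been idle longer than this (ms).
# Example value only; the default is 10 minutes.
connections.max.idle.ms=300000
{code}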
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ] Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:26 AM: - After more experiments, we currently think that the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb for kafka process heap space, and have ~40k clients writes to a test topic on this cluster. The following graph shows the jvm heap usage and gc activity in the past 24 hours or so. The cluster ran fine with low heap usage and low cpu usage. However, the heap usage and the cpu usage of brokers increased sharply when we added or terminated brokers in this cluster (for broker termination, there was no topic partitions allocated on those terminated nodes). http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU= http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU= Sometimes the cluster can be recovered by turning off the ssl writing traffic to the cluster, letting the broker to garbage collect the objects in the old gen, and resuming the ssl writing traffic. Sometimes the cluster still could not recover fully due to dramatic increase of heap size and high cpu usage when we turned on the ssl writing traffic again. was (Author: yuyang08): After more experiments, we currently think that the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb for kafka process heap space, and have ~40k clients writes to a test topic on this cluster. The following graph shows the jvm heap usage and gc activity in the past 24 hours or so. The cluster ran fine with low heap usage and low cpu usage. 
However, the heap usage and the cpu usage of brokers increased sharply when we added or terminated brokers in this cluster (for broker termination, there was no topic partitions allocated on those terminated nodes). http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU= The cluster can be recovered by turning off the ssl writing traffic to the cluster, letting the broker to garbage collect the objects in the old gen, and resuming the ssl writing traffic. > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Critical > Fix For: 1.1.2, 2.0.1, 2.1.0 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at > 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot > 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, > Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 > AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png > > > We are testing secured writing to kafka through ssl. Testing at small scale, > ssl writing to kafka was fine. However, when we enabled ssl writing at a > larger scale (>40k clients write concurrently), the kafka brokers soon hit > OutOfMemory issue with 4G memory setting. We have tried with increasing the > heap size to 10Gb, but encountered the same issue. > We took a few heap dumps , and found that most of the heap memory is > referenced through org.apache.kafka.common.network.Selector objects. There > are two Channel maps field in Selector. It seems that somehow the objects is > not deleted from the map in a timely manner. > One observation is that the memory leak seems relate to kafka partition > leader changes. If there is broker restart etc. 
in the cluster that caused > partition leadership change, the brokers may hit the OOM issue faster. > {code} > private final Map channels; > private final Map closingChannels; > {code} > Please see the attached images and the following link for sample gc > analysis. > http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 > the command line for running kafka: > {code} > java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m > -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC > -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 > -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 > -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log > -XX:+UseGCLogFileRotation
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ] Yu Yang edited comment on KAFKA-7304 at 8/28/18 5:43 AM: - After more experiments, we currently think that the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb for kafka process heap space, and have ~40k clients writes to a test topic on this cluster. The following graph shows the jvm heap usage and gc activity in the past 24 hours or so. The cluster ran fine with low heap usage and low cpu usage. However, the heap usage and the cpu usage of brokers increased sharply when we added or terminated brokers in this cluster (for broker termination, there was no topic partitions allocated on those terminated nodes). http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU= The cluster can be recovered by turning off the ssl writing traffic to the cluster, letting the broker to garbage collect the objects in the old gen, and resuming the ssl writing traffic. was (Author: yuyang08): After more experiments, we currently think that the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb for kafka process heap space, and have ~40k clients writes to a test topic on this cluster. The following graph shows the jvm heap usage and gc activity in the past 24 hours or so. The cluster ran fine with low heap usage and low cpu usage. However, the heap usage and the cpu usage of brokers increased sharply when we added or terminated brokers in this cluster (for broker termination, there was no topic partitions allocated on those terminated nodes). 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0xLTAtNDc= The cluster can be recovered by turning off the ssl writing traffic to the cluster, letting the broker to garbage collect the objects in the old gen, and resuming the ssl writing traffic. > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Critical > Fix For: 1.1.2, 2.0.1, 2.1.0 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at > 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot > 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, > Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 > AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png > > > We are testing secured writing to kafka through ssl. Testing at small scale, > ssl writing to kafka was fine. However, when we enabled ssl writing at a > larger scale (>40k clients write concurrently), the kafka brokers soon hit > OutOfMemory issue with 4G memory setting. We have tried with increasing the > heap size to 10Gb, but encountered the same issue. > We took a few heap dumps , and found that most of the heap memory is > referenced through org.apache.kafka.common.network.Selector objects. There > are two Channel maps field in Selector. It seems that somehow the objects is > not deleted from the map in a timely manner. > One observation is that the memory leak seems relate to kafka partition > leader changes. If there is broker restart etc. in the cluster that caused > partition leadership change, the brokers may hit the OOM issue faster. > {code} > private final Map channels; > private final Map closingChannels; > {code} > Please see the attached images and the following link for sample gc > analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the X509Factory.certCache map size from 750 to 20.
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
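The leak pattern described in the issue can be sketched with a toy model (all class and method names here are hypothetical, not the real Selector API): a disconnecting channel that still has buffered receives is parked in `closingChannels` for later draining, and if nothing ever drains that map, entries accumulate for as long as the selector lives.

```java
import java.util.HashMap;
import java.util.Map;

public class SelectorLeakSketch {
    static final class FakeChannel {
        final String id;
        final boolean hasBufferedReceives;
        FakeChannel(String id, boolean buffered) {
            this.id = id;
            this.hasBufferedReceives = buffered;
        }
    }

    final Map<String, FakeChannel> channels = new HashMap<>();
    final Map<String, FakeChannel> closingChannels = new HashMap<>();

    void connect(String id, boolean buffered) {
        channels.put(id, new FakeChannel(id, buffered));
    }

    // Toy version of closing a channel: a channel with buffered receives is
    // parked in closingChannels to be drained later; nothing in this sketch
    // (and, per the heap dumps above, sometimes nothing on the broker)
    // ever removes it.
    void disconnect(String id) {
        FakeChannel ch = channels.remove(id);
        if (ch != null && ch.hasBufferedReceives) {
            closingChannels.put(id, ch);
        }
    }

    int parked() {
        return closingChannels.size();
    }

    public static void main(String[] args) {
        SelectorLeakSketch selector = new SelectorLeakSketch();
        for (int i = 0; i < 1000; i++) {
            selector.connect("client-" + i, i % 2 == 0);
            selector.disconnect("client-" + i);
        }
        System.out.println(selector.parked()); // prints 500
    }
}
```

With ~40k short-lived ssl clients, this kind of unbounded retention in a per-selector map is consistent with the heap growth seen in the gc reports.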
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584894#comment-16584894 ] Ted Yu edited comment on KAFKA-7304 at 8/18/18 7:38 PM:

With a bit of additional logging, here is what happens in testMuteOnOOM. Two channels are added to closingChannels:
{code}
adding org.apache.kafka.common.network.KafkaChannel@334b860c for clientX
adding org.apache.kafka.common.network.KafkaChannel@334b860d for clientY
{code}
Later, when Selector.close() is called by tearDown, the channel for clientY is still in closingChannels:
{code}
There are 1 entries in closingChannels
org.apache.kafka.common.network.KafkaChannel@334b860d
{code}
My change above would close the channel left in closingChannels, preventing the memory leak.
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584156#comment-16584156 ] Yu Yang edited comment on KAFKA-7304 at 8/17/18 4:51 PM:

[~ijuma] We have an internal build that cherry-picks 1.1.1 changes, so I might have missed some fixes. https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java shows that there were only two Selector.java-related changes after the 1.1.0 release date of March 23rd. Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ? kafka 1.1.0 already includes that change.
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583570#comment-16583570 ] Ted Yu edited comment on KAFKA-7304 at 8/17/18 12:18 PM:

Looking at the close() method, I don't see where the channels in closingChannels are closed (if the given id is found in the channels map).
{code}
diff --git a/clients/src/main/java/org/apache/kafka/common/network/Selector.java b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
index 7e32509..2164a40 100644
--- a/clients/src/main/java/org/apache/kafka/common/network/Selector.java
+++ b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
@@ -320,6 +320,10 @@ public class Selector implements Selectable, AutoCloseable {
         }
         sensors.close();
         channelBuilder.close();
+        for (Map.Entry<String, KafkaChannel> entry : this.closingChannels.entrySet()) {
+            doClose(entry.getValue(), false);
+        }
+        this.closingChannels.clear();
     }

     /**
{code}
I wonder if the above change would fix the leakage.
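As a self-contained illustration of the shape of that patch (toy classes with hypothetical names, not the real Selector), draining closingChannels inside close() leaves no lingering references:

```java
import java.util.HashMap;
import java.util.Map;

public class PatchedSelectorSketch {
    static final class Channel {
        boolean open = true;
        void close() { open = false; }
    }

    final Map<String, Channel> channels = new HashMap<>();
    final Map<String, Channel> closingChannels = new HashMap<>();

    // Mirrors the structure of the proposed patch: close the live channels,
    // then explicitly close and clear anything still parked in closingChannels
    // so the map no longer pins the channel objects in the heap.
    void close() {
        for (Channel ch : channels.values()) {
            ch.close();
        }
        channels.clear();
        for (Map.Entry<String, Channel> entry : closingChannels.entrySet()) {
            entry.getValue().close();
        }
        closingChannels.clear();
    }

    public static void main(String[] args) {
        PatchedSelectorSketch selector = new PatchedSelectorSketch();
        selector.channels.put("clientX", new Channel());
        selector.closingChannels.put("clientY", new Channel());
        selector.close();
        System.out.println(selector.closingChannels.size()); // prints 0
    }
}
```

In the testMuteOnOOM scenario described above, this corresponds to the clientY channel being closed during tearDown instead of staying referenced from the map.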
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565 ] Yu Yang edited comment on KAFKA-7304 at 8/17/18 8:01 AM:

[~yuzhih...@gmail.com] There were no exceptions in server.log before the broker hit frequent full gc. There were various errors in the log after the broker ran into full gc, but I think those exceptions are not relevant to the root cause.