[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:49 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~20k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
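
For reference, a minimal sketch of the producer-side configuration that clients writing to the SSL listener above would need; since the broker sets ssl.client.auth=required, each producer must also present a keystore. The bootstrap address, topic name, paths, and passwords below are placeholders, not values from this report.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; the SSL listener above is on port 9093.
        props.put("bootstrap.servers", "broker-host:9093");
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/path/to/truststore.jks");
        props.put("ssl.truststore.password", "truststore_password");
        // ssl.client.auth=required on the broker, so the client supplies a keystore too.
        props.put("ssl.keystore.location", "/path/to/keystore.jks");
        props.put("ssl.keystore.password", "keystore_password");
        props.put("ssl.key.password", "key_password");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}
{code}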

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:20 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:17 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

 
 The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 5:52 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

 
 The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!


was (Author: yuyang08):
[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances,
a 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap
size, we did not see frequent gc. But at the same time, we still hit the high cpu
usage issue that is documented in KAFKA-7364. Did you see a similar high cpu
usage issue in your case?

 
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]


 
[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster during this period of time:

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500px!


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-28 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:28 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS03LTYtMzU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic 
to the cluster, letting the broker garbage collect the objects in the old 
gen, and resuming the ssl writing traffic.  Sometimes the cluster still  could 
not recover fully due to a dramatic increase of heap size and high cpu usage when 
we turned on the ssl writing traffic again. 
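
For reference, the broker setting that governs how long idle connections are kept open before the socket server closes them is connections.max.idle.ms, which defaults to 600000 ms (10 minutes); the snippet below is only an illustration of that knob, not a recommendation for this cluster.
{code}
# connections idle longer than this many milliseconds are closed by the broker
connections.max.idle.ms=600000
{code}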


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic 
to the cluster, letting the broker garbage collect the objects in the old 
gen, and resuming the ssl writing traffic.  Sometimes the cluster still  could 
not recover fully due to a dramatic increase of heap size and high cpu usage when 
we turned on the ssl writing traffic again. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-28 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:26 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic 
to the cluster, letting the broker garbage collect the objects in the old 
gen, and resuming the ssl writing traffic.  Sometimes the cluster still  could 
not recover fully due to a dramatic increase of heap size and high cpu usage when 
we turned on the ssl writing traffic again. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 5:43 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0xLTAtNDc=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 1:01 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0xLTAtNDc=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMy0xMC02

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/27/18 11:10 PM:
--

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMy0xMC02

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/27/18 11:08 PM:
--

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there were no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered after we turned off the ssl writing traffic to the 
cluster, let the broker garbage collect the objects in the old gen, and 
resume the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-18 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584894#comment-16584894
 ] 

Ted Yu edited comment on KAFKA-7304 at 8/18/18 7:38 PM:


Through a bit of additional logging, here is what happens in testMuteOnOOM.
There are two channels added to closingChannels:
{code}
adding org.apache.kafka.common.network.KafkaChannel@334b860c for clientX
adding org.apache.kafka.common.network.KafkaChannel@334b860d for clientY
{code}
Later, when Selector.close() is called by tearDown, the channel for clientY is 
still in closingChannels :
{code}
There are 1 entries in closingChannels
org.apache.kafka.common.network.KafkaChannel@334b860d
{code}
My change above would close the channel left in closingChannels, preventing 
the memory leak.


was (Author: yuzhih...@gmail.com):
Through a bit of additional logging, here is what happens in testMuteOnOOM.
There are two channels registered at the beginning of the test:
{code}
adding org.apache.kafka.common.network.KafkaChannel@334b860c for clientX
adding org.apache.kafka.common.network.KafkaChannel@334b860d for clientY
{code}
Later, when Selector.close() is called by tearDown, the channel for clientY is 
in closingChannels :
{code}
There are 1 entries in closingChannels
org.apache.kafka.common.network.KafkaChannel@334b860d
{code}
My change above would close the channel left in closingChannels, preventing 
the memory leak.

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584156#comment-16584156
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/17/18 4:51 PM:
-

[~ijuma] We have an internal build that cherry-picks 1.1.1 changes. I might be 
missing some fixes. 

https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java
 shows that there were only two Selector.java-related changes after the 1.1.0 
release date, which was March 23rd. 

 Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ? 
kafka 1.1.0 has included that change. 


was (Author: yuyang08):
[~ijuma] We have an internal build that cherry-picks 1.1.1 changes. I might be 
missing some fixes. 

https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java
 shows that there were only two Selector.java-related changes after the 1.1.0 
release date, which was March 23rd. 

 Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ?  We 
have included that change. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G memory setting. We have tried increasing the heap
> size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two channel map fields in Selector. It seems that somehow the objects are
> not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use Java 1.8.0_102, and have applied a TLS patch reducing the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}





[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583570#comment-16583570
 ] 

Ted Yu edited comment on KAFKA-7304 at 8/17/18 12:18 PM:
-

Looking at the close() method, I don't see where the channels in 
closingChannels are closed (if the given id is found in the channels map).
{code}
diff --git a/clients/src/main/java/org/apache/kafka/common/network/Selector.java b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
index 7e32509..2164a40 100644
--- a/clients/src/main/java/org/apache/kafka/common/network/Selector.java
+++ b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
@@ -320,6 +320,10 @@ public class Selector implements Selectable, AutoCloseable {
         }
         sensors.close();
         channelBuilder.close();
+        for (Map.Entry<String, KafkaChannel> entry : this.closingChannels.entrySet()) {
+            doClose(entry.getValue(), false);
+        }
+        this.closingChannels.clear();
     }

     /**
{code}
I wonder if the above change would fix the leakage.
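As a side note, here is a minimal, self-contained sketch of the retention problem (the class and field names below are made up for illustration, this is not Kafka's actual Selector): a channel parked in a closingChannels-style map stays strongly reachable, together with its buffers, until close() also walks and clears that map, which is what the change above adds.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy illustration only (names are made up, this is not Kafka code):
// entries left behind in a "closingChannels"-style map keep their channel
// objects, and therefore their buffers, strongly reachable until the map
// is cleared when the selector itself is closed.
public class ClosingChannelsSketch {

    // Stand-in for a KafkaChannel holding network/TLS buffers.
    static class FakeChannel {
        final byte[] buffers = new byte[64 * 1024];
        void close() { /* release resources */ }
    }

    private final Map<String, FakeChannel> channels = new HashMap<>();
    private final Map<String, FakeChannel> closingChannels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new FakeChannel());
    }

    // Mirrors the scenario discussed above: the channel is moved to
    // closingChannels and is never removed if only the channels map is cleaned up.
    void beginClose(String id) {
        FakeChannel ch = channels.remove(id);
        if (ch != null)
            closingChannels.put(id, ch);
    }

    // The idea of the proposed fix: on full close, also close and drop
    // anything still sitting in closingChannels.
    void close() {
        for (FakeChannel ch : channels.values())
            ch.close();
        channels.clear();
        for (FakeChannel ch : closingChannels.values())
            ch.close();
        closingChannels.clear();   // without this, the buffers stay reachable
    }

    public static void main(String[] args) {
        ClosingChannelsSketch selector = new ClosingChannelsSketch();
        selector.connect("client-1");
        selector.beginClose("client-1");
        selector.close();
        System.out.println("closingChannels after close: " + selector.closingChannels.size());
    }
}
{code}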


was (Author: yuzhih...@gmail.com):
Looking at the close() method, I don't see where the channels in 
closingChannels are closed.
{code}
diff --git a/clients/src/main/java/org/apache/kafka/common/network/Selector.java b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
index 7e32509..2164a40 100644
--- a/clients/src/main/java/org/apache/kafka/common/network/Selector.java
+++ b/clients/src/main/java/org/apache/kafka/common/network/Selector.java
@@ -320,6 +320,10 @@ public class Selector implements Selectable, AutoCloseable {
         }
         sensors.close();
         channelBuilder.close();
+        for (Map.Entry<String, KafkaChannel> entry : this.closingChannels.entrySet()) {
+            doClose(entry.getValue(), false);
+        }
+        this.closingChannels.clear();
     }

     /**
{code}
I wonder if the above change would fix the leakage.

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to Kafka through SSL. At small scale, SSL 
> writing to Kafka was fine. However, when we enabled SSL writing at a larger 
> scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size 
> to 10Gb, but encountered the same issue. 
> We took a few heap dumps and found that most of the heap memory is 
> referenced through the org.apache.kafka.common.network.Selector object. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from the maps in a timely manner. 
> One observation is that the memory leak seems related to Kafka partition 
> leader changes. If a broker restart etc. in the cluster causes partition 
> leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample GC 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running Kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use Java 1.8.0_102, and have applied a TLS patch reducing the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/17/18 8:01 AM:
-

[~yuzhih...@gmail.com] There were no exceptions in server.log before the broker 
hit frequent full GC. There were various errors in the log after the broker ran 
into full GC, but I think those exceptions are not relevant to the root 
cause. 


was (Author: yuyang08):
[~yuzhih...@gmail.com] There were no exceptions in server.log before we hit 
frequent full GC. There were various errors in the log after the broker ran into 
full GC, but I think those exceptions are not relevant to the root cause. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to Kafka through SSL. At small scale, SSL 
> writing to Kafka was fine. However, when we enabled SSL writing at a larger 
> scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size 
> to 10Gb, but encountered the same issue. 
> We took a few heap dumps and found that most of the heap memory is 
> referenced through the org.apache.kafka.common.network.Selector object. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from the maps in a timely manner. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample GC 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running Kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use Java 1.8.0_102, and have applied a TLS patch reducing the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}


