[jira] [Updated] (KAFKA-10731) have kafka producer & consumer auto-reload ssl certificate

2020-11-17 Thread Yu Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-10731:

Affects Version/s: 2.3.1

> have kafka producer & consumer  auto-reload ssl certificate 
> 
>
> Key: KAFKA-10731
> URL: https://issues.apache.org/jira/browse/KAFKA-10731
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 2.3.1
>Reporter: Yu Yang
>Priority: Major
>
> We use SSL in both brokers and kafka clients for authentication and 
> authorization, and rotate the certificates every 12 hours. Kafka producers 
> and consumers cannot pick up the rotated certs. This causes stream processing 
> interruption (e.g. the flink connector does not handle the ssl exception, and 
> the flink application has to be restarted when we hit this error). We need to 
> improve the kafka producer & consumer clients to support dynamic loading of 
> ssl certificates. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10731) have kafka producer & consumer auto-reload ssl certificate

2020-11-17 Thread Yu Yang (Jira)
Yu Yang created KAFKA-10731:
---

 Summary: have kafka producer & consumer  auto-reload ssl 
certificate 
 Key: KAFKA-10731
 URL: https://issues.apache.org/jira/browse/KAFKA-10731
 Project: Kafka
  Issue Type: Improvement
  Components: security
Reporter: Yu Yang


We use SSL in both brokers and kafka clients for authentication and 
authorization, and rotate the certificates every 12 hours. Kafka producers 
and consumers cannot pick up the rotated certs. This causes stream processing 
interruption (e.g. the flink connector does not handle the ssl exception, and 
the flink application has to be restarted when we hit this error). We need to 
improve the kafka producer & consumer clients to support dynamic loading of 
ssl certificates. 
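For reference, a minimal sketch of the kind of application-side workaround that is 
possible today, assuming certificate rotation can be detected from the keystore file's 
modification time: a wrapper rebuilds the KafkaProducer whenever the keystore changes. 
The class name, path, password and scheduling below are illustrative and not part of 
Kafka; only the standard producer/SSL configs are real.

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SslConfigs;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ReloadingProducerHolder {

    // Hypothetical keystore path, assumed to be rewritten in place by the rotation job.
    private final Path keystore = Paths.get("/etc/kafka/certs/client.keystore.jks");
    private final Properties props = new Properties();
    private final AtomicReference<Producer<byte[], byte[]>> current = new AtomicReference<>();
    private volatile FileTime lastLoaded;

    public ReloadingProducerHolder() throws Exception {
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");            // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, keystore.toString());
        props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "changeit");                 // illustrative
        reload();
    }

    // Called periodically by the application's own scheduler (e.g. once a minute).
    public synchronized void maybeReload() throws Exception {
        FileTime mtime = Files.getLastModifiedTime(keystore);
        if (mtime.compareTo(lastLoaded) > 0) {
            reload();
        }
    }

    private synchronized void reload() throws Exception {
        lastLoaded = Files.getLastModifiedTime(keystore);
        Producer<byte[], byte[]> fresh = new KafkaProducer<>(props);
        Producer<byte[], byte[]> old = current.getAndSet(fresh);
        if (old != null) {
            old.close();   // close() drains in-flight records before the old SSL context is dropped
        }
    }

    // Callers always obtain the producer through this accessor and never cache it.
    public Producer<byte[], byte[]> producer() {
        return current.get();
    }
}
{code}

Such wrappers only avoid the interruption described above because the old producer is 
drained and replaced before its certificate actually expires; native reloading inside 
the client would remove the need for them.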



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10407) add linger.ms parameter support to KafkaLog4jAppender

2020-08-16 Thread Yu Yang (Jira)
Yu Yang created KAFKA-10407:
---

 Summary: add linger.ms parameter support to KafkaLog4jAppender
 Key: KAFKA-10407
 URL: https://issues.apache.org/jira/browse/KAFKA-10407
 Project: Kafka
  Issue Type: Improvement
  Components: logging
Reporter: Yu Yang


Currently KafkaLog4jAppender does not accept a `linger.ms` setting. When a 
service has an outage that causes excessive error logging, the service can 
issue too many producer requests to the kafka brokers and overload them. 
Setting a non-zero `linger.ms` lets the kafka producer batch records and 
reduces the number of producer requests. 
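For illustration, the snippet below configures a plain producer with the batching 
settings the appender would need to pass through; the broker address and concrete 
values are made up, and a `lingerMs` property on the appender would simply forward 
such a value into the producer config it builds internally.

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchedLoggingProducer {

    public static KafkaProducer<byte[], String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // illustrative broker list
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // With the default linger.ms=0 every log record is sent as soon as a sender
        // thread is available; a small non-zero linger lets the producer coalesce
        // records arriving within the window into fewer, larger requests.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        return new KafkaProducer<>(props);
    }
}
{code}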



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-08-07 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang resolved KAFKA-8716.

Resolution: Not A Problem

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-08-07 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902255#comment-16902255
 ] 

Yu Yang commented on KAFKA-8716:


Update: we verified that after upgrading zookeeper to 3.5.5, nodes with the 
kafka 2.3 binary can re-join the cluster fine. Thanks for looking into this issue! 

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894166#comment-16894166
 ] 

Yu Yang edited comment on KAFKA-8716 at 7/27/19 3:19 PM:
-

[~junrao]  We did not find much useful information in our zookeeper logs. It 
seems the problem is related to the zookeeper version that we use: we are 
running zookeeper 3.5.3, which is a beta release. We will upgrade zookeeper to 
3.5.5, which is a stable release, to see if that fixes the issue. 


was (Author: yuyang08):
[~junrao]  it seems that it is related to the zookeeper version that we uses. 
we are using zookeeper 3.5.3 that is a beta version. will upgrade zookeeper to 
3.5.5 that is stable release to see if that fixes the issue. 

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894166#comment-16894166
 ] 

Yu Yang commented on KAFKA-8716:


[~junrao]  It seems the problem is related to the zookeeper version that we 
use: we are running zookeeper 3.5.3, which is a beta release. We will upgrade 
zookeeper to 3.5.5, which is a stable release, to see if that fixes the issue. 

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894026#comment-16894026
 ] 

Yu Yang commented on KAFKA-8716:


The following is the log (with debug logging enabled) around the exception: 
{code}
[2019-07-26 17:45:44,476] INFO Creating /brokers/ids/85 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-26 17:45:44,484] DEBUG Reading reply sessionid:0x7593f202705, 
packet:: clientPath:null serverPath:null finished:false header:: 91,14  
replyHeader:: 91,234840046463,0  request:: 
org.apache.zookeeper.MultiTransactionRecord@3cd2650b response:: 
org.apache.zookeeper.MultiResponse@f554 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,486] ERROR Error while creating ephemeral at 
/brokers/ids/85 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-26 17:45:44,491] ERROR [KafkaServer id=85] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
at 
kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1727)
{code}

The following is debug log from ZooKeeperClientWatcher:
{code}
[2019-07-26 17:45:43,296] DEBUG [ZooKeeperClient Kafka server] Received event: 
WatchedEvent state:SyncConnected type:None path:null 
(kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:43,449] DEBUG [ZooKeeperClient Kafka server] Received event: 
WatchedEvent state:Closed type:None path:null (kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:43,489] DEBUG [ZooKeeperClient Kafka server] Received event: 
WatchedEvent state:SyncConnected type:None path:null 
(kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:44,901] DEBUG [ZooKeeperClient Kafka server] Received event: 
WatchedEvent state:Closed type:None path:null (kafka.zookeeper.ZooKeeperClient)
{code}

The following is the log for the zookeeper session:
{code}
[2019-07-26 17:45:43,489] INFO Session establishment complete on server 
datazk007/10.1.16.191:2181, sessionid = 0x7593f202705, negotiated timeout = 
6000 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:43,492] DEBUG Reading reply sessionid:0x7593f202705, 
packet:: clientPath:/consumers serverPath:/testkafka/consumers finished:false 
header:: 1,1  replyHeader:: 1,234840045921,-110  request:: 
'/testkafka/consumers,,v{s{31,s{'world,'anyone}}},0  response::   
(org.apache.zookeeper.ClientCnxn)
...

[2019-07-26 17:45:44,484] DEBUG Reading reply sessionid:0x7593f202705, 
packet:: clientPath:null serverPath:null finished:false header:: 91,14  
replyHeader:: 91,234840046463,0  request:: 
org.apache.zookeeper.MultiTransactionRecord@3cd2650b response:: 
org.apache.zookeeper.MultiResponse@f554 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG Closing session: 0x7593f202705 
(org.apache.zookeeper.ZooKeeper)
[2019-07-26 17:45:44,800] DEBUG Closing client for session: 0x7593f202705 
(org.apache.zookeeper.ClientCnxn)
...
[2019-07-26 17:45:44,800] DEBUG Reading reply sessionid:0x7593f202705, 
packet:: clientPath:null serverPath:null finished:false header:: 92,-11  
replyHeader:: 92,234840046569,0  request:: null response:: null 
(org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG Disconnecting client for session: 
0x7593f202705 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG An exception was thrown while closing send 
thread for session 0x7593f202705 : Unable to read additional data from 
server sessionid 0x7593f202705, likely server has closed socket 
(org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,901] INFO Session: 0x7593f202705 closed 
(org.apache.zookeeper.ZooKeeper)
[2019-07-26 17:45:44,901] INFO EventThread shut down for session: 
0x7593f202705 (org.apache.zookeeper.ClientCnxn)
{code}




> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using 

[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893956#comment-16893956
 ] 

Yu Yang edited comment on KAFKA-8716 at 7/26/19 4:42 PM:
-

Thanks for checking, [~junrao].  Added more information in the description 
section. 

The "SESSIONEXPIRED" error happened immediately after the 
"CheckedEphemeral.create" call, and happened repeatedly, so the broker 
could not get started properly.


was (Author: yuyang08):
Thank for checking [~junrao].  Added more information in the description 
session. 

The  "SESSIONExpiration" exception happened immediately after the 
"CheckedEphemeral.create" call, and happened repeatedly so that the broker 
could not get started properly.

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893956#comment-16893956
 ] 

Yu Yang edited comment on KAFKA-8716 at 7/26/19 4:22 PM:
-

Thank for checking [~junrao].  Added more information in the description 
session. 

The  "SESSIONExpiration" exception happened immediately after the 
"CheckedEphemeral.create" call, and happened repeatedly so that the broker 
could not get started properly.


was (Author: yuyang08):
Thank for checking [~junrao].  Added more information in the description 
session. 

The  "SESSIONExpiration" exception happened immediately after the 
"CheckedEphemeral.create" call, and happened repeatedly so that the broker 
could not get started. 

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Summary: broker cannot join the cluster after upgrading kafka binary from 
2.1.1 to 2.2.1 or 2.3.0  (was: broker cannot join the cluster after upgrading 
kafka binary from 2.1.0 to 2.2.1 or 2.3.0)

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.0 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Summary: broker cannot join the cluster after upgrading kafka binary from 
2.1.0 to 2.2.1 or 2.3.0  (was: broker cannot join the cluster after upgrading 
kafka binary from 2.1.1 to 2.2.1 or 2.3.0)

> broker cannot join the cluster after upgrading kafka binary from 2.1.0 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Component/s: zkclient

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Description: 
We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started 
due to zookeeper session expiration exception.   This error happens repeatedly 
and the broker could not start because of this. 

Below is our zk related setting in server.properties:
{code}
zookeeper.connection.timeout.ms=6000
zookeeper.session.timeout.ms=6000
{code}

The following is the stack trace, and we are using zookeeper 3.5.3. Instead of 
waiting for a few seconds, the SESSIONEXPIRED error returned immediately in 
CheckedEphemeral.create call.  Any insights? 

[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
/brokers/ids/80 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)



  was:
We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started 
due to zookeeper session expiration exception.  

Below is our zk related setting in server.properties:
{code}
zookeeper.connection.timeout.ms=6000
zookeeper.session.timeout.ms=6000
{code}

The following is the stack trace, and we are using zookeeper 3.5.3. Instead of 
waiting for a few seconds, the SESSIONEXPIRED error returned immediately in 
CheckedEphemeral.create call.  Any insights? 

[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
/brokers/ids/80 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)




> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.   This error happens 
> repeatedly and the broker could not start because of this. 
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session 

[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893956#comment-16893956
 ] 

Yu Yang commented on KAFKA-8716:


Thank for checking [~junrao].  Added more information in the description 
session. 

The  "SESSIONExpiration" exception happened immediately after the 
"CheckedEphemeral.create" call, and happened repeatedly so that the broker 
could not get started. 

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.  
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Description: 
We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started 
due to zookeeper session expiration exception.  

Below is our zk related setting in server.properties:
{code}
zookeeper.connection.timeout.ms=6000
zookeeper.session.timeout.ms=6000
{code}

The following is the stack trace, and we are using zookeeper 3.5.3. Instead of 
waiting for a few seconds, the SESSIONEXPIRED error returned immediately in 
CheckedEphemeral.create call.  Any insights? 

[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
/brokers/ids/80 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)



  was:
We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started 
due to zookeeper session expiration exception.  

The following is the stack trace, and we are using zookeeper 3.5.3. Any 
insights? 

[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
/brokers/ids/80 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)




> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.  
> Below is our zk related setting in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace, and we are using zookeeper 3.5.3. Instead 
> of waiting for a few seconds, the SESSIONEXPIRED error returned immediately 
> in CheckedEphemeral.create call.  Any insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at 

[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-25 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Priority: Critical  (was: Major)

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Critical
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.  
> The following is the stack trace, and we are using zookeeper 3.5.3. Any 
> insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0

2019-07-25 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8716:
---
Summary: broker cannot join the cluster after upgrading kafka binary from 
2.1.1 to 2.2.1 or 2.3.0  (was: broker cannot join the cluster after upgrading 
the binary from 2.1 to 2.2.1 or 2.3.0)

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 
> 2.2.1 or 2.3.0
> 
>
> Key: KAFKA-8716
> URL: https://issues.apache.org/jira/browse/KAFKA-8716
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Yu Yang
>Priority: Major
>
> We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
> versions, the broker with updated binary (2.2.1 or 2.3.0) could not get 
> started due to zookeeper session expiration exception.  
> The following is the stack trace, and we are using zookeeper 3.5.3. Any 
> insights? 
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
> (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
> /brokers/ids/80 with return code: SESSIONEXPIRED 
> (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> at kafka.Kafka$.main(Kafka.scala:75)
> at kafka.Kafka.main(Kafka.scala)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (KAFKA-8716) broker cannot join the cluster after upgrading the binary from 2.1 to 2.2.1 or 2.3.0

2019-07-25 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-8716:
--

 Summary: broker cannot join the cluster after upgrading the binary 
from 2.1 to 2.2.1 or 2.3.0
 Key: KAFKA-8716
 URL: https://issues.apache.org/jira/browse/KAFKA-8716
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.2.1, 2.3.0
Reporter: Yu Yang


We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both 
versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started 
due to zookeeper session expiration exception.  

The following is the stack trace, and we are using zookeeper 3.5.3. Any 
insights? 

[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at 
/brokers/ids/80 with return code: SESSIONEXPIRED 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (KAFKA-8300) kafka broker did not recovery from quota limit after quota setting is removed

2019-04-26 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-8300:
--

 Summary: kafka broker did not recovery from quota limit after 
quota setting is removed
 Key: KAFKA-8300
 URL: https://issues.apache.org/jira/browse/KAFKA-8300
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.1.0
 Environment: Description:  Ubuntu 14.04.5 LTS
Release:14.04
Reporter: Yu Yang
 Attachments: Screen Shot 2019-04-26 at 4.02.03 PM.png

We applied quota management to one of our clusters. After applying the quota, 
we saw the following errors in the kafka server log, and the broker's network 
traffic did not recover even after we removed the quota settings. Any insights 
on this? 

{code}
[2019-04-26 20:59:42,359] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.57.190:59846-4925637 (kafka.network.Processor)
[2019-04-26 20:59:43,518] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.230.92:49788-4925646 (kafka.network.Processor)
[2019-04-26 20:59:44,343] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.32.233:35714-4925663 (kafka.network.Processor)
[2019-04-26 20:59:45,448] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.250:52884-4925658 (kafka.network.Processor)
[2019-04-26 20:59:45,544] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.24:41608-4925687 (kafka.network.Processor)
{code}
 

!Screen Shot 2019-04-26 at 4.02.03 PM.png|width=640px!
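For context, the quota was applied and removed through dynamic client-quota configs 
(kafka-configs.sh at the time of this report). On newer releases the same apply/remove 
cycle can be scripted through the Admin API; the sketch below is illustrative only, 
assumes a default client-id quota was the one in use, and the broker address and rate 
are made up.

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class ClientQuotaExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // illustrative
        try (Admin admin = Admin.create(props)) {
            // Entity for the default quota of every client-id (a null name means "default").
            ClientQuotaEntity allClients = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, null));

            // Apply a 10 MB/s produce quota (the value is bytes per second).
            admin.alterClientQuotas(Collections.singletonList(new ClientQuotaAlteration(
                    allClients,
                    Collections.singletonList(
                            new ClientQuotaAlteration.Op("producer_byte_rate", 10_000_000.0)))))
                 .all().get();

            // Remove it again: a null value deletes the quota override, and the change
            // should take effect dynamically, without a broker restart.
            admin.alterClientQuotas(Collections.singletonList(new ClientQuotaAlteration(
                    allClients,
                    Collections.singletonList(
                            new ClientQuotaAlteration.Op("producer_byte_rate", null)))))
                 .all().get();
        }
    }
}
{code}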



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8300) kafka broker did not recover from quota limit after quota setting is removed

2019-04-26 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-8300:
---
Summary: kafka broker did not recover from quota limit after quota setting 
is removed  (was: kafka broker did not recovery from quota limit after quota 
setting is removed)

> kafka broker did not recover from quota limit after quota setting is removed
> 
>
> Key: KAFKA-8300
> URL: https://issues.apache.org/jira/browse/KAFKA-8300
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
> Environment: Description: Ubuntu 14.04.5 LTS
> Release:  14.04
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2019-04-26 at 4.02.03 PM.png
>
>
> We applied quota management to one of our clusters. After applying quota, we 
> saw the following errors in kafka server log. And the broker's network 
> traffic did not recover, even after we removed the quota settings. Any 
> insights on this? 
> {code}
> [2019-04-26 20:59:42,359] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.57.190:59846-4925637 (kafka.network.Processor)
> [2019-04-26 20:59:43,518] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.230.92:49788-4925646 (kafka.network.Processor)
> [2019-04-26 20:59:44,343] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.32.233:35714-4925663 (kafka.network.Processor)
> [2019-04-26 20:59:45,448] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.250:52884-4925658 (kafka.network.Processor)
> [2019-04-26 20:59:45,544] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.24:41608-4925687 (kafka.network.Processor)
> {code}
>  
> !Screen Shot 2019-04-26 at 4.02.03 PM.png|width=640px!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2019-02-12 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Priority: Major  (was: Critical)

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Fix For: 1.1.2, 2.2.0, 2.0.2
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl 
> writing to kafka was fine. However, when we enabled ssl writing at a larger 
> scale (>40k clients writing concurrently), the kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size 
> to 10GB, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If a broker restart or similar event in the cluster causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7450) "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage

2019-02-07 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Description: 
After updating security.inter.broker.protocol to SSL for our cluster, we 
observed that the controller can get into almost 100% cpu usage from time to 
time. 
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}
There is no obvious error in server.log. But in controller.log, there are 
repetitive SSL handshake failure errors as below:
{code:java}
[2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 
6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: 
null) was unsuccessful (kafka.controller.RequestSendThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
at 
sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
at 
kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at 
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
at 
org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
... 10 more

{code}
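Why repeated handshake failures can translate into sustained cpu load: the 
controller's send thread re-attempts the broker connection as soon as the 
previous attempt fails. The snippet below is only an illustrative, 
self-contained sketch (it is not the actual RequestSendThread code, and 
brokerReady() is a made-up stand-in for a handshake that always fails); it 
shows how a retry loop with no backoff between attempts keeps one core busy:
{code:java}
public class RetryLoopSketch {
    // Hypothetical stand-in for a connection attempt whose SSL handshake
    // always fails with "Handshake message sequence violation".
    static boolean brokerReady() {
        return false;
    }

    public static void main(String[] args) {
        long attempts = 0;
        long deadline = System.currentTimeMillis() + 1000;
        // No sleep or backoff between attempts: the loop spins as fast as the
        // handshake can fail, which shows up as near 100% usage of one core.
        while (System.currentTimeMillis() < deadline) {
            if (!brokerReady()) {
                attempts++;
            }
        }
        System.out.println("handshake attempts in 1s: " + attempts);
    }
}
{code}
A similar burst of failed fetch requests shows up on the replica fetchers: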
{code:java}
[2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
maxBytes=4194304), the_test_topic-58=(offset=462312762, 
logStartOffset=462295078, maxBytes=4194304)}, isolationLevel=READ_UNCOMMITTED, 
toForget=, metadata=(sessionId=1991153671, epoch=INITIAL)) 
(kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:91)
at 

[jira] [Commented] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-05 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711009#comment-16711009
 ] 

Yu Yang commented on KAFKA-7704:


[~huxi_2b], [~junrao] I verified that  
https://github.com/apache/kafka/pull/5998 does fix the maxlag metric issue. 
Thanks for the quick fix!

> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Assignee: huxihx
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions for the cluster. 
> The initial analysis shows that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic named `test_topic`: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-05 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7704:
---
Attachment: Screen Shot 2018-12-05 at 10.13.09 PM.png

> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Assignee: huxihx
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png, Screen Shot 
> 2018-12-05 at 10.13.09 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions for the cluster. 
> The initial analysis shows that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic named `test_topic`: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-03 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7704:
---
Description: 
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic named `test_topic`: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 
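
The reported value matches the log-end offset of the idle partition rather than 
any real replication lag. The block below is illustrative arithmetic only (it is 
not the actual ReplicaFetcherManager code); it shows the difference between the 
lag we would expect for a fully caught-up follower and the value the metric 
reports, using the offset from the listing above:
{code:java}
public class MaxLagSketch {
    public static void main(String[] args) {
        long logEndOffset   = 99043947579L;   // latest offset of test_topic-0
        long followerOffset = logEndOffset;   // the replica is fully caught up

        long expectedLag = logEndOffset - followerOffset;  // 0, no lag
        long reportedLag = logEndOffset;                   // what the gauge shows

        System.out.println("expected MaxLag.Replica = " + expectedLag);
        System.out.println("reported MaxLag.Replica = " + reportedLag);
    }
}
{code}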



  was:
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic *test_topic*: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 




> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions. 
> The initial analysis showed that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic named `test_topic`: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-03 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7704:
---
Description: 
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions for the cluster. 

The initial analysis shows that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic named `test_topic`: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



  was:
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions for the cluster. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic named `test_topic`: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 




> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions for the cluster. 
> The initial analysis shows that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic named `test_topic`: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-03 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7704:
---
Description: 
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions for the cluster. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic named `test_topic`: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



  was:
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic named `test_topic`: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 




> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions for the cluster. 
> The initial analysis showed that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic named `test_topic`: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-03 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7704:
---
Description: 
We recently deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic *test_topic*: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



  was:
We deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic *test_topic*: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 




> kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported 
> incorrectly
> ---
>
> Key: KAFKA-7704
> URL: https://issues.apache.org/jira/browse/KAFKA-7704
> Project: Kafka
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
>
>
> We recently deployed kafka 2.1, and noticed a jump in the 
> kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
> there are no under-replicated partitions. 
> The initial analysis showed that kafka 2.1.0 does not report the metric 
> correctly for topics that have no incoming traffic right now, but had traffic 
> earlier. For those topics, ReplicaFetcherManager will consider the maxLag to 
> be the latest offset. 
> For instance, we have a topic *test_topic*: 
> {code}
> [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
> total 8
> -rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
> -rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
> -rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
> -rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
> -rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
> {code}
> kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579
>  !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly

2018-12-03 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-7704:
--

 Summary: kafka.server.ReplicaFetechManager.MaxLag.Replica metric 
is reported incorrectly
 Key: KAFKA-7704
 URL: https://issues.apache.org/jira/browse/KAFKA-7704
 Project: Kafka
  Issue Type: Bug
  Components: metrics
Affects Versions: 2.1.0
Reporter: Yu Yang
 Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png

We deployed kafka 2.1, and noticed a jump in the 
kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, 
there are no under-replicated partitions. 

The initial analysis showed that kafka 2.1.0 does not report the metric 
correctly for topics that have no incoming traffic right now, but had traffic 
earlier. For those topics, ReplicaFetcherManager will consider the maxLag to be 
the latest offset. 

For instance, we have a topic *test_topic*: 

{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka   10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka4 Dec  4 00:13 leader-epoch-checkpoint
{code}

kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579

 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! 
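
For anyone who wants to reproduce the reading, the gauge can also be queried 
directly over JMX. This is a rough sketch that assumes remote JMX is enabled on 
the broker; broker-host:9999 is a placeholder for your own JMX endpoint, not an 
address from our setup:
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MaxLagCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // MBean name of the gauge discussed in this ticket.
            ObjectName maxLag = new ObjectName(
                    "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica");
            System.out.println("MaxLag.Replica = " + conn.getAttribute(maxLag, "Value"));
        }
    }
}
{code}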





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:49 AM:
-

[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~20k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
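
For reference, the test producers connect with client-side settings along these 
lines. This is only a rough sketch: the bootstrap address, topic name, store 
paths and passwords are placeholders, not our real values.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9093");   // SSL listener port
        props.put("security.protocol", "SSL");
        props.put("ssl.keystore.location", "/path/to/client.keystore.jks");
        props.put("ssl.keystore.password", "keystore_password");
        props.put("ssl.key.password", "key_password");
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");
        props.put("ssl.truststore.password", "truststore_password");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test_topic", "key", "value"));
        }
    }
}
{code}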

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~30k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We have tried increasing the 
> heap size to 10GB, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap 

[jira] [Updated] (KAFKA-7450) "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage

2018-09-30 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Summary: "Handshake message sequence violation" related ssl handshake 
failure leads to high cpu usage  (was: kafka "Handshake message sequence 
violation" leads to high cpu usage)

> "Handshake message sequence violation" related ssl handshake failure leads to 
> high cpu usage
> 
>
> Key: KAFKA-7450
> URL: https://issues.apache.org/jira/browse/KAFKA-7450
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
>
> After updating security.inter.broker.protocol to SSL for our cluster, we 
> observed that the controller can get into almost 100% cpu usage. 
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> security.inter.broker.protocol=SSL
> {code}
> There is no obvious error in server.log. But in controller.log, there are 
> repetitive SSL handshake failure errors as below: 
> {code}
> [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] 
> Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 
> (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
> at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at 
> sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
> at 
> org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
> at 
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
> at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
> at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
> ... 10 more
> {code}
> {code}
> [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
> maxBytes=4194304), the_test_topic-58=(offset=462312762, 
> logStartOffset=462295078, maxBytes=4194304)}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:20 AM:
-

[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~30k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}

The following is the gc chart on a broker using kafka 2.0 binary with commits 
up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~30k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We have tried increasing the 
> heap size to 10GB, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-30 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:17 AM:
-

[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~30k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
 {code}
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

There is another issue that we saw with the following setting. See KAFKA-7450 
for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}


was (Author: yuyang08):
[~rsivaram] Tested with latest kafka 2.0 branch code, using d2.2x instances, 
16g max heap size for kafka process, and ~30k producers. Using 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see high cpu usage 
related issue in your case?

 
 The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writing concurrently), the kafka brokers soon hit an 
> OutOfMemory issue with a 4G heap setting. We have tried increasing the 
> heap size to 10GB, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there are broker restarts etc. in the cluster that cause 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running 

[jira] [Updated] (KAFKA-7450) kafka "Handshake message sequence violation" leads to high cpu usage

2018-09-30 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Summary: kafka "Handshake message sequence violation" leads to high cpu 
usage  (was: kafka "Handshake message sequence violation" failure )

> kafka "Handshake message sequence violation" leads to high cpu usage
> 
>
> Key: KAFKA-7450
> URL: https://issues.apache.org/jira/browse/KAFKA-7450
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
>
> After updating security.inter.broker.protocol to SSL for our cluster, we 
> observed that the controller can get into almost 100% cpu usage. 
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> security.inter.broker.protocol=SSL
> {code}
> There is no obvious error in server.log. But in controller.log, there are 
> repetitive SSL handshake failure errors as below: 
> {code}
> [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] 
> Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 
> (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
> at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at 
> sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
> at 
> org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
> at 
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
> at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
> at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
> ... 10 more
> {code}
> {code}
> [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
> maxBytes=4194304), the_test_topic-58=(offset=462312762, 
> logStartOffset=462295078, maxBytes=4194304)}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> 

[jira] [Updated] (KAFKA-7450) kafka "Handshake message sequence violation" failure

2018-09-30 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Summary: kafka "Handshake message sequence violation" failure   (was: kafka 
 RequestSendThread stuck in infinite loop after SSL handshake failure with peer 
brokers)

> kafka "Handshake message sequence violation" failure 
> -
>
> Key: KAFKA-7450
> URL: https://issues.apache.org/jira/browse/KAFKA-7450
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
>
> After updating security.inter.broker.protocol to SSL for our cluster, we 
> observed that the controller can get into almost 100% cpu usage. 
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> security.inter.broker.protocol=SSL
> {code}
> There is no obvious error in server.log. But in controller.log, there are 
> repetitive SSL handshake failure errors as below: 
> {code}
> [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] 
> Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 
> (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
> at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at 
> sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
> at 
> org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
> at 
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
> at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
> at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
> ... 10 more
> {code}
> {code}
> [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
> maxBytes=4194304), the_test_topic-58=(offset=462312762, 
> logStartOffset=462295078, maxBytes=4194304)}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> 

[jira] [Updated] (KAFKA-7450) kafka RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers

2018-09-30 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Summary: kafka  RequestSendThread stuck in infinite loop after SSL 
handshake failure with peer brokers  (was: kafka controller RequestSendThread 
stuck in infinite loop after SSL handshake failure with peer brokers)

> kafka  RequestSendThread stuck in infinite loop after SSL handshake failure 
> with peer brokers
> -
>
> Key: KAFKA-7450
> URL: https://issues.apache.org/jira/browse/KAFKA-7450
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
>
> After updating security.inter.broker.protocol to SSL for our cluster, we 
> observed that the controller can get into almost 100% cpu usage. 
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> security.inter.broker.protocol=SSL
> {code}
> There is no obvious error in server.log. But in controller.log, there are 
> repetitive SSL handshake failure errors as below: 
> {code}
> [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] 
> Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 
> (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
> at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at 
> sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
> at 
> org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
> at 
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
> at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
> at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
> at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
> ... 10 more
> {code}
> {code}
> [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
> maxBytes=4194304), the_test_topic-58=(offset=462312762, 
> logStartOffset=462295078, maxBytes=4194304)}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
> violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at 
> 

[jira] [Updated] (KAFKA-7450) kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers

2018-09-30 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7450:
---
Description: 
After updating security.inter.broker.protocol to SSL for our cluster, we 
observed that the controller can get into almost 100% cpu usage. 

{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

There is no obvious error in server.log. But in controller.log, there are 
repetitive SSL handshake failure errors as below: 

{code}
[2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 
6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: 
null) was unsuccessful (kafka.controller.RequestSendThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
at 
sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
at 
kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at 
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
at 
org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
... 10 more

{code}

{code}
[2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, 
maxBytes=4194304), the_test_topic-58=(offset=462312762, 
logStartOffset=462295078, maxBytes=4194304)}, isolationLevel=READ_UNCOMMITTED, 
toForget=, metadata=(sessionId=1991153671, epoch=INITIAL)) 
(kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:91)
at 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang edited comment on KAFKA-7304 at 9/30/18 5:52 AM:
-

[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, a 
16g max heap size for the kafka process, and ~30k producers. With a 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364. Did you see a high cpu usage 
related issue in your case?

 
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]

[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster. The cpu usage jumped to 
almost 100% after enabling TLS-based writing to the cluster. 

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500!


was (Author: yuyang08):
[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, a 
16g max heap size for the kafka process, and ~30k producers. With a 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364.  Did you see a high cpu usage 
related issue in your case? 

 
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]


 
[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster during this period of time:

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500px!


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> 

[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251
 ] 

Yu Yang commented on KAFKA-7304:


[~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, a 
16g max heap size for the kafka process, and ~30k producers. With a 16gb heap size, 
we did not see frequent gc. But at the same time, we still hit the high cpu 
usage issue that is documented in KAFKA-7364.  Did you see a high cpu usage 
related issue in your case? 

 
The following is the gc chart on a broker with kafka 2.0 changes up to 
[https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4]


 
[http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3]

!Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500!

The following is the cpu usage chart of our cluster during this period of time:

!Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500px!


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-09-29 at 10.38.38 PM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 
> PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-09-29 at 10.38.12 PM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, 
> Screen Shot 2018-09-29 at 8.34.50 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-09-29 at 8.34.50 PM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 8.34.50 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-29 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633137#comment-16633137
 ] 

Yu Yang commented on KAFKA-7304:


Thanks [~rsivaram]!  We will try these fixes and let you know the result. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7450) kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers

2018-09-28 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-7450:
--

 Summary: kafka controller RequestSendThread stuck in infinite loop 
after SSL handshake failure with peer brokers
 Key: KAFKA-7450
 URL: https://issues.apache.org/jira/browse/KAFKA-7450
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 2.0.0
Reporter: Yu Yang


After updating security.inter.broker.protocol to SSL for our cluster, we 
observed that the controller can get into almost 100% cpu usage. 

{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}

There is no obvious error in server.log. But in controller.log, there are 
repetitive SSL handshake failure errors as below: 

{code}
[2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 
6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: 
null) was unsuccessful (kafka.controller.RequestSendThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
at 
sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
at 
kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence 
violation, 2
at 
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
at 
org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
... 10 more

{code}
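
For illustration only, the following is a minimal sketch of a send loop that backs off between failed connection attempts instead of retrying immediately. The class and method names are hypothetical; this is not the actual RequestSendThread/ControllerChannelManager code, just the general pattern one would expect so that a persistent SSL handshake failure does not turn into a busy loop at ~100% cpu.

{code}
// Hypothetical sketch: back off between failed broker connection attempts
// instead of retrying in a tight loop. Names are illustrative, not Kafka's API.
import java.util.function.BooleanSupplier;

public final class BackoffConnectLoop {
    private static final long INITIAL_BACKOFF_MS = 100;
    private static final long MAX_BACKOFF_MS = 10_000;

    public static void awaitBrokerReady(BooleanSupplier brokerReady, BooleanSupplier running)
            throws InterruptedException {
        long backoffMs = INITIAL_BACKOFF_MS;
        while (running.getAsBoolean()) {
            if (brokerReady.getAsBoolean()) {
                return;                              // connection established, resume sending
            }
            Thread.sleep(backoffMs);                 // avoid burning cpu on repeated SSL failures
            backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
        }
    }
}
{code}

With exponential backoff the thread still retries indefinitely, but a broker whose handshake keeps failing no longer keeps a cpu core pinned.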



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with high concurrent ssl writing

2018-09-10 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Summary: kafka periodically run into high cpu usage with high concurrent 
ssl writing  (was: kafka periodically run into high cpu usage with high 
concurent ssl writing)

> kafka periodically run into high cpu usage with high concurrent ssl writing
> ---
>
> Key: KAFKA-7364
> URL: https://issues.apache.org/jira/browse/KAFKA-7364
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png
>
>
> While testing ssl writing to kafka, we found that kafka often runs into high 
> cpu usage due to inefficiency in the jdk ssl implementation. 
> In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka 
> 2.0.0 and jdk-10.0.2, and hosts only one topic that has ~20k producers writing 
> to it through ssl channels. We observed that the network threads often reach 100% 
> cpu usage after enabling ssl writing to kafka. To improve kafka's 
> throughput, we have "num.network.threads=32" for the broker. Even with 32 
> network threads, we see the broker cpu usage jump right after ssl writing is 
> enabled. The broker's cpu usage would drop immediately when we disabled ssl 
> writing. 
>  !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 
> When the broker's cpu usage is high, 'perf top' shows that kafka is busy 
> executing code in libsunec.so. The following is a sample stack trace that 
> we captured when the broker's cpu usage was high. This seems to be related to 
> inefficiency in the jdk ssl implementation. Switching to 
> https://github.com/netty/netty-tcnative to handle the ssl handshake can be 
> helpful. 
> {code}
> Thread 77562: (state = IN_NATIVE)
>  - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
> byte[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
> (Compiled frame)
>  - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
> (Compiled frame)
>  - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
>  - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
> java.lang.String) @bci=136, line=444 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
>  @bci=48, line=166 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
>  java.util.Collection) @bci=24, line=147 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
>  java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
>  sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
> (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
>  @bci=217, line=141 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
>  java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
>  - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
> java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
>  java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=232, line=259 (Compiled frame)
>  - 
> sun.security.validator.Validator.validate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=6, line=260 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
>  java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
> java.lang.String) @bci=10, line=324 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
> (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
>  - 
> 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with high concurent ssl writing

2018-09-10 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Summary: kafka periodically run into high cpu usage with high concurent ssl 
writing  (was: kafka periodically run into high cpu usage with ssl writing)

> kafka periodically run into high cpu usage with high concurent ssl writing
> --
>
> Key: KAFKA-7364
> URL: https://issues.apache.org/jira/browse/KAFKA-7364
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png
>
>
> While testing ssl writing to kafka, we found that kafka often runs into high 
> cpu usage due to inefficiency in the jdk ssl implementation. 
> In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka 
> 2.0.0 and jdk-10.0.2, and hosts only one topic that has ~20k producers writing 
> to it through ssl channels. We observed that the network threads often reach 100% 
> cpu usage after enabling ssl writing to kafka. To improve kafka's 
> throughput, we have "num.network.threads=32" for the broker. Even with 32 
> network threads, we see the broker cpu usage jump right after ssl writing is 
> enabled. The broker's cpu usage would drop immediately when we disabled ssl 
> writing. 
>  !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 
> When the broker's cpu usage is high, 'perf top' shows that kafka is busy 
> executing code in libsunec.so. The following is a sample stack trace that 
> we captured when the broker's cpu usage was high. This seems to be related to 
> inefficiency in the jdk ssl implementation. Switching to 
> https://github.com/netty/netty-tcnative to handle the ssl handshake can be 
> helpful. 
> {code}
> Thread 77562: (state = IN_NATIVE)
>  - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
> byte[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
> (Compiled frame)
>  - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
> (Compiled frame)
>  - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
>  - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
> java.lang.String) @bci=136, line=444 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
>  @bci=48, line=166 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
>  java.util.Collection) @bci=24, line=147 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
>  java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
>  sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
> (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
>  @bci=217, line=141 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
>  java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
>  - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
> java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
>  java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=232, line=259 (Compiled frame)
>  - 
> sun.security.validator.Validator.validate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=6, line=260 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
>  java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
> java.lang.String) @bci=10, line=324 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
> (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
>  - 
> 

[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-09-03 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602415#comment-16602415
 ] 

Yu Yang commented on KAFKA-7304:


[~rsivaram]  Thanks for looking into the issue!   We are still evaluating 
whether Ted's patch makes a difference. I am testing ssl writing at a smaller 
scale now. gceasy reports that some brokers running jdk 10.0.2 + kafka 2.0 
with Ted's patch have memory leakage: 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMy8tLWdjLmxvZy5nei0tMTktMS01.
  Meanwhile, brokers running with jdk 1.8u172 seem fine:  
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMy8tLWdjLmxvZy4xLmN1cnJlbnQuZ3otLTE5LTktMjc=
 . 

We used the default value (10 minutes) for `connections.max.idle.ms`.  I also 
tried to set `connections.max.idle.ms` to 1 minute and 30 seconds. Setting a 
shorter connections.max.idle.ms did not help. 

When we did experiments with broker restarts, all brokers that were not 
restarted were up for longer than `connections.max.idle.ms`. The heap memory 
usage for those brokers did not drop. 

The failed authentications should not be expected. It is not clear to me how 
that happened. 
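
As a side note, the following is a minimal, hypothetical sketch of the kind of idle-connection sweep that a setting like connections.max.idle.ms relies on: each expired entry has to be closed and removed from the channel map before it becomes eligible for GC. This is not the actual Selector implementation, only an illustration of the cleanup step that appears to be missed when heap memory is retained.

{code}
import java.io.IOException;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Map;

// Illustrative idle-connection sweep (not Kafka's Selector code): expired channels
// are closed and removed from the tracking map so their buffers can be collected.
final class IdleChannelReaper {
    static void expireIdle(Map<String, SocketChannel> channels,
                           Map<String, Long> lastActivityMs,
                           long nowMs,
                           long maxIdleMs) throws IOException {
        Iterator<Map.Entry<String, SocketChannel>> it = channels.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, SocketChannel> entry = it.next();
            Long last = lastActivityMs.get(entry.getKey());
            if (last == null || nowMs - last > maxIdleMs) {
                entry.getValue().close();   // release the socket and any associated SSL buffers
                it.remove();                // drop the map reference so the channel can be GC'd
                lastActivityMs.remove(entry.getKey());
            }
        }
    }
}
{code}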
 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to kafka partition 
> leader changes. If there is a broker restart etc. in the cluster that causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that kafka often runs into high cpu 
usage due to inefficiency in the jdk ssl implementation. 

In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka 
2.0.0 and jdk-10.0.2, and hosts only one topic that has ~20k producers writing to 
it through ssl channels. We observed that the network threads often reach 100% cpu 
usage after enabling ssl writing to kafka. To improve kafka's throughput, we 
have "num.network.threads=32" for the broker. Even with 32 network threads, we 
see the broker cpu usage jump right after ssl writing is enabled. The broker's 
cpu usage would drop immediately when we disabled ssl writing. 

 !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy 
executing code in libsunec.so. The following is a sample stack trace that we 
captured when the broker's cpu usage was high. This seems to be related to 
inefficiency in the jdk ssl implementation. Switching to 
https://github.com/netty/netty-tcnative to handle the ssl handshake can be helpful. 

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that kafka often runs into high cpu 
usage due to inefficiency in the jdk ssl implementation. 

In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka 
2.0.0 and jdk-10.0.2, and hosts only one topic that has ~20k producers writing to 
it through ssl channels. We observed that the network threads often reach 100% cpu 
usage after enabling ssl writing to kafka. To improve kafka's throughput, we 
have "num.network.threads=32" for the broker. Even with 32 network threads, we 
see the broker cpu usage jump right after ssl writing is enabled. The broker's 
cpu usage would drop immediately when we disabled ssl writing. 

 !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy 
executing code in libsunec.so. The following is a sample stack trace that we 
captured when the broker's cpu usage was high. This seems to be related to 
inefficiency in the jdk ssl implementation. Switching to 
https://github.com/netty/netty-tcnative to handle the ssl handshake might be 
helpful. 

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - 

[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-31 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598306#comment-16598306
 ] 

Yu Yang commented on KAFKA-7304:


We experimented with 1.8.0_171 and did not see obvious improvements in
reducing ssl-related memory usage. We also experimented with jdk 10.0.2.
Currently we see two issues with ssl writing to kafka: 1) there is some
potential resource leakage in kafka; the leakage might have already been fixed
with [~yuzhih...@gmail.com]'s patch. 2) kafka may get high cpu usage when a
large number of clients write to kafka through ssl channels; see
https://issues.apache.org/jira/browse/KAFKA-7364 for details. This seems to be
related to inefficiency in the jdk ssl implementation. Switching to
https://github.com/netty/netty-tcnative to handle the ssl handshake might be
helpful.
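
For illustration only, a minimal sketch of what delegating the ssl handshake
to OpenSSL via netty-tcnative could look like, using netty's SslContextBuilder
with the OPENSSL provider. The file paths are placeholders, client certificate
authentication is assumed (as in our setup), and kafka does not wire this in
out of the box; the sketch only shows where a native engine would replace the
jdk SSLEngine.

{code}
import io.netty.buffer.ByteBufAllocator;
import io.netty.handler.ssl.ClientAuth;
import io.netty.handler.ssl.OpenSsl;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

import javax.net.ssl.SSLEngine;
import java.io.File;

public class OpenSslServerEngineSketch {

    /** Builds an OpenSSL-backed SSLEngine; the key/cert paths are placeholders. */
    public static SSLEngine newServerEngine() throws Exception {
        // Requires a netty-tcnative artifact (e.g. netty-tcnative-boringssl-static) on the classpath.
        if (!OpenSsl.isAvailable()) {
            throw new IllegalStateException("netty-tcnative / OpenSSL is not available");
        }
        SslContext ctx = SslContextBuilder
                .forServer(new File("/etc/kafka/ssl/broker.crt"), new File("/etc/kafka/ssl/broker.key"))
                .trustManager(new File("/etc/kafka/ssl/ca.crt"))
                .clientAuth(ClientAuth.REQUIRE)      // client certificates required, as in our setup
                .sslProvider(SslProvider.OPENSSL)    // native engine instead of the jdk SSLEngine
                .build();
        // The returned engine runs the handshake (including ECDSA verification) in native code.
        return ctx.newEngine(ByteBufAllocator.DEFAULT);
    }
}
{code}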

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl
> writing to kafka was fine. However, when we enabled ssl writing at a larger
> scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size
> to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced
> through org.apache.kafka.common.network.Selector objects. There are two
> channel map fields in Selector, and it seems that the objects are not removed
> from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that the brokers often run into
high cpu usage due to inefficiency in the jdk ssl implementation.

In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka
2.0.0 and jdk-10.0.2, and hosts only one topic that ~20k producers write to
through ssl channels. We observed that the network threads often hit 100% cpu
usage after enabling ssl writing to kafka. To improve kafka's throughput, we
set "num.network.threads=32" on the brokers. Even with 32 network threads, the
broker cpu usage jumps right after ssl writing is enabled, and drops
immediately when we disable ssl writing.

 !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy
executing code in libsunec.so. The following is a sample stack trace that we
captured while the broker's cpu usage was high.

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) 
@bci=88, line=473 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, 
line=331 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that the brokers often run into
high cpu usage due to inefficiency in the jdk ssl implementation.

In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka
2.0.0 and jdk-10.0.2, and hosts only one topic that ~20k producers write to
through ssl channels. We observed that the network threads often hit 100% cpu
usage after enabling ssl writing to kafka. To improve kafka's throughput, we
set "num.network.threads=32" on the brokers. Even with 32 network threads, the
broker cpu usage jumps right after ssl writing is enabled, and drops
immediately when we disable ssl writing.

 !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy
executing code in libsunec.so. The following is a sample stack trace that we
captured while the broker's cpu usage was high.

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) 
@bci=88, line=473 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, 
line=331 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that the brokers often run into
high cpu usage due to inefficiency in the jdk ssl implementation.

In detail, we use a test cluster that has 12 d2.8xlarge instances and
jdk-10.0.2, and hosts only one topic that ~20k producers write to through ssl
channels. We observed that the network threads often hit 100% cpu usage after
enabling ssl writing to kafka. To improve kafka's throughput, we set
"num.network.threads=32" on the brokers. Even with 32 network threads, the
broker cpu usage jumps right after ssl writing is enabled, and drops
immediately when we disable ssl writing.

 !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy
executing code in libsunec.so. The following is a sample stack trace that we
captured while the broker's cpu usage was high.

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) 
@bci=88, line=473 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, 
line=331 (Compiled frame)
 - 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Description: 
While testing ssl writing to kafka, we found that the brokers often run into
high cpu usage due to inefficiency in the jdk ssl implementation.

In detail, we use a test cluster that has 12 d2.8xlarge instances and
jdk-10.0.2, and hosts only one topic that ~20k producers write to through ssl
channels. We observed that the network threads often hit 100% cpu usage after
enabling ssl writing to kafka. To improve kafka's throughput, we set
"num.network.threads=32" on the brokers. Even with 32 network threads, the
broker cpu usage jumps right after ssl writing is enabled, and drops
immediately when we disable ssl writing.

 !Screen Shot 2018-08-30 at 10.57.32 PM.png! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy
executing code in libsunec.so. The following is a sample stack trace that we
captured while the broker's cpu usage was high.

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) 
@bci=88, line=473 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, 
line=331 (Compiled frame)
 - 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Attachment: Screen Shot 2018-08-30 at 10.57.32 PM.png

> kafka periodically run into high cpu usage with ssl writing
> ---
>
> Key: KAFKA-7364
> URL: https://issues.apache.org/jira/browse/KAFKA-7364
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png
>
>
> While testing ssl writing to kafka, we found that the brokers often run into
> high cpu usage due to inefficiency in the jdk ssl implementation.
> In detail, we use a test cluster that has 12 d2.8xlarge instances and
> jdk-10.0.2, and hosts only one topic that ~20k producers write to through ssl
> channels. We observed that the network threads often hit 100% cpu usage after
> enabling ssl writing to kafka. To improve kafka's throughput, we set
> "num.network.threads=32" on the brokers. Even with 32 network threads, we see
> the broker cpu usage jump right after ssl writing is enabled.
>  !Screen Shot 2018-08-30 at 10.57.32 PM.png! 
> When the broker's cpu usage is high, 'perf top' shows that kafka is busy
> executing code in libsunec.so. The following is a sample stack trace that we
> captured while the broker's cpu usage was high.
> {code}
> Thread 77562: (state = IN_NATIVE)
>  - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
> byte[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
> (Compiled frame)
>  - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
> (Compiled frame)
>  - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
>  - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
> java.lang.String) @bci=136, line=444 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
>  @bci=48, line=166 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
>  java.util.Collection) @bci=24, line=147 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
>  java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
>  sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
> (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
>  @bci=217, line=141 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
>  java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
>  - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
> java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
>  java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=232, line=259 (Compiled frame)
>  - 
> sun.security.validator.Validator.validate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=6, line=260 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
>  java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
> java.lang.String) @bci=10, line=324 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
> (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
>  - 
> sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
>  @bci=190, line=1966 (Compiled frame)
>  - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
> line=237 (Compiled frame)
>  - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled 
> frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled 

[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7364:
---
Attachment: (was: Screen Shot 2018-08-30 at 10.57.32 PM.png)

> kafka periodically run into high cpu usage with ssl writing
> ---
>
> Key: KAFKA-7364
> URL: https://issues.apache.org/jira/browse/KAFKA-7364
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Yu Yang
>Priority: Major
>
> While testing ssl writing to kafka, we found that the brokers often run into
> high cpu usage due to inefficiency in the jdk ssl implementation.
> In detail, we use a test cluster that has 12 d2.8xlarge instances and
> jdk-10.0.2, and hosts only one topic that ~20k producers write to through ssl
> channels. We observed that the network threads often hit 100% cpu usage after
> enabling ssl writing to kafka. To improve kafka's throughput, we set
> "num.network.threads=32" on the brokers. Even with 32 network threads, we see
> the broker cpu usage jump right after ssl writing is enabled.
>  !Screen Shot 2018-08-30 at 10.57.32 PM.png! 
> When the broker's cpu usage is high, 'perf top' shows that kafka is busy
> executing code in libsunec.so. The following is a sample stack trace that we
> captured while the broker's cpu usage was high.
> {code}
> Thread 77562: (state = IN_NATIVE)
>  - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
> byte[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
> (Compiled frame)
>  - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
> (Compiled frame)
>  - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
>  - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
> java.lang.String) @bci=136, line=444 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
>  @bci=48, line=166 (Compiled frame)
>  - 
> sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
>  java.util.Collection) @bci=24, line=147 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
>  java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
>  sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
> (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
>  @bci=217, line=141 (Compiled frame)
>  - 
> sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
>  java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
>  - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
> java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
>  java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
>  - 
> sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=232, line=259 (Compiled frame)
>  - 
> sun.security.validator.Validator.validate(java.security.cert.X509Certificate[],
>  java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
> @bci=6, line=260 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
>  java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
> java.lang.String) @bci=10, line=324 (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
> (Compiled frame)
>  - 
> sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
>  java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
>  - 
> sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
>  @bci=190, line=1966 (Compiled frame)
>  - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
> line=237 (Compiled frame)
>  - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled 
> frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
>  - 
> 

[jira] [Created] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing

2018-08-31 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-7364:
--

 Summary: kafka periodically run into high cpu usage with ssl 
writing
 Key: KAFKA-7364
 URL: https://issues.apache.org/jira/browse/KAFKA-7364
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Yu Yang


While testing ssl writing to kafka, we found that the brokers often run into
high cpu usage due to inefficiency in the jdk ssl implementation.

In detail, we use a test cluster that has 12 d2.8xlarge instances and
jdk-10.0.2, and hosts only one topic that ~20k producers write to through ssl
channels. We observed that the network threads often hit 100% cpu usage after
enabling ssl writing to kafka. To improve kafka's throughput, we set
"num.network.threads=32" on the brokers. Even with 32 network threads, we see
the broker cpu usage jump right after ssl writing is enabled.

 !Screen Shot 2018-08-30 at 10.57.32 PM.png! 

When the broker's cpu usage is high, 'perf top' shows that kafka is busy
executing code in libsunec.so. The following is a sample stack trace that we
captured while the broker's cpu usage was high.

{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], 
byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 
(Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 
(Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, 
java.lang.String) @bci=136, line=444 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate)
 @bci=48, line=166 (Compiled frame)
 - 
sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate,
 java.util.Collection) @bci=24, line=147 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath,
 java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor,
 sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 
(Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams)
 @bci=217, line=141 (Compiled frame)
 - 
sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath,
 java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, 
java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[],
 java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - 
sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[],
 java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=232, line=259 (Compiled frame)
 - 
sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) 
@bci=6, line=260 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator,
 java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, 
java.lang.String) @bci=10, line=324 (Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 
(Compiled frame)
 - 
sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[],
 java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - 
sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg)
 @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, 
line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - 
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
 java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled 
frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() 
@bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) 
@bci=88, line=473 (Compiled frame)
 - 

[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-29 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596663#comment-16596663
 ] 

Yu Yang commented on KAFKA-7304:


[~yuzhih...@gmail.com] I applied 
https://issues.apache.org/jira/secure/attachment/12937151/7304.v7.txt to our
test cluster and did more experiments yesterday. We did not observe the channel
closing/removal log messages that were added in the patch. I took another
memory dump on a test host. This time the memory analyzer reports suspected
memory leakage in `sun.security.ssl.SSLSessionImpl`.
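
For reference, a minimal sketch (the output path is a placeholder) of capturing
a live-object heap dump from inside the broker jvm via the HotSpot diagnostic
MXBean; `jmap -dump:live,format=b,file=... <pid>` produces the same kind of
dump for the memory analyzer.

{code}
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpSketch {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // 'true' dumps only live (reachable) objects, which is what the analyzer needs.
        diag.dumpHeap("/tmp/kafka-broker-heap.hprof", true);
    }
}
{code}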



> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl
> writing to kafka was fine. However, when we enabled ssl writing at a larger
> scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size
> to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced
> through org.apache.kafka.common.network.Selector objects. There are two
> channel map fields in Selector, and it seems that the objects are not removed
> from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-29 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-29 at 10.49.03 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl
> writing to kafka was fine. However, when we enabled ssl writing at a larger
> scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size
> to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced
> through org.apache.kafka.common.network.Selector objects. There are two
> channel map fields in Selector, and it seems that the objects are not removed
> from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-29 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-29 at 10.50.47 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl
> writing to kafka was fine. However, when we enabled ssl writing at a larger
> scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size
> to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced
> through org.apache.kafka.common.network.Selector objects. There are two
> channel map fields in Selector, and it seems that the objects are not removed
> from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-28 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-28 at 11.09.45 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 
> 11.09.45 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl
> writing to kafka was fine. However, when we enabled ssl writing at a larger
> scale (>40k clients writing concurrently), the kafka brokers soon hit an
> OutOfMemory issue with a 4G heap setting. We tried increasing the heap size
> to 10Gb, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced
> through org.apache.kafka.common.network.Selector objects. There are two
> channel map fields in Selector, and it seems that the objects are not removed
> from these maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If a broker restart etc. in the cluster causes partition
> leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-28 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:28 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graphs show the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS03LTYtMzU=

Sometimes the cluster can be recovered by turning off the ssl write traffic to 
the cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. At other times the cluster still could not 
recover fully, because heap usage and cpu usage climbed sharply again as soon as 
we turned the ssl write traffic back on. 
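
On the idle-connection angle, the amount of time an idle connection is kept open 
is governed by the broker setting connections.max.idle.ms (10 minutes by 
default), so lowering it should reap idle ssl channels sooner. A minimal 
server.properties sketch, with an illustrative value rather than anything we 
have validated on this test cluster: 

{code}
# server.properties (illustrative value, not the setting used in the test cluster)
# Close connections that have been idle for 2 minutes instead of the default 10 minutes,
# so idle ssl channels and the buffers they hold are released sooner.
connections.max.idle.ms=120000
{code}

Whether this actually helps depends on how quickly the ~40k clients reconnect 
after being reaped. 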


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic 
to the cluster, letting the broker to garbage collect the objects in the old 
gen, and resuming the ssl writing traffic.  Sometimes the cluster still  could 
not recover fully due to dramatic increase of heap size and high cpu usage when 
we turned on the ssl writing traffic again. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-28 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:26 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graphs show the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

Sometimes the cluster can be recovered by turning off the ssl write traffic to 
the cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. At other times the cluster still could not 
recover fully, because heap usage and cpu usage climbed sharply again as soon as 
we turned the ssl write traffic back on. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker to garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 5:43 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

The cluster can be recovered by turning off the ssl write traffic to the 
cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0xLTAtNDc=

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker to garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/28/18 1:01 AM:
-

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0xLTAtNDc=

The cluster can be recovered by turning off the ssl write traffic to the 
cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMy0xMC02

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker to garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/27/18 11:10 PM:
--

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMy0xMC02

The cluster can be recovered by turning off the ssl write traffic to the 
cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered by turning off the ssl writing traffic to the 
cluster, letting the broker to garbage collect the objects in the old gen, and 
resuming the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/27/18 11:08 PM:
--

After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered by turning off the ssl write traffic to the 
cluster, letting the brokers garbage collect the objects in the old gen, and 
then resuming the ssl write traffic. 


was (Author: yuyang08):
After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed on time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 
32gb for kafka process heap space, and have ~40k clients writes to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
in the past 24 hours or so. The cluster ran fine with low heap usage and low 
cpu usage.  However, the heap usage and the cpu usage of brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, there was no topic partitions allocated on those terminated 
nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered after we turned off the ssl writing traffic to the 
cluster, let the broker to garbage collect the objects in the old gen, and 
resume the ssl writing traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> 

[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-27 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320
 ] 

Yu Yang commented on KAFKA-7304:


After more experiments, we currently think that the issue is caused by too many 
idle ssl connections that are not closed in time. 

I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32gb 
of heap space for the kafka process, and had ~40k clients writing to a test topic 
on this cluster. The following graph shows the jvm heap usage and gc activity 
over roughly the past 24 hours. The cluster ran fine with low heap usage and low 
cpu usage. However, the heap usage and cpu usage of the brokers increased 
sharply when we added or terminated brokers in this cluster (for broker 
termination, no topic partitions were allocated on the terminated nodes).  

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster can be recovered after we turn off the ssl write traffic to the 
cluster, let the brokers garbage collect the objects in the old gen, and then 
resume the ssl write traffic. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-25 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592762#comment-16592762
 ] 

Yu Yang commented on KAFKA-7304:


Thanks for looking into this, [~yuzhih...@gmail.com]!  We also did more 
experiments on our side with various settings. Based on the initial experiments, 
it seems that your earlier patch has fixed the resource leakage in the closing 
channels. Still, with >40k concurrent ssl connections the brokers may sometimes 
run into an OOM issue because connections are not closed in time. I am currently 
experimenting with an increased heap size and will report back to the thread if 
we have any findings. 
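
As a rough back-of-envelope check (the per-connection figure is an assumption, 
not a measurement): if each ssl channel holds on the order of ~50 KB of network 
and application buffers, then 40,000 concurrent connections account for roughly 
40,000 x 50 KB, about 2 GB of buffer memory alone, before counting the 
per-channel objects held in the channels and closingChannels maps. That would 
explain why a 4 GB heap falls over quickly and why even a much larger heap is 
tight once closed channels linger. 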

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-20 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Affects Version/s: 1.1.1

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, 
> Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 
> PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 
> 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 
> 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients write concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to kafka through ssl. At small scale, ssl writes 
to kafka were fine. However, when we enabled ssl writes at a larger scale (>40k 
clients writing concurrently), the kafka brokers soon hit an OutOfMemory issue 
with a 4G heap setting. We tried increasing the heap size to 10Gb, but 
encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector objects.  There are two 
channel map fields in Selector. It seems that the objects are somehow not 
removed from these maps in a timely manner. 

One observation is that the memory leak seems related to kafka partition leader 
changes. If there is a broker restart or similar event in the cluster that 
causes partition leadership changes, the brokers may hit the OOM issue faster. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}
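
To make the bookkeeping concrete, below is a deliberately simplified, 
hypothetical sketch of the pattern those two maps implement (it is not Kafka's 
actual Selector code): a channel being closed is parked in a second map until 
its in-flight work drains, so if that drain never happens, the entry and 
everything it references stay reachable. 

{code}
import java.util.HashMap;
import java.util.Map;

// Simplified illustration only -- NOT Kafka's Selector implementation.
// "Channel" is a stand-in for a per-connection object that holds ssl buffers etc.
class ChannelRegistry {
    static final class Channel {
        final String id;
        int pendingReceives;                         // work that must drain before release
        final byte[] buffers = new byte[48 * 1024];  // stand-in for per-connection ssl buffers

        Channel(String id, int pendingReceives) {
            this.id = id;
            this.pendingReceives = pendingReceives;
        }
    }

    private final Map<String, Channel> channels = new HashMap<>();        // live connections
    private final Map<String, Channel> closingChannels = new HashMap<>(); // closing, not yet released

    void register(Channel c) {
        channels.put(c.id, c);
    }

    // Closing moves the channel to closingChannels instead of dropping it,
    // so its buffers stay reachable until the pending work is processed.
    void beginClose(String id) {
        Channel c = channels.remove(id);
        if (c != null && c.pendingReceives > 0) {
            closingChannels.put(id, c);
        }
        // if pendingReceives == 0 the channel is simply dropped and becomes collectible
    }

    // Only when the pending work drains does the entry (and its buffers) become garbage.
    void drainOne(String id) {
        Channel c = closingChannels.get(id);
        if (c != null && --c.pendingReceives == 0) {
            closingChannels.remove(id);
        }
    }
}
{code}

Under mass reconnects or leadership changes many channels can land in the 
closing map at once; if their pending work is never drained, heap usage grows in 
the way the attached heap dumps suggest. 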

Please see the attached images and the following link for a sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writing to kafka through ssl. Testing at small scale, 
ssl writing to kafka was fine. However, when we enabled ssl writing at a larger 
scale (>40k clients write concurrently), the kafka brokers soon hit OutOfMemory 
issue with 4G memory setting. We have tried with increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps , and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector object.  There are two Channel 
maps field in Selector. It seems that somehow the objects is not deleted from 
the map in a timely manner. 

One observation is that the memory leak seems relate to kafka partition leader 
changes. If there is broker restart etc. in the cluster that caused partition 
leadership change, the brokers may hit the OOM issue faster. 

{code}
private final Map channels;
private final Map closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and has applied a TLS patch on reducing 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writing to kafka through ssl. Testing at small scale, 
ssl writing to kafka was fine. However, when we enabled ssl writing at a larger 
scale (>40k clients write concurrently), the kafka brokers soon hit OutOfMemory 
issue with 4G memory setting. We have tried with increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps , and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector object.  There are two Channel 
maps field in Selector. It seems that somehow the objects is not deleted from 
the map in a timely manner. 

One observation is that the memory leak seems relate to kafka partition leader 
changes. If there is broker restart etc. in the cluster that caused partition 
leadership change, the brokers may hit the OOM issue faster. 

{code}
private final Map channels;
private final Map closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and has applied a TLS patch on reducing 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writing to kafka through ssl. Testing at small scale, 
ssl writing to kafka was fine. However, when we enabled ssl writing at a larger 
scale (>40k clients writes concurrently), the kafka brokers soon hit 
OutOfMemory issue with 4G memory setting. We have tried with increasing the 
heap size to 10Gb, but encountered the same issue. 

We took a few heap dumps , and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector object.  There are two Channel 
maps field in Selector. It seems that somehow the objects is not deleted from 
the map in a timely manner. 

One observation is that the memory leak seems relate to kafka partition leader 
changes. If there is broker restart etc. in the cluster that caused partition 
leadership change, the brokers may hit the OOM issue faster. 

{code}
private final Map channels;
private final Map closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and has applied a TLS patch on reducing 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writing to kafka through ssl. Testing at small scale, 
ssl writing to kafka was fine. However, when we enabled ssl writing at a larger 
scale (>40k clients writes concurrently), the kafka brokers soon hit 
OutOfMemory issue with 4G memory setting. We have tried with increasing the 
heap size to 10Gb, but encountered the same issue. 

We took a few heap dumps , and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector object.  There are two Channel 
maps field in Selector. It seems that somehow the objects is not deleted from 
the map in a timely manner. 

One observation is that the memory leak seems relate to kafka partition leader 
changes. If there is broker restart etc. in the cluster that caused partition 
leadership change, the brokers may hit the OOM issue faster. 

{code}
private final Map channels;
private final Map closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and has applied a TLS patch on reducing 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writing to kafka through ssl. Testing at small scale, 
ssl writing to kafka was fine. However, when we enabled ssl writing at a larger 
scale (>40k clients writes concurrently), the kafka brokers soon hit 
OutOfMemory issue with 4G memory setting. We have tried with increasing the 
heap size to 10Gb, but encountered the same issue. 

We took a few heap dump , and found that most of the heap memory is referenced 
through org.apache.kafka.common.network.Selector object.  There are two Channel 
maps field in Selector. It seems that somehow the objects is not deleted from 
the map in a timely manner. 

One observation is that the memory leak seems relate to kafka partition leader 
changes. If there is broker restart etc. in the cluster that caused partition 
leadership change, the brokers may hit the OOM issue faster. 

{code}
private final Map channels;
private final Map closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use java 1.8.0_102, and has applied a TLS patch on reducing 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: 

[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584156#comment-16584156
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/17/18 4:51 PM:
-

[~ijuma] We have an internal build that cherry-picks the 1.1.1 changes.  I might 
have missed some fixes. 

https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java
 shows that there were only two Selector.java-related changes after the 1.1.0 
release date of March 23rd. 

 Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ? 
Kafka 1.1.0 already includes that change. 
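
For anyone double-checking which Selector.java changes an internal cherry-picked 
build might be missing, something like the following should list them (a sketch; 
it assumes a local clone of apache/kafka with the 1.1.0 release tag and the 1.1 
branch fetched): 

{code}
# list Selector.java commits that are on the 1.1 branch but not in the 1.1.0 release tag
git log --oneline 1.1.0..origin/1.1 -- \
  clients/src/main/java/org/apache/kafka/common/network/Selector.java
{code}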


was (Author: yuyang08):
[~ijuma] We have an internal build that cherry-picks 1.1.1 changes.  I might 
miss some fixes. 

https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java
 shows that there were only two Selector.java related changes after 1.1.0 
release date that was March 23rd. 

 Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ?  We 
have included that change. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584156#comment-16584156
 ] 

Yu Yang commented on KAFKA-7304:


[~ijuma] We have an internal build that cherry-picks 1.1.1 changes.  I might 
have missed some fixes. 

https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java
 shows that there were only two Selector.java-related changes after the 1.1.0 
release date of March 23rd. 

 Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ? 
We have already included that change. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Affects Version/s: (was: 1.1.1)

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> One observation is that the memory leak seems relate to kafka partition 
> leader changes. If there is broker restart etc. in the cluster that caused 
> partition leadership change, the brokers may hit the OOM issue faster. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

One observation is that the memory leak seems related to Kafka partition leader 
changes. If a broker restart or a similar event in the cluster causes partition 
leadership changes, the brokers may hit the OOM issue faster. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}
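
To make the retention pattern concrete, below is a minimal, self-contained Java 
sketch, not the broker's actual Selector code: the FakeChannel class, the 
pending-receive queue, and the per-poll drain budget are assumptions made purely 
for illustration of how entries staged in a closingChannels-style map can pile 
up when clients disconnect faster than the poll loop drains their buffered work. 

{code}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

public class SelectorLeakSketch {
    // Hypothetical stand-in for a channel that still has buffered, unprocessed receives.
    static class FakeChannel {
        final Queue<String> pendingReceives = new ArrayDeque<>();
        FakeChannel(int pending) {
            for (int i = 0; i < pending; i++) pendingReceives.add("receive-" + i);
        }
    }

    // Mirrors the two maps named above (the simplified value type is an assumption).
    private final Map<String, FakeChannel> channels = new HashMap<>();
    private final Map<String, FakeChannel> closingChannels = new HashMap<>();

    void connect(String id, int pendingReceives) {
        channels.put(id, new FakeChannel(pendingReceives));
    }

    // Closing only stages the channel: it stays referenced until its buffered
    // receives have been fully processed.
    void close(String id) {
        FakeChannel ch = channels.remove(id);
        if (ch != null && !ch.pendingReceives.isEmpty())
            closingChannels.put(id, ch);
    }

    // Each poll has a small processing budget; when closes are staged faster
    // than the budget drains them, closingChannels keeps growing.
    void poll(int budget) {
        Iterator<Map.Entry<String, FakeChannel>> it = closingChannels.entrySet().iterator();
        while (budget > 0 && it.hasNext()) {
            Map.Entry<String, FakeChannel> e = it.next();
            e.getValue().pendingReceives.poll();
            budget--;
            if (e.getValue().pendingReceives.isEmpty())
                it.remove();
        }
    }

    public static void main(String[] args) {
        SelectorLeakSketch selector = new SelectorLeakSketch();
        // Many short-lived SSL clients reconnecting (e.g. after a leadership change):
        // each connection leaves a few buffered receives behind when it is closed.
        for (int round = 0; round < 10_000; round++) {
            String id = "client-" + round;
            selector.connect(id, 5);
            selector.close(id);
            selector.poll(1);
        }
        System.out.println("channels still referenced from closingChannels: "
                + selector.closingChannels.size());
    }
}
{code}

In this toy model most of the 10,000 simulated clients remain reachable through 
closingChannels, which matches the shape of retention the heap dumps point at; 
the actual broker-side mechanics may of course differ. 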

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

One observation is that the memory leak seems related to Kafka partition leader 
changes. If a broker restart or a similar event in the cluster causes partition 
leadership changes, the brokers may hit the OOM issue faster. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

One observation is that the memory leak seems related to Kafka partition leader 
changes. If a broker restart or a similar event in the cluster causes partition 
leadership changes, the brokers may hit the OOM issue faster. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-17 at 1.05.30 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 
> 1.05.30 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-17 at 1.04.32 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Attachment: Screen Shot 2018-08-17 at 1.03.35 AM.png

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 
> AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565
 ] 

Yu Yang commented on KAFKA-7304:


[~yuzhih...@gmail.com]  There were no exceptions in server.log before we hit 
frequent full GCs. There were various errors in the log after the broker ran 
into full GC, but I think those exceptions are not relevant to the root cause. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565
 ] 

Yu Yang edited comment on KAFKA-7304 at 8/17/18 8:01 AM:
-

[~yuzhih...@gmail.com]  There were no exceptions in server.log before the broker 
hit frequent full GCs. There were various errors in the log after the broker ran 
into full GC, but I think those exceptions are not relevant to the root cause. 


was (Author: yuyang08):
[~yuzhih...@gmail.com]  There were no exceptions in server.log before we hit 
frequent full GCs. There were various errors in the log after the broker ran 
into full GC, but I think those exceptions are not relevant to the root cause. 

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and has applied a TLS patch on reducing 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
X509Factory.certCache map size from 750 to 20. 

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}

  was:
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


the command line for running kafka: 
{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
-Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
-XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
-XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
-Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/*  kafka.Kafka 
/etc/kafka/server.properties
{code}

  was:
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, but encountered the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, but encountered the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> 

[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7304:
---
Description: 
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger 
scale (>40k clients writing concurrently), the Kafka brokers soon hit an 
OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 
10Gb, and hit the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0

  was:
We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enable SSL writes at scale 
(>40k clients writing concurrently), the Kafka brokers soon hit an OutOfMemory 
issue with a 4G heap setting. We tried increasing the heap size to 10Gb, and 
hit the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0


> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Yu Yang
>Priority: Major
> Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
> 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
> Screen Shot 2018-08-16 at 4.26.19 PM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale, 
> ssl writing to kafka was fine. However, when we enabled ssl writing at a 
> larger scale (>40k clients writes concurrently), the kafka brokers soon hit 
> OutOfMemory issue with 4G memory setting. We have tried with increasing the 
> heap size to 10Gb, and hit the same issue. 
> We took a few heap dump , and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector object.  There 
> are two Channel maps field in Selector. It seems that somehow the objects is 
> not deleted from the map in a timely manner. 
> {code}
> private final Map channels;
> private final Map closingChannels;
> {code}
> Please see the  attached images and the following link for sample gc 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector

2018-08-17 Thread Yu Yang (JIRA)
Yu Yang created KAFKA-7304:
--

 Summary: memory leakage in org.apache.kafka.common.network.Selector
 Key: KAFKA-7304
 URL: https://issues.apache.org/jira/browse/KAFKA-7304
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 1.1.1, 1.1.0
Reporter: Yu Yang
 Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 
2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, 
Screen Shot 2018-08-16 at 4.26.19 PM.png

We are testing secured writes to Kafka through SSL. Testing at a small scale, 
SSL writes to Kafka were fine. However, when we enable SSL writes at scale 
(>40k clients writing concurrently), the Kafka brokers soon hit an OutOfMemory 
issue with a 4G heap setting. We tried increasing the heap size to 10Gb, and 
hit the same issue. 

We took a few heap dumps and found that most of the heap memory is referenced 
through the org.apache.kafka.common.network.Selector object.  There are two 
channel map fields in Selector. It seems that the entries are somehow not 
removed from these maps in a timely manner. 

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the  attached images and the following link for sample gc analysis. 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
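
As an aside on reproducing this kind of analysis: heap dumps like the ones 
referenced here are typically taken with jmap or jcmd against the broker PID, 
but they can also be triggered from inside a JVM. A hedged sketch follows; the 
class name and output path are arbitrary examples, not part of the report. 

{code}
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumpHelper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true keeps only reachable objects, which is what matters when
        // checking what still holds references to Selector's channel maps.
        diagnostics.dumpHeap("/tmp/kafka-broker.hprof", true);
    }
}
{code}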



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0

2018-08-01 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7229:
---
Priority: Major  (was: Critical)

> Failed to dynamically update kafka certificate in kafka 2.0.0
> -
>
> Key: KAFKA-7229
> URL: https://issues.apache.org/jira/browse/KAFKA-7229
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04.5 LTS
>Reporter: Yu Yang
>Priority: Major
>
> In kafka 1.1, we use the following command in a cron job to dynamically 
> update the certificate that kafka uses :
> kafka-configs.sh --bootstrap-server localhost:9093 --command-config 
> /var/pinterest/kafka/client.properties --alter --add-config 
> listener.name.ssl.ssl.keystore.location=/var/certs/kafka/kafka.keystore.jks.1533141082.38
>  --entity-type brokers --entity-name 9 
> In kafka 2.0.0, the command fails with the following exception: 
> [2018-08-01 16:38:01,480] ERROR [AdminClient clientId=adminclient-1] 
> Connection to node -1 failed authentication due to: SSL handshake failed 
> (org.apache.kafka.clients.NetworkClient)
> Error while executing config command with args '--bootstrap-server 
> localhost:9093 --command-config /var/pinterest/kafka/client.properties 
> --alter --add-config 
> listener.name.ssl.ssl.keystore.location=/var/pinterest/kafka/kafka.keystore.jks.1533141082.38
>  --entity-type brokers --entity-name 9'
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
>   at kafka.admin.ConfigCommand$.brokerConfig(ConfigCommand.scala:346)
>   at kafka.admin.ConfigCommand$.alterBrokerConfig(ConfigCommand.scala:304)
>   at 
> kafka.admin.ConfigCommand$.processBrokerConfig(ConfigCommand.scala:290)
>   at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:83)
>   at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
> Caused by: org.apache.kafka.common.errors.SslAuthenticationException: SSL 
> handshake failed
> Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
>   at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478)
>   at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
>   at 
> sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
>   at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
>   at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeWrap(SslTransportLayer.java:439)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:304)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
>   at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
>   at 
> org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1116)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
>   at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
>   at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
>   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
>   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
>   at 
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
>   at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
>   at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
>   at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
>   at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
>   at 
> 

[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0

2018-08-01 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7229:
---
Priority: Critical  (was: Major)

> Failed to dynamically update kafka certificate in kafka 2.0.0
> -
>
> Key: KAFKA-7229
> URL: https://issues.apache.org/jira/browse/KAFKA-7229
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04.5 LTS
>Reporter: Yu Yang
>Priority: Critical
>
> In kafka 1.1, we use the following command in a cron job to dynamically 
> update the certificate that kafka uses :
> kafka-configs.sh --bootstrap-server localhost:9093 --command-config 
> /var/pinterest/kafka/client.properties --alter --add-config 
> listener.name.ssl.ssl.keystore.location=/var/certs/kafka/kafka.keystore.jks.1533141082.38
>  --entity-type brokers --entity-name 9 
> In kafka 2.0.0, the command fails with the following exception: 
> [2018-08-01 16:38:01,480] ERROR [AdminClient clientId=adminclient-1] 
> Connection to node -1 failed authentication due to: SSL handshake failed 
> (org.apache.kafka.clients.NetworkClient)
> Error while executing config command with args '--bootstrap-server 
> localhost:9093 --command-config /var/pinterest/kafka/client.properties 
> --alter --add-config 
> listener.name.ssl.ssl.keystore.location=/var/pinterest/kafka/kafka.keystore.jks.1533141082.38
>  --entity-type brokers --entity-name 9'
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
>   at kafka.admin.ConfigCommand$.brokerConfig(ConfigCommand.scala:346)
>   at kafka.admin.ConfigCommand$.alterBrokerConfig(ConfigCommand.scala:304)
>   at 
> kafka.admin.ConfigCommand$.processBrokerConfig(ConfigCommand.scala:290)
>   at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:83)
>   at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
> Caused by: org.apache.kafka.common.errors.SslAuthenticationException: SSL 
> handshake failed
> Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
>   at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478)
>   at 
> sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
>   at 
> sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
>   at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
>   at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeWrap(SslTransportLayer.java:439)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:304)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
>   at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
>   at 
> org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1116)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
>   at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
>   at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
>   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
>   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
>   at 
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
>   at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
>   at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
>   at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
>   at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
>   at 
> 

[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0

2018-08-01 Thread Yu Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-7229:
---
Description: 
In kafka 1.1, we use the following command in a cron job to dynamically update 
the certificate that kafka uses:

kafka-configs.sh --bootstrap-server localhost:9093 --command-config 
/var/pinterest/kafka/client.properties --alter --add-config 
listener.name.ssl.ssl.keystore.location=/var/certs/kafka/kafka.keystore.jks.1533141082.38
 --entity-type brokers --entity-name 9 

In kafka 2.0.0, the command fails with the following exception: 



[2018-08-01 16:38:01,480] ERROR [AdminClient clientId=adminclient-1] Connection 
to node -1 failed authentication due to: SSL handshake failed 
(org.apache.kafka.clients.NetworkClient)
Error while executing config command with args '--bootstrap-server 
localhost:9093 --command-config /var/pinterest/kafka/client.properties --alter 
--add-config 
listener.name.ssl.ssl.keystore.location=/var/pinterest/kafka/kafka.keystore.jks.1533141082.38
 --entity-type brokers --entity-name 9'
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
at 
org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at 
org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at 
org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
at 
org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
at kafka.admin.ConfigCommand$.brokerConfig(ConfigCommand.scala:346)
at kafka.admin.ConfigCommand$.alterBrokerConfig(ConfigCommand.scala:304)
at 
kafka.admin.ConfigCommand$.processBrokerConfig(ConfigCommand.scala:290)
at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:83)
at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.SslAuthenticationException: SSL 
handshake failed
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478)
at 
sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
at 
sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeWrap(SslTransportLayer.java:439)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:304)
at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
at 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at 
org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1116)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
at 
sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
at 
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
at 
org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
... 7 more
Caused by: java.security.cert.CertificateException: No subject alternative DNS 
name matching localhost found.
at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:204)
at sun.security.util.HostnameChecker.match(HostnameChecker.java:95)
at 
sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
at 
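
The root cause at the bottom of the trace ("No subject alternative DNS name 
matching localhost found") points at hostname verification rather than the 
dynamic-config path itself: Kafka 2.0.0 changed the default of 
ssl.endpoint.identification.algorithm from empty to https. A hedged AdminClient 
sketch under that assumption follows; the store path and password are 
placeholders, and the equivalent fix for the kafka-configs.sh invocation would 
be adding ssl.endpoint.identification.algorithm= (empty value) to the 
client.properties passed via --command-config, or reissuing the certificate 
with a SAN entry for the bootstrap host.

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.config.SslConfigs;

public class SslAdminClientSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; the report connects to localhost:9093.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
        props.put(AdminClientConfig.SECURITY_PROTOCOL_CONFIG, "SSL");
        // Placeholder trust store; in the report this comes from client.properties.
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/var/certs/kafka/truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
        // An empty algorithm restores the pre-2.0 behaviour (no hostname verification).
        // Alternatively, keep the default and add a SAN for the bootstrap host.
        props.put(SslConfigs.SSL_ENDPOINT_IDENTIFICATION_ALGORITHM_CONFIG, "");

        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get()
                 .forEach(node -> System.out.println("connected to " + node));
        }
    }
}
{code}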

[jira] [Commented] (KAFKA-5886) Introduce delivery.timeout.ms producer config (KIP-91)

2018-07-11 Thread Yu Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540741#comment-16540741
 ] 

Yu Yang commented on KAFKA-5886:


[~ashsskum]   The pull request [https://github.com/apache/kafka/pull/5270] is 
currently under review. 

[~becket_qin], [~guozhang]  can you help to assign the ticket to me? 

> Introduce delivery.timeout.ms producer config (KIP-91)
> --
>
> Key: KAFKA-5886
> URL: https://issues.apache.org/jira/browse/KAFKA-5886
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Reporter: Sumant Tambe
>Assignee: Sumant Tambe
>Priority: Major
>
> We propose adding a new timeout delivery.timeout.ms. The window of 
> enforcement includes batching in the accumulator, retries, and the inflight 
> segments of the batch. With this config, the user has a guaranteed upper 
> bound on when a record will either get sent, fail or expire from the point 
> when send returns. In other words we no longer overload request.timeout.ms to 
> act as a weak proxy for accumulator timeout and instead introduce an explicit 
> timeout that users can rely on without exposing any internals of the producer 
> such as the accumulator. 
> See 
> [KIP-91|https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer]
>  for more details.
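
For illustration, a minimal producer sketch using the config this KIP 
introduces (delivery.timeout.ms, available from Kafka 2.1.0 onward); the broker 
address, topic, and timeout values are placeholders, and the timeout must be at 
least linger.ms + request.timeout.ms:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeliveryTimeoutExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Upper bound on accumulator time + retries + in-flight time for a record.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Must satisfy delivery.timeout.ms >= linger.ms + request.timeout.ms.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Invoked once the record is sent, fails, or expires
                            // within the delivery.timeout.ms window.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
{code}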



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6544) kafka process should exit when it encounters "java.io.IOException: Too many open files"

2018-02-08 Thread Yu Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated KAFKA-6544:
---
Description: 
Our kafka cluster encountered a few disk/xfs failures in the cloud vm 
instances. When a disk/xfs failure happens, the kafka process does not exit 
gracefully. Instead, it ran into  "" status, with port 9092 still reachable. 
When failures like this happen, kafka should shut down all threads and exit. 
The following are the kafka logs from when the failure happened:

{code:java}
[2018-02-08 12:52:31,764] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
 {code}

  was:
Our kafka cluster encountered a few disk/xfs failures in the cloud vm 
instances. When a disk/xfs failure happens, kafka process did not exit 
gracefully. Instead, it run into  "" status, with port 9092 still be 
reachable.  when failures like this happens, kafka should shutdown all threads 
and exit. The following is the kafka logs when the failure happens:

{code:java}
[2018-02-08 12:52:31,764] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection 
(kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:340)
at kafka.network.Acceptor.run(SocketServer.scala:283)
at java.lang.Thread.run(Thread.java:748)
 {code}
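
Only an illustrative sketch of the fail-fast behaviour the description asks 
for, not Kafka's actual SocketServer/Acceptor code: once accepting repeatedly 
fails with fd exhaustion, halt the process so the listener port is released 
instead of staying reachable in a half-dead state. The port and the failure 
threshold are assumptions.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class FailFastAcceptor {

    // Assumed threshold; real code would make this configurable.
    private static final int MAX_CONSECUTIVE_ACCEPT_FAILURES = 10;

    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9092));
        int consecutiveFailures = 0;

        while (true) {
            try (SocketChannel client = server.accept()) {
                consecutiveFailures = 0;
                // A real acceptor would hand the channel to a processor here.
            } catch (IOException e) {
                consecutiveFailures++;
                System.err.println("Error while accepting connection: " + e);
                boolean fdExhausted = e.getMessage() != null
                        && e.getMessage().contains("Too many open files");
                if (fdExhausted && consecutiveFailures >= MAX_CONSECUTIVE_ACCEPT_FAILURES) {
                    // Halt instead of lingering with port 9092 still bound.
                    System.err.println("File descriptors exhausted; halting process.");
                    Runtime.getRuntime().halt(1);
                }
            }
        }
    }
}
{code}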


> kafka process should exit when it encounters "java.io.IOException: Too many 
> open files"  
> -
>
> Key: KAFKA-6544
> URL: https://issues.apache.org/jira/browse/KAFKA-6544
> Project: Kafka
>  Issue Type: Bug
>  Components: admin, network
>Affects Versions: 0.10.2.1
>Reporter: Yu Yang
>Priority: Major
>
> Our kafka cluster encountered a few disk/xfs failures in the cloud vm 
> instances. When a disk/xfs failure happens, kafka process did not exit 
> gracefully. Instead, it ran into  "" status, with port 9092 still be 
> 

[jira] [Commented] (KAFKA-6544) kafka process should exit when it encounters "java.io.IOException: Too many open files"

2018-02-08 Thread Yu Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357483#comment-16357483
 ] 

Yu Yang commented on KAFKA-6544:


[~cmccabe]  The kafka process is in `` status.  sudo ls -l 
/proc/$kafka_pid/fd returns 0.   I am also including  "netstat -pnt" output 
here. Connections are either in ESTABLISHED or CLOSE_WAIT status. 

{code}
proc/30413/fd]# sudo ls -l /proc/30413/fd
total 0
{code} 

{code}
netstat -pnt | grep "10.1.160.124:9092" | wc
116 812   11252
{code} 
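
As a side note, the same descriptor count can be read from inside the JVM on 
Linux by listing /proc/self/fd; this is a generic sketch (not a Kafka API), and 
inspecting another process, as with the broker pid above, still needs 
/proc/<pid>/fd and suitable permissions.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class OpenFdCount {
    public static void main(String[] args) throws IOException {
        // /proc/self/fd holds one symlink per descriptor the current process owns;
        // the count includes the descriptor opened by this listing itself.
        try (Stream<Path> fds = Files.list(Paths.get("/proc/self/fd"))) {
            System.out.println("open file descriptors: " + fds.count());
        }
    }
}
{code}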


{code}
netstat -pnt | grep "10.1.160.124:9092"
tcp   29  0 10.1.160.124:9092   10.1.25.241:55616   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:58624   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.9.121:33894CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:53886   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:43122   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:50766   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.26.165:34282   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.79.149:47682   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.163.135:44008  CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.66.116:52398   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.64.116:36656   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.207.247:51904  CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.9.16:45942 CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.131.15:57118   CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:55974   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.214.5:33040CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:33494   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.201.139:60230  CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.207.247:51792  CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:42858   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:44246   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.194.26:42406   CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:32902   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.169.94:35532   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.193.101:48832  CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.204.225:60946  CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:35772   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:46972   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:56226   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:46432   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:44436   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:4   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:47364   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:44908   ESTABLISHED 
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:43060   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.10.15:39282CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.181.86:55500   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.17.191:32812   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.141.30:52024   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.76.141:51366   CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:50940   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092   10.1.11.196:44064   CLOSE_WAIT  
-   
tcp   65  0 10.1.160.124:9092   10.1.143.107:37116  CLOSE_WAIT  
-   
tcp   29  0 10.1.160.124:9092   10.1.25.241:37416   ESTABLISHED 
-   
tcp   65  0 10.1.160.124:9092