[jira] [Updated] (KAFKA-10731) have kafka producer & consumer auto-reload ssl certificate
[ https://issues.apache.org/jira/browse/KAFKA-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-10731:
----------------------------
    Affects Version/s: 2.3.1

> have kafka producer & consumer auto-reload ssl certificate
> ----------------------------------------------------------
>
>                 Key: KAFKA-10731
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10731
>             Project: Kafka
>          Issue Type: Improvement
>          Components: security
>    Affects Versions: 2.3.1
>            Reporter: Yu Yang
>            Priority: Major
>
> We use SSL in both brokers and Kafka clients for authentication and
> authorization, and we rotate the certificates every 12 hours. Kafka producers
> and consumers cannot pick up the rotated certs. This causes stream-processing
> interruptions (e.g. the Flink connector does not handle the SSL exception,
> and the Flink application has to be restarted when we hit this error). We
> need to improve the Kafka producer & consumer to support dynamic loading of
> SSL certificates.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
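One way to approach the reload the ticket asks for is a thin wrapper that rebuilds the client whenever the keystore file changes on disk. The sketch below is purely illustrative and not part of any Kafka API: `ReloadingClient`, the caller-supplied factory `Supplier`, and the mtime polling are all assumptions; a real implementation would wrap a KafkaProducer/KafkaConsumer and flush/close the old instance before swapping in the new one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: wraps any client object and rebuilds it whenever the
// keystore file on disk changes, so a freshly rotated certificate is picked up.
public class ReloadingClient<C> {
    private final Path keystore;
    private final Supplier<C> factory;            // caller-supplied: builds a fresh client
    private final AtomicReference<C> current = new AtomicReference<>();
    private volatile FileTime lastModified;

    public ReloadingClient(Path keystore, Supplier<C> factory) throws IOException {
        this.keystore = keystore;
        this.factory = factory;
        this.lastModified = Files.getLastModifiedTime(keystore);
        this.current.set(factory.get());
    }

    // Call before each use, or from a scheduled task: if the keystore was
    // rotated since the last check, build a new client that reads the new cert.
    public C get() throws IOException {
        FileTime now = Files.getLastModifiedTime(keystore);
        if (!now.equals(lastModified)) {
            lastModified = now;
            current.set(factory.get());           // real code would close the old client here
        }
        return current.get();
    }
}
```

For a 12-hour rotation cycle, a scheduled task calling `get()` every few minutes would be more than enough to swap clients before the old certificate expires.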
[jira] [Created] (KAFKA-10731) have kafka producer & consumer auto-reload ssl certificate
Yu Yang created KAFKA-10731:
-------------------------------

             Summary: have kafka producer & consumer auto-reload ssl certificate
                 Key: KAFKA-10731
                 URL: https://issues.apache.org/jira/browse/KAFKA-10731
             Project: Kafka
          Issue Type: Improvement
          Components: security
            Reporter: Yu Yang
[jira] [Created] (KAFKA-10407) add linger.ms parameter support to KafkaLog4jAppender
Yu Yang created KAFKA-10407:
-------------------------------

             Summary: add linger.ms parameter support to KafkaLog4jAppender
                 Key: KAFKA-10407
                 URL: https://issues.apache.org/jira/browse/KAFKA-10407
             Project: Kafka
          Issue Type: Improvement
          Components: logging
            Reporter: Yu Yang

Currently KafkaLog4jAppender does not accept a `linger.ms` setting. When a service has an outage that causes excessive error logging, the service can send too many produce requests to the Kafka brokers and overload them. Setting a non-zero `linger.ms` would allow the Kafka producer to batch records and reduce the number of produce requests.
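On the producer side, the batching behavior the ticket wants comes from two standard Kafka producer configs, `linger.ms` and `batch.size`. The sketch below only assembles those settings; the broker address is a placeholder, and exposing them through KafkaLog4jAppender itself is the still-hypothetical part the issue proposes.

```java
import java.util.Properties;

// Standard Kafka producer configs that make the producer coalesce records into
// batches instead of sending one produce request per log line.
public class BatchingProducerConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("linger.ms", "100");    // wait up to 100 ms for a batch to fill
        props.put("batch.size", "65536"); // send once a batch reaches 64 KB
        return props;
    }
}
```

With these settings a burst of error logs is grouped into a handful of produce requests rather than one request per record, at the cost of up to 100 ms of added latency per record.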
[jira] [Resolved] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang resolved KAFKA-8716.
----------------------------
    Resolution: Not A Problem

> broker cannot join the cluster after upgrading kafka binary from 2.1.1 to
> 2.2.1 or 2.3.0
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8716
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8716
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 2.3.0, 2.2.1
>            Reporter: Yu Yang
>            Priority: Critical
>
> We are trying to upgrade the kafka binary from 2.1 to 2.2.1 or 2.3.0. For
> both versions, the broker with the updated binary (2.2.1 or 2.3.0) could not
> start due to a ZooKeeper session-expiration exception. This error happens
> repeatedly, and the broker could not start because of it.
> Below are our ZooKeeper-related settings in server.properties:
> {code}
> zookeeper.connection.timeout.ms=6000
> zookeeper.session.timeout.ms=6000
> {code}
> The following is the stack trace; we are using ZooKeeper 3.5.3. Instead of
> waiting for a few seconds, the SESSIONEXPIRED error was returned immediately
> in the CheckedEphemeral.create call. Any insights?
> {code}
> [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient)
> [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral)
> [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
> 	at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
> 	at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
> 	at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
> 	at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
> 	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
> 	at kafka.Kafka$.main(Kafka.scala:75)
> 	at kafka.Kafka.main(Kafka.scala)
> {code}

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902255#comment-16902255 ]

Yu Yang commented on KAFKA-8716:
--------------------------------

Update: We verified that after upgrading ZooKeeper to 3.5.5, nodes with the Kafka 2.3 binary can rejoin the cluster fine. Thanks for looking into this issue!
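When checking an upgrade like this, it helps to confirm which version each ZooKeeper ensemble member is actually running. ZooKeeper's `srvr` four-letter command (`echo srvr | nc <host> 2181`) responds with a first line of the form `Zookeeper version: 3.5.5-<build-info>, built on ...`. The small parser below is illustrative glue for that response line, not part of any Kafka or ZooKeeper API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper: extracts the version number ("3.5.5") from the first
// line of a ZooKeeper "srvr" four-letter-command response.
public class ZkVersionCheck {
    private static final Pattern VERSION = Pattern.compile("Zookeeper version: ([0-9.]+)");

    public static String parse(String srvrFirstLine) {
        Matcher m = VERSION.matcher(srvrFirstLine);
        return m.find() ? m.group(1) : null;
    }
}
```

Running this against every ensemble member before and after the upgrade makes it easy to spot a node still on the 3.5.3 beta.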
[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894166#comment-16894166 ]

Yu Yang edited comment on KAFKA-8716 at 7/27/19 3:19 PM:
---------------------------------------------------------

[~junrao] We did not find much useful information in our ZooKeeper logs. It seems related to the ZooKeeper version we use: we are running ZooKeeper 3.5.3, which is a beta version. We will upgrade ZooKeeper to 3.5.5, which is a stable release, to see if that fixes the issue.

was (Author: yuyang08):
[~junrao] it seems that it is related to the zookeeper version that we uses. we are using zookeeper 3.5.3 that is a beta version. will upgrade zookeeper to 3.5.5 that is stable release to see if that fixes the issue.
[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894166#comment-16894166 ]

Yu Yang commented on KAFKA-8716:
--------------------------------

[~junrao] It seems related to the ZooKeeper version we use: we are running ZooKeeper 3.5.3, which is a beta version. We will upgrade ZooKeeper to 3.5.5, which is a stable release, to see if that fixes the issue.
[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894026#comment-16894026 ]

Yu Yang commented on KAFKA-8716:
--------------------------------

The following is the log (with debug logging enabled) around the exception:
{code}
[2019-07-26 17:45:44,476] INFO Creating /brokers/ids/85 (is it secure? false) (kafka.zk.KafkaZkClient)
[2019-07-26 17:45:44,484] DEBUG Reading reply sessionid:0x7593f202705, packet:: clientPath:null serverPath:null finished:false header:: 91,14 replyHeader:: 91,234840046463,0 request:: org.apache.zookeeper.MultiTransactionRecord@3cd2650b response:: org.apache.zookeeper.MultiResponse@f554 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,486] ERROR Error while creating ephemeral at /brokers/ids/85 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-26 17:45:44,491] ERROR [KafkaServer id=85] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
	at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1727)
{code}

The following is the debug log from ZooKeeperClientWatcher:
{code}
[2019-07-26 17:45:43,296] DEBUG [ZooKeeperClient Kafka server] Received event: WatchedEvent state:SyncConnected type:None path:null (kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:43,449] DEBUG [ZooKeeperClient Kafka server] Received event: WatchedEvent state:Closed type:None path:null (kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:43,489] DEBUG [ZooKeeperClient Kafka server] Received event: WatchedEvent state:SyncConnected type:None path:null (kafka.zookeeper.ZooKeeperClient)
[2019-07-26 17:45:44,901] DEBUG [ZooKeeperClient Kafka server] Received event: WatchedEvent state:Closed type:None path:null (kafka.zookeeper.ZooKeeperClient)
{code}

The following is the log for the ZooKeeper session:
{code}
[2019-07-26 17:45:43,489] INFO Session establishment complete on server datazk007/10.1.16.191:2181, sessionid = 0x7593f202705, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:43,492] DEBUG Reading reply sessionid:0x7593f202705, packet:: clientPath:/consumers serverPath:/testkafka/consumers finished:false header:: 1,1 replyHeader:: 1,234840045921,-110 request:: '/testkafka/consumers,,v{s{31,s{'world,'anyone}}},0 response:: (org.apache.zookeeper.ClientCnxn)
...
[2019-07-26 17:45:44,484] DEBUG Reading reply sessionid:0x7593f202705, packet:: clientPath:null serverPath:null finished:false header:: 91,14 replyHeader:: 91,234840046463,0 request:: org.apache.zookeeper.MultiTransactionRecord@3cd2650b response:: org.apache.zookeeper.MultiResponse@f554 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG Closing session: 0x7593f202705 (org.apache.zookeeper.ZooKeeper)
[2019-07-26 17:45:44,800] DEBUG Closing client for session: 0x7593f202705 (org.apache.zookeeper.ClientCnxn)
...
[2019-07-26 17:45:44,800] DEBUG Reading reply sessionid:0x7593f202705, packet:: clientPath:null serverPath:null finished:false header:: 92,-11 replyHeader:: 92,234840046569,0 request:: null response:: null (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG Disconnecting client for session: 0x7593f202705 (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,800] DEBUG An exception was thrown while closing send thread for session 0x7593f202705 : Unable to read additional data from server sessionid 0x7593f202705, likely server has closed socket (org.apache.zookeeper.ClientCnxn)
[2019-07-26 17:45:44,901] INFO Session: 0x7593f202705 closed (org.apache.zookeeper.ZooKeeper)
[2019-07-26 17:45:44,901] INFO EventThread shut down for session: 0x7593f202705 (org.apache.zookeeper.ClientCnxn)
{code}
[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893956#comment-16893956 ]

Yu Yang edited comment on KAFKA-8716 at 7/26/19 4:42 PM:
---------------------------------------------------------

Thanks for checking, [~junrao]. I added more information in the description section. The SessionExpired exception happened immediately after the CheckedEphemeral.create call, and happened repeatedly, so the broker could not start properly.

was (Author: yuyang08):
Thank for checking [~junrao]. Added more information in the description session. The "SESSIONExpiration" exception happened immediately after the "CheckedEphemeral.create" call, and happened repeatedly so that the broker could not get started properly.
[jira] [Comment Edited] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893956#comment-16893956 ]

Yu Yang edited comment on KAFKA-8716 at 7/26/19 4:22 PM:
---------------------------------------------------------

Thanks for checking, [~junrao]. I added more information in the description section. The SessionExpired exception happened immediately after the CheckedEphemeral.create call, and happened repeatedly, so the broker could not start properly.

was (Author: yuyang08):
Thank for checking [~junrao]. Added more information in the description session. The "SESSIONExpiration" exception happened immediately after the "CheckedEphemeral.create" call, and happened repeatedly so that the broker could not get started.
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-8716:
---------------------------
    Summary: broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0  (was: broker cannot join the cluster after upgrading kafka binary from 2.1.0 to 2.2.1 or 2.3.0)
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.0 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-8716:
---------------------------
    Summary: broker cannot join the cluster after upgrading kafka binary from 2.1.0 to 2.2.1 or 2.3.0  (was: broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0)
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-8716:
---------------------------
    Component/s: zkclient
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-8716:
---------------------------
    Description: 
We are trying to upgrade the kafka binary from 2.1 to 2.2.1 or 2.3.0. For both versions, the broker with the updated binary (2.2.1 or 2.3.0) could not start due to a ZooKeeper session-expiration exception. This error happens repeatedly, and the broker could not start because of it.

Below are our ZooKeeper-related settings in server.properties:
{code}
zookeeper.connection.timeout.ms=6000
zookeeper.session.timeout.ms=6000
{code}

The following is the stack trace; we are using ZooKeeper 3.5.3. Instead of waiting for a few seconds, the SESSIONEXPIRED error was returned immediately in the CheckedEphemeral.create call. Any insights?
{code}
[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
	at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
	at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
	at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
	at kafka.Kafka$.main(Kafka.scala:75)
	at kafka.Kafka.main(Kafka.scala)
{code}

  was:
We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started due to zookeeper session expiration exception.

Below is our zk related setting in server.properties:
{code}
zookeeper.connection.timeout.ms=6000
zookeeper.session.timeout.ms=6000
{code}

The following is the stack trace, and we are using zookeeper 3.5.3. Instead of waiting for a few seconds, the SESSIONEXPIRED error returned immediately in CheckedEphemeral.create call. Any insights?
{code}
[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient)
[2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
	at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
	at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
	at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
	at kafka.Kafka$.main(Kafka.scala:75)
	at kafka.Kafka.main(Kafka.scala)
{code}
[jira] [Commented] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893956#comment-16893956 ] Yu Yang commented on KAFKA-8716: Thanks for checking [~junrao]. Added more information in the description section. The SESSIONEXPIRED exception happened immediately after the "CheckedEphemeral.create" call, and happened repeatedly, so the broker could not start. > broker cannot join the cluster after upgrading kafka binary from 2.1.1 to > 2.2.1 or 2.3.0 > > > Key: KAFKA-8716 > URL: https://issues.apache.org/jira/browse/KAFKA-8716 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.3.0, 2.2.1 >Reporter: Yu Yang >Priority: Critical > > We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both > versions, the broker with updated binary (2.2.1 or 2.3.0) could not get > started due to zookeeper session expiration exception. > Below is our zk related setting in server.properties: > {code} > zookeeper.connection.timeout.ms=6000 > zookeeper.session.timeout.ms=6000 > {code} > The following is the stack trace, and we are using zookeeper 3.5.3. Instead > of waiting for a few seconds, the SESSIONEXPIRED error returned immediately > in CheckedEphemeral.create call. Any insights? > [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) > (kafka.zk.KafkaZkClient) > [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at > /brokers/ids/80 with return code: SESSIONEXPIRED > (kafka.zk.KafkaZkClient$CheckedEphemeral) > [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during > KafkaServer startup.
Prepare to shutdown (kafka.server.KafkaServer) > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) > at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) > at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) > at kafka.server.KafkaServer.startup(KafkaServer.scala:260) > at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) > at kafka.Kafka$.main(Kafka.scala:75) > at kafka.Kafka.main(Kafka.scala) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
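Session-expiration failures at registration time are usually handled by retrying on a fresh session with backoff. The helper below is an illustrative sketch only (hypothetical class, not Kafka's actual broker-registration code) of the kind of exponential-backoff retry a startup path could wrap around an ephemeral-node create:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Retries an operation that can fail with a transient error (e.g. an
    // expired ZooKeeper session), doubling the backoff after each attempt.
    public static <T> T run(Callable<T> op, int maxAttempts, long baseBackoffMs) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the most recent failure
                try {
                    Thread.sleep(baseBackoffMs << attempt); // 1x, 2x, 4x, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new RuntimeException("all attempts failed", last);
    }
}
```

Note that in the report above the expiration recurs on every attempt, so a retry alone would not help; the symptom points at the session itself (the 6000 ms timeouts, or zookeeper 3.5.3 compatibility) rather than a transient blip.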
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-8716: --- Description: We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started due to zookeeper session expiration exception. Below is our zk related setting in server.properties: {code} zookeeper.connection.timeout.ms=6000 zookeeper.session.timeout.ms=6000 {code} The following is the stack trace, and we are using zookeeper 3.5.3. Instead of waiting for a few seconds, the SESSIONEXPIRED error returned immediately in CheckedEphemeral.create call. Any insights? [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient) [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral) [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) at kafka.server.KafkaServer.startup(KafkaServer.scala:260) at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) at kafka.Kafka$.main(Kafka.scala:75) at kafka.Kafka.main(Kafka.scala) was: We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started due to zookeeper session expiration exception. The following is the stack trace, and we are using zookeeper 3.5.3. Any insights? 
[2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient) [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral) [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) at kafka.server.KafkaServer.startup(KafkaServer.scala:260) at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) at kafka.Kafka$.main(Kafka.scala:75) at kafka.Kafka.main(Kafka.scala) > broker cannot join the cluster after upgrading kafka binary from 2.1.1 to > 2.2.1 or 2.3.0 > > > Key: KAFKA-8716 > URL: https://issues.apache.org/jira/browse/KAFKA-8716 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.3.0, 2.2.1 >Reporter: Yu Yang >Priority: Critical > > We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both > versions, the broker with updated binary (2.2.1 or 2.3.0) could not get > started due to zookeeper session expiration exception. > Below is our zk related setting in server.properties: > {code} > zookeeper.connection.timeout.ms=6000 > zookeeper.session.timeout.ms=6000 > {code} > The following is the stack trace, and we are using zookeeper 3.5.3. Instead > of waiting for a few seconds, the SESSIONEXPIRED error returned immediately > in CheckedEphemeral.create call. Any insights? > [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? 
false) > (kafka.zk.KafkaZkClient) > [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at > /brokers/ids/80 with return code: SESSIONEXPIRED > (kafka.zk.KafkaZkClient$CheckedEphemeral) > [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during > KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) > at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) > at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) > at kafka.server.KafkaServer.startup(KafkaServer.scala:260) > at
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-8716: --- Priority: Critical (was: Major) > broker cannot join the cluster after upgrading kafka binary from 2.1.1 to > 2.2.1 or 2.3.0 > > > Key: KAFKA-8716 > URL: https://issues.apache.org/jira/browse/KAFKA-8716 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.3.0, 2.2.1 >Reporter: Yu Yang >Priority: Critical > > We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both > versions, the broker with updated binary (2.2.1 or 2.3.0) could not get > started due to zookeeper session expiration exception. > The following is the stack trace, and we are using zookeeper 3.5.3. Any > insights? > [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) > (kafka.zk.KafkaZkClient) > [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at > /brokers/ids/80 with return code: SESSIONEXPIRED > (kafka.zk.KafkaZkClient$CheckedEphemeral) > [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during > KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) > at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) > at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) > at kafka.server.KafkaServer.startup(KafkaServer.scala:260) > at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) > at kafka.Kafka$.main(Kafka.scala:75) > at kafka.Kafka.main(Kafka.scala) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (KAFKA-8716) broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0
[ https://issues.apache.org/jira/browse/KAFKA-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-8716: --- Summary: broker cannot join the cluster after upgrading kafka binary from 2.1.1 to 2.2.1 or 2.3.0 (was: broker cannot join the cluster after upgrading the binary from 2.1 to 2.2.1 or 2.3.0) > broker cannot join the cluster after upgrading kafka binary from 2.1.1 to > 2.2.1 or 2.3.0 > > > Key: KAFKA-8716 > URL: https://issues.apache.org/jira/browse/KAFKA-8716 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.3.0, 2.2.1 >Reporter: Yu Yang >Priority: Major > > We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both > versions, the broker with updated binary (2.2.1 or 2.3.0) could not get > started due to zookeeper session expiration exception. > The following is the stack trace, and we are using zookeeper 3.5.3. Any > insights? > [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) > (kafka.zk.KafkaZkClient) > [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at > /brokers/ids/80 with return code: SESSIONEXPIRED > (kafka.zk.KafkaZkClient$CheckedEphemeral) > [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during > KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) > at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) > at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) > at kafka.server.KafkaServer.startup(KafkaServer.scala:260) > at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) > at kafka.Kafka$.main(Kafka.scala:75) > at kafka.Kafka.main(Kafka.scala) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (KAFKA-8716) broker cannot join the cluster after upgrading the binary from 2.1 to 2.2.1 or 2.3.0
Yu Yang created KAFKA-8716: -- Summary: broker cannot join the cluster after upgrading the binary from 2.1 to 2.2.1 or 2.3.0 Key: KAFKA-8716 URL: https://issues.apache.org/jira/browse/KAFKA-8716 Project: Kafka Issue Type: Bug Affects Versions: 2.2.1, 2.3.0 Reporter: Yu Yang We are trying to upgrade kafka binary from 2.1 to 2.2.1 or 2.3.0. For both versions, the broker with updated binary (2.2.1 or 2.3.0) could not get started due to zookeeper session expiration exception. The following is the stack trace, and we are using zookeeper 3.5.3. Any insights? [2019-07-25 18:07:35,712] INFO Creating /brokers/ids/80 (is it secure? false) (kafka.zk.KafkaZkClient) [2019-07-25 18:07:35,724] ERROR Error while creating ephemeral at /brokers/ids/80 with return code: SESSIONEXPIRED (kafka.zk.KafkaZkClient$CheckedEphemeral) [2019-07-25 18:07:35,731] ERROR [KafkaServer id=80] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97) at kafka.server.KafkaServer.startup(KafkaServer.scala:260) at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) at kafka.Kafka$.main(Kafka.scala:75) at kafka.Kafka.main(Kafka.scala) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (KAFKA-8300) kafka broker did not recovery from quota limit after quota setting is removed
Yu Yang created KAFKA-8300: -- Summary: kafka broker did not recovery from quota limit after quota setting is removed Key: KAFKA-8300 URL: https://issues.apache.org/jira/browse/KAFKA-8300 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.1.0 Environment: Ubuntu 14.04.5 LTS Release: 14.04 Reporter: Yu Yang Attachments: Screen Shot 2019-04-26 at 4.02.03 PM.png We applied quota management to one of our clusters. After applying the quota, we saw the following errors in the kafka server log, and the broker's network traffic did not recover, even after we removed the quota settings. Any insights on this? {code} [2019-04-26 20:59:42,359] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.57.190:59846-4925637 (kafka.network.Processor) [2019-04-26 20:59:43,518] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.230.92:49788-4925646 (kafka.network.Processor) [2019-04-26 20:59:44,343] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.32.233:35714-4925663 (kafka.network.Processor) [2019-04-26 20:59:45,448] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.250:52884-4925658 (kafka.network.Processor) [2019-04-26 20:59:45,544] WARN Attempting to send response via channel for which there is no open connection, connection id 10.1.239.72:9093-10.3.55.24:41608-4925687 (kafka.network.Processor) {code} !Screen Shot 2019-04-26 at 4.02.03 PM.png|width=640px! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-8300) kafka broker did not recover from quota limit after quota setting is removed
[ https://issues.apache.org/jira/browse/KAFKA-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-8300: --- Summary: kafka broker did not recover from quota limit after quota setting is removed (was: kafka broker did not recovery from quota limit after quota setting is removed) > kafka broker did not recover from quota limit after quota setting is removed > > > Key: KAFKA-8300 > URL: https://issues.apache.org/jira/browse/KAFKA-8300 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.0 > Environment: Ubuntu 14.04.5 LTS > Release: 14.04 >Reporter: Yu Yang >Priority: Major > Attachments: Screen Shot 2019-04-26 at 4.02.03 PM.png > > > We applied quota management to one of our clusters. After applying the quota, > we saw the following errors in the kafka server log, and the broker's network > traffic did not recover, even after we removed the quota settings. Any > insights on this? > {code} > [2019-04-26 20:59:42,359] WARN Attempting to send response via channel for > which there is no open connection, connection id > 10.1.239.72:9093-10.3.57.190:59846-4925637 (kafka.network.Processor) > [2019-04-26 20:59:43,518] WARN Attempting to send response via channel for > which there is no open connection, connection id > 10.1.239.72:9093-10.3.230.92:49788-4925646 (kafka.network.Processor) > [2019-04-26 20:59:44,343] WARN Attempting to send response via channel for > which there is no open connection, connection id > 10.1.239.72:9093-10.3.32.233:35714-4925663 (kafka.network.Processor) > [2019-04-26 20:59:45,448] WARN Attempting to send response via channel for > which there is no open connection, connection id > 10.1.239.72:9093-10.3.55.250:52884-4925658 (kafka.network.Processor) > [2019-04-26 20:59:45,544] WARN Attempting to send response via channel for > which there is no open connection, connection id > 10.1.239.72:9093-10.3.55.24:41608-4925687 (kafka.network.Processor) > {code} > > !Screen Shot 2019-04-26 at 4.02.03 PM.png|width=640px! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
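For context on how byte-rate quotas translate into delays: when a client overshoots its bound, the broker computes a throttle time proportional to the overshoot, so that the rate averaged over the measurement window falls back to the quota. The helper below is a simplified sketch of that idea (hypothetical code, not Kafka's ClientQuotaManager):

```java
public class QuotaThrottle {
    // Simplified sketch: delay long enough that, averaged over the
    // measurement window, the effective rate drops back to the quota bound.
    public static long throttleTimeMs(double observedRate, double quotaBound,
                                      long windowMs) {
        if (observedRate <= quotaBound) {
            return 0L; // within quota: no throttling
        }
        return (long) ((observedRate - quotaBound) / quotaBound * windowMs);
    }
}
```

The WARN lines above are consistent with clients timing out and closing their connections while responses are still being delayed; the open question in the report is why the delays persisted after the quota configuration was deleted.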
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Priority: Major (was: Critical) > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Major > Fix For: 1.1.2, 2.2.0, 2.0.2 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at > 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot > 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, > Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 > AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at > 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot > 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, > Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 > PM.png > > > We are testing secured writing to kafka through ssl. Testing at small scale, > ssl writing to kafka was fine. However, when we enabled ssl writing at a > larger scale (>40k clients write concurrently), the kafka brokers soon hit an > OutOfMemory issue with a 4G memory setting. We tried increasing the > heap size to 10GB, but encountered the same issue. > We took a few heap dumps, and found that most of the heap memory is > referenced through org.apache.kafka.common.network.Selector objects. There > are two channel map fields in Selector. It seems that the objects are > not removed from the maps in a timely manner. > One observation is that the memory leak seems related to kafka partition > leader changes. If there is a broker restart etc. in the cluster that causes a > partition leadership change, the brokers may hit the OOM issue faster.
> {code} > private final Map<String, KafkaChannel> channels; > private final Map<String, KafkaChannel> closingChannels; > {code} > Please see the attached images and the following link for sample gc > analysis. > http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 > the command line for running kafka: > {code} > java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m > -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC > -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 > -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 > -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M > -Djava.awt.headless=true > -Dlog4j.configuration=file:/etc/kafka/log4j.properties > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.port= > -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* > kafka.Kafka /etc/kafka/server.properties > {code} > We use java 1.8.0_102, and have applied a TLS patch reducing the > X509Factory.certCache map size from 750 to 20. > {code} > java -version > java version "1.8.0_102" > Java(TM) SE Runtime Environment (build 1.8.0_102-b14) > Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
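The leak pattern described, entries lingering in the two channel maps, can be illustrated with a toy registry (a hypothetical class for illustration, not Kafka's Selector): a channel only becomes garbage-collectable once it has been removed from both maps, so any code path that moves a channel into the closing map but never completes the close retains it forever.

```java
import java.util.HashMap;
import java.util.Map;

public class ChannelRegistry {
    // Mirrors the two-map shape described above: live channels, plus
    // channels that are draining before being fully closed.
    private final Map<String, Object> channels = new HashMap<>();
    private final Map<String, Object> closingChannels = new HashMap<>();

    public void register(String id, Object channel) {
        channels.put(id, channel);
    }

    // Step 1 of closing: move the channel aside so it can drain.
    public void beginClose(String id) {
        Object ch = channels.remove(id);
        if (ch != null) {
            closingChannels.put(id, ch);
        }
    }

    // Step 2: only this removal makes the channel collectable. A code path
    // that skips it (e.g. during leadership-change connection churn)
    // reproduces the retained-memory pattern seen in the heap dumps.
    public void finishClose(String id) {
        closingChannels.remove(id);
    }

    public int retainedCount() {
        return channels.size() + closingChannels.size();
    }
}
```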
[jira] [Updated] (KAFKA-7450) "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Description: After updating security.inter.broker.protocol to SSL for our cluster, we observed that the controller can get into almost 100% cpu usage from time to time. {code:java} listeners=PLAINTEXT://:9092,SSL://:9093 security.inter.broker.protocol=SSL {code} There is no obvious error in server.log. But in controller.log, there are repeated SSL handshake failure errors as below: {code:java} [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread) org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2 at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487) at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813) at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468) at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331) at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258) at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487) at org.apache.kafka.common.network.Selector.poll(Selector.java:425) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73) at
kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2 at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) at java.security.AccessController.doPrivileged(Native Method) at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393) at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473) ... 10 more {code} {code:java} [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, maxBytes=4194304), the_test_topic-58=(offset=462312762, logStartOffset=462295078, maxBytes=4194304)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread) org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2 at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538) at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813) at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468) at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331) at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258) at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487) at org.apache.kafka.common.network.Selector.poll(Selector.java:425) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:91) at
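One way persistent handshake failures turn into near-100% CPU is a tight reconnect loop: every failed handshake immediately triggers another attempt, which fails again. A small sketch of a reconnect gate (illustrative only, not Kafka's RequestSendThread or ReplicaFetcher logic) that backs off exponentially after consecutive failures:

```java
public class ReconnectGate {
    private long notBeforeMs = 0L; // earliest time the next attempt is allowed
    private int failures = 0;      // consecutive handshake failures so far
    private final long baseBackoffMs;
    private final long maxBackoffMs;

    public ReconnectGate(long baseBackoffMs, long maxBackoffMs) {
        this.baseBackoffMs = baseBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
    }

    // Callers check this before attempting a connection.
    public boolean mayConnect(long nowMs) {
        return nowMs >= notBeforeMs;
    }

    // Each consecutive failure doubles the wait, capped at maxBackoffMs.
    public void onFailure(long nowMs) {
        failures++;
        long backoff = Math.min(maxBackoffMs,
                baseBackoffMs << Math.min(failures - 1, 20));
        notBeforeMs = nowMs + backoff;
    }

    public void onSuccess() {
        failures = 0;
        notBeforeMs = 0L;
    }
}
```

A gate like this bounds the CPU cost of a broker whose peer keeps rejecting handshakes, but it does not address the underlying "Handshake message sequence violation" itself.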
[jira] [Commented] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly
[ https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711009#comment-16711009 ] Yu Yang commented on KAFKA-7704: [~huxi_2b], [~junrao] I verified that https://github.com/apache/kafka/pull/5998 does fix the maxlag metric issue. Thanks for the quick fix! > kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported > incorrectly > --- > > Key: KAFKA-7704 > URL: https://issues.apache.org/jira/browse/KAFKA-7704 > Project: Kafka > Issue Type: Bug > Components: metrics >Affects Versions: 2.1.0 >Reporter: Yu Yang >Assignee: huxihx >Priority: Major > Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png > > > We recently deployed kafka 2.1, and noticed a jump in > kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, > there are no under-replicated partitions for the cluster. > The initial analysis shows that kafka 2.1.0 does not report the metric correctly > for topics that have no incoming traffic right now, but had traffic earlier. > For those topics, ReplicaFetcherManager will consider the maxLag to be the > latest offset. > For instance, we have a topic named `test_topic`: > {code} > [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l > total 8 > -rw-rw-r-- 1 kafka kafka 10485760 Dec 4 00:13 099043947579.index > -rw-rw-r-- 1 kafka kafka 0 Sep 23 03:01 099043947579.log > -rw-rw-r-- 1 kafka kafka 10 Dec 4 00:13 099043947579.snapshot > -rw-rw-r-- 1 kafka kafka 10485756 Dec 4 00:13 099043947579.timeindex > -rw-rw-r-- 1 kafka kafka 4 Dec 4 00:13 leader-epoch-checkpoint > {code} > kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579 > !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7704) kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported incorrectly
[ https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7704: --- Attachment: Screen Shot 2018-12-05 at 10.13.09 PM.png > kafka.server.ReplicaFetechManager.MaxLag.Replica metric is reported > incorrectly > --- > > Key: KAFKA-7704 > URL: https://issues.apache.org/jira/browse/KAFKA-7704 > Project: Kafka > Issue Type: Bug > Components: metrics >Affects Versions: 2.1.0 >Reporter: Yu Yang >Assignee: huxihx >Priority: Major > Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png, Screen Shot > 2018-12-05 at 10.13.09 PM.png > > > We recently deployed kafka 2.1, and noticed a jump in > kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, > there are no under-replicated partitions for the cluster. > The initial analysis shows that kafka 2.1.0 does not report the metric correctly > for topics that have no incoming traffic right now, but had traffic earlier. > For those topics, ReplicaFetcherManager will consider the maxLag to be the > latest offset. > For instance, we have a topic named `test_topic`: > {code} > [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l > total 8 > -rw-rw-r-- 1 kafka kafka 10485760 Dec 4 00:13 099043947579.index > -rw-rw-r-- 1 kafka kafka 0 Sep 23 03:01 099043947579.log > -rw-rw-r-- 1 kafka kafka 10 Dec 4 00:13 099043947579.snapshot > -rw-rw-r-- 1 kafka kafka 10485756 Dec 4 00:13 099043947579.timeindex > -rw-rw-r-- 1 kafka kafka 4 Dec 4 00:13 leader-epoch-checkpoint > {code} > kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579 > !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7704) kafka.server.ReplicaFetcherManager.MaxLag.Replica metric is reported incorrectly
Yu Yang created KAFKA-7704: -- Summary: kafka.server.ReplicaFetcherManager.MaxLag.Replica metric is reported incorrectly Key: KAFKA-7704 URL: https://issues.apache.org/jira/browse/KAFKA-7704 Project: Kafka Issue Type: Bug Components: metrics Affects Versions: 2.1.0 Reporter: Yu Yang Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png
We deployed kafka 2.1 and noticed a jump in the kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, there are no under-replicated partitions. The initial analysis showed that kafka 2.1.0 does not report the metric correctly for topics that have no incoming traffic right now but had traffic earlier. For those topics, ReplicaFetcherManager reports the maxLag as the latest offset. For instance, we have a topic *test_topic*:
{code}
[root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l
total 8
-rw-rw-r-- 1 kafka kafka 10485760 Dec  4 00:13 099043947579.index
-rw-rw-r-- 1 kafka kafka        0 Sep 23 03:01 099043947579.log
-rw-rw-r-- 1 kafka kafka       10 Dec  4 00:13 099043947579.snapshot
-rw-rw-r-- 1 kafka kafka 10485756 Dec  4 00:13 099043947579.timeindex
-rw-rw-r-- 1 kafka kafka        4 Dec  4 00:13 leader-epoch-checkpoint
{code}
kafka reports ReplicaFetcherManager.MaxLag.Replica as 99043947579 !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px!
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
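The misreport described above can be sketched as a minimal, illustrative Python model (the function and variable names are hypothetical, not Kafka's internals): replica lag should be the leader's log end offset minus the follower's fetch offset, but for an idle partition the buggy path effectively falls back to a fetch offset of 0, so the reported lag equals the latest offset even though the replica is fully caught up.

```python
def max_lag(log_end_offset: int, fetch_offset: int) -> int:
    """Replica lag: distance between the leader's log end offset and the
    follower's current fetch offset. Names here are illustrative only."""
    return log_end_offset - fetch_offset

log_end = 99043947579  # the partition's latest offset from the ticket

# Buggy behavior reported in this ticket: an idle topic whose fetch offset
# is effectively treated as 0 gets lag == latest offset.
print(max_lag(log_end, 0))        # misreported lag: 99043947579

# Expected behavior for a caught-up replica on an idle topic: lag == 0.
print(max_lag(log_end, log_end))  # actual lag: 0
```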
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251 ] Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:49 AM: - [~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, 16g max heap size for the kafka process, and ~20k producers. Using a 16gb heap size, we did not see frequent gc. But at the same time, we still hit the high cpu usage issue that is documented in KAFKA-7364. Did you see a similar high cpu usage issue in your case? The following is our ssl related kafka setting:
{code:java}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=PLAINTEXT
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.endpoint.identification.algorithm=HTTPS
ssl.key.password=key_password
ssl.keystore.location=keystore_location
ssl.keystore.password=keystore_password
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=truststore_path
ssl.truststore.password=truststore_password
ssl.truststore.type=JKS
{code}
The following is the gc chart on a broker using the kafka 2.0 binary with commits up to [https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4] [http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3] !Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500! The following is the cpu usage chart of our cluster. The cpu usage jumped to almost 100% after enabling TLS-based writing to the cluster. !Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500! There is another issue that we saw with the following setting. See KAFKA-7450 for details.
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}
was (Author: yuyang08): [~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, 16g max heap size for the kafka process, and ~30k producers. Using a 16gb heap size, we did not see frequent gc.
But at the same time, we still hit the high cpu usage issue that is documented in KAFKA-7364. Did you see high cpu usage related issue in your case? The following is our ssl related kafka setting: {code:java} listeners=PLAINTEXT://:9092,SSL://:9093 security.inter.broker.protocol=PLAINTEXT ssl.client.auth=required ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 ssl.endpoint.identification.algorithm=HTTPS ssl.key.password=key_password ssl.keystore.location=keystore_location ssl.keystore.password=keystore_password ssl.keystore.type=JKS ssl.secure.random.implementation=SHA1PRNG ssl.truststore.location=truststore_path ssl.truststore.password=truststore_password ssl.truststore.type=JKS {code} The following is the gc chart on a broker using kafka 2.0 binary with commits up to [https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4] [http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3] !Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500! The following is the cpu usage chart of our cluster. The cpu usage jumped to almost 100% after enabling TLS-based writing to the cluster. !Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500! There is another issue that we saw with the following setting. See KAFKA-7450 for details. 
{code}
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
{code}
> memory leakage in org.apache.kafka.common.network.Selector
> --
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Yu Yang
> Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 PM.png
>
> We are testing secured writing to kafka through ssl. Testing at small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writing concurrently), the kafka brokers soon hit an OutOfMemory issue with a 4G memory setting. We have tried increasing the heap size to 10Gb, but encountered the same issue.
> We took a few heap dumps, and found that most of the heap
[jira] [Updated] (KAFKA-7450) "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Summary: "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage (was: kafka "Handshake message sequence violation" leads to high cpu usage)
> "Handshake message sequence violation" related ssl handshake failure leads to high cpu usage
>
> Key: KAFKA-7450
> URL: https://issues.apache.org/jira/browse/KAFKA-7450
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 2.0.0
> Reporter: Yu Yang
> Priority: Major
>
> After updating security.inter.broker.protocol to SSL for our cluster, we observed that the controller can reach almost 100% cpu usage.
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> security.inter.broker.protocol=SSL
> {code}
> There is no obvious error in server.log. But in controller.log, there is a repetitive SSL handshake failure error as below:
> {code}
> [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
> at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
> at
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
> at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
> at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
> at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
> at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
> at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2
> at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196)
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
> at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
> at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
> at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
> ...
10 more
> {code}
> {code}
> [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, maxBytes=4194304), the_test_topic-58=(offset=462312762, logStartOffset=462295078, maxBytes=4194304)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
> Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2
> at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538)
> at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at
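A tight reconnect loop behind a repeated handshake failure like the one above is the usual cause of this kind of CPU spike, and the standard mitigation is exponential backoff between connection attempts. A minimal illustrative sketch follows; the base and cap values mirror the Kafka client defaults for `reconnect.backoff.ms` (50) and `reconnect.backoff.max.ms` (1000), but the loop itself is hypothetical Python, not Kafka code:

```python
import random

def backoff_ms(attempt: int, base_ms: int = 50, max_ms: int = 1000) -> float:
    """Exponential backoff with jitter: 50, 100, 200, ... capped at max_ms."""
    delay = min(base_ms * (2 ** attempt), max_ms)
    # +/-20% jitter so many clients don't reconnect in lockstep
    return delay * random.uniform(0.8, 1.2)

# Without a delay like this, a handshake that fails immediately gets retried
# in a tight loop, which is one way a broker/controller thread can pin a core.
for attempt in range(6):
    print(f"attempt {attempt}: wait ~{backoff_ms(attempt):.0f} ms")
```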
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251 ] Yu Yang edited comment on KAFKA-7304 at 9/30/18 6:17 AM
But at the same time, we still hit the high cpu usage issue that is documented in KAFKA-7364. Did you see a high cpu usage issue in your case? The following is the gc chart on a broker with kafka 2.0 changes up to [https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4] [http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3] !Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500! The following is the cpu usage chart of our cluster. The cpu usage jumped to almost 100% after enabling TLS-based writing to the cluster. !Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500! > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Critical > Fix For: 1.1.2, 2.0.1, 2.1.0 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at > 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot > 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, > Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 > AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at > 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot > 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, > Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 > PM.png > > > We are testing secured writing to kafka through ssl. Testing at a small scale, > ssl writing to kafka was fine. However, when we enabled ssl writing at a > larger scale (>40k clients writing concurrently), the kafka brokers soon hit an > OutOfMemory issue with a 4G memory setting. We have tried increasing the > heap size to 10Gb, but encountered the same issue. 
> We took a few heap dumps, and found that most of the heap memory is > referenced through org.apache.kafka.common.network.Selector objects. There > are two channel map fields in Selector. It seems that somehow the objects are > not deleted from the maps in a timely manner. > One observation is that the memory leak seems related to kafka partition > leader changes. If a broker restart or similar event in the cluster causes a > partition leadership change, the brokers may hit the OOM issue faster. > {code} > private final Map<String, KafkaChannel> channels; > private final Map<String, KafkaChannel> closingChannels; > {code} > Please see the attached images and the following link for a sample gc > analysis. > http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 > the command line for running
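The leak pattern described above (closed channels lingering in the Selector's two maps) can be illustrated with a toy model. Everything below is invented for illustration except the field names `channels` and `closingChannels`; it is not the actual Selector code, only a sketch of how a "move on close, purge later" pattern retains memory when the purge condition never fires:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not the real Selector): close() moves an entry
// from channels to closingChannels instead of deleting it, and the entry is
// only purged once its pending sends drain. If peers never drain (e.g.
// during leadership churn), closingChannels grows without bound.
public class ChannelMapSketch {
    static class Channel {
        final String id;
        int pendingSends;
        Channel(String id, int pendingSends) { this.id = id; this.pendingSends = pendingSends; }
    }

    final Map<String, Channel> channels = new HashMap<>();
    final Map<String, Channel> closingChannels = new HashMap<>();

    void connect(String id, int pendingSends) {
        channels.put(id, new Channel(id, pendingSends));
    }

    // Close moves the entry rather than deleting it outright,
    // so the memory is only released later.
    void close(String id) {
        Channel ch = channels.remove(id);
        if (ch != null && ch.pendingSends > 0) {
            closingChannels.put(id, ch);   // lingers until sends complete
        }
    }

    void completeSend(String id) {
        Channel ch = closingChannels.get(id);
        if (ch != null && --ch.pendingSends == 0) {
            closingChannels.remove(id);    // the only path that frees it
        }
    }

    public static void main(String[] args) {
        ChannelMapSketch s = new ChannelMapSketch();
        for (int i = 0; i < 1000; i++) {
            s.connect("client-" + i, 1);
            s.close("client-" + i);        // sends never complete
        }
        // All 1000 closed channels are still pinned in closingChannels.
        System.out.println(s.closingChannels.size());
    }
}
```

A heap dump of this toy program would show the same signature as the report: most retained memory reachable through the two maps.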
[jira] [Updated] (KAFKA-7450) kafka "Handshake message sequence violation" leads to high cpu usage
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Summary: kafka "Handshake message sequence violation" leads to high cpu usage (was: kafka "Handshake message sequence violation" failure ) > kafka "Handshake message sequence violation" leads to high cpu usage > > > Key: KAFKA-7450 > URL: https://issues.apache.org/jira/browse/KAFKA-7450 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 2.0.0 >Reporter: Yu Yang >Priority: Major > > After updating security.inter.broker.protocol to SSL for our cluster, we > observed that the controller can get into almost 100% cpu usage. > {code} > listeners=PLAINTEXT://:9092,SSL://:9093 > security.inter.broker.protocol=SSL > {code} > There is no obvious error in server.log. But in controller.log, there is a > repetitive SSL handshake failure error as below: > {code} > [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] > Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 > (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread) > org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake > failed > Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence > violation, 2 > at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487) > at > sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) > at > sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813) > at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) > at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) > at > org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468) > at > org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331) > at > org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258) > at > 
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125) > at > org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487) > at org.apache.kafka.common.network.Selector.poll(Selector.java:425) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) > at > org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73) > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence > violation, 2 > at > sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196) > at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) > at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) > at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) > at java.security.AccessController.doPrivileged(Native Method) > at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) > at > org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393) > at > org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473) > ... 
10 more > {code} > {code} > [2018-09-30 00:30:13,609] WARN [ReplicaFetcher replicaId=59, leaderId=66, > fetcherId=0] Error in response for fetch request (type=FetchRequest, > replicaId=59, maxWait=500, minBytes=1, maxBytes=10485760, > fetchData={the_test_topic-18=(offset=462333447, logStartOffset=462286948, > maxBytes=4194304), the_test_topic-58=(offset=462312762, > logStartOffset=462295078, maxBytes=4194304)}, > isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1991153671, > epoch=INITIAL)) (kafka.server.ReplicaFetcherThread) > org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake > failed > Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence > violation, 2 > at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1538) > at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) > at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813) > at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) > at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) > at > org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468) > at >
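The issue's title describes a thread stuck in an infinite loop after the handshake failure. A plausible sketch of that symptom (hypothetical code, not the actual RequestSendThread): a send thread that retries a failing SSL connection with no delay spins as fast as the handshake can fail, which matches the near-100% cpu observation. A capped exponential backoff, as sketched below with invented names, bounds the retry rate:

```java
// Sketch (hypothetical): compute a retry delay that doubles per failed
// attempt and is capped, so repeated handshake failures cannot turn into
// a busy loop. Not the actual Kafka controller code.
public class BackoffSketch {
    static long backoffMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs << Math.min(attempt, 16);  // clamp shift to avoid overflow
        return Math.min(delay, maxMs);                 // cap the delay
    }

    public static void main(String[] args) {
        // After a handful of failures the delay caps out instead of spinning.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println(backoffMs(attempt, 50, 1000));
        }
    }
}
```

Without such a delay between `brokerReady` attempts, each failed handshake immediately triggers the next one, keeping a core saturated.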
[jira] [Updated] (KAFKA-7450) kafka "Handshake message sequence violation" failure
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Summary: kafka "Handshake message sequence violation" failure (was: kafka RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers)
[jira] [Updated] (KAFKA-7450) kafka RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Summary: kafka RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers (was: kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers)
[jira] [Updated] (KAFKA-7450) kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers
[ https://issues.apache.org/jira/browse/KAFKA-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7450: --- Description: After updating security.inter.broker.protocol to SSL for our cluster, we observed that the controller can get into almost 100% cpu usage. {code} listeners=PLAINTEXT://:9092,SSL://:9093 security.inter.broker.protocol=SSL {code}
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251 ] Yu Yang edited comment on KAFKA-7304 at 9/30/18 5:52 AM.
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633251#comment-16633251 ] Yu Yang commented on KAFKA-7304: [~rsivaram] Tested with the latest kafka 2.0 branch code, using d2.2x instances, 16g max heap size for the kafka process, and ~30k producers. With the 16gb heap size, we did not see frequent gc. But at the same time, we still hit the high cpu usage issue that is documented in KAFKA-7364. Did you see a high cpu usage issue in your case? The following is the gc chart on a broker with kafka 2.0 changes up to [https://github.com/apache/kafka/commit/74c8b831472ed07e10ceda660e0e504a6a6821c4] [http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMzAvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTM3LTQ3] !Screen Shot 2018-09-29 at 10.38.12 PM.png|width=500! The following is the cpu usage chart of our cluster during this period of time: !Screen Shot 2018-09-29 at 10.38.38 PM.png|width=500px! > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Critical > Fix For: 1.1.2, 2.0.1, 2.1.0 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at > 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot > 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, > Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 > AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at > 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot > 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, > Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 > PM.png > > > We are testing secured writing to kafka through ssl. Testing at a small scale, > ssl writing to kafka was fine. 
However, when we enabled ssl writing at a > larger scale (>40k clients writing concurrently), the kafka brokers soon hit an > OutOfMemory issue with a 4G memory setting. We have tried increasing the > heap size to 10Gb, but encountered the same issue. > We took a few heap dumps, and found that most of the heap memory is > referenced through org.apache.kafka.common.network.Selector objects. There > are two channel map fields in Selector. It seems that somehow the objects are > not deleted from the maps in a timely manner. > One observation is that the memory leak seems related to kafka partition > leader changes. If a broker restart or similar event in the cluster causes a > partition leadership change, the brokers may hit the OOM issue faster. > {code} > private final Map<String, KafkaChannel> channels; > private final Map<String, KafkaChannel> closingChannels; > {code} > Please see the attached images and the following link for a sample gc > analysis. > http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 > the command line for running kafka: > {code} > java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m > -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC > -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 > -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 > -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M > -Djava.awt.headless=true > -Dlog4j.configuration=file:/etc/kafka/log4j.properties > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.port= > -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* > kafka.Kafka /etc/kafka/server.properties > {code} > We use java 1.8.0_102, and have applied a TLS patch on reducing > X509Factory.certCache map size 
from 750 to 20. > {code} > java -version > java version "1.8.0_102" > Java(TM) SE Runtime Environment (build 1.8.0_102-b14) > Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
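The certCache patch mentioned above bounds a cache that would otherwise pin certificate objects in memory. The actual patch changed a size constant inside the JDK's X509Factory; the bounding idea itself can be sketched generically (all names below are invented for illustration) with a size-capped LRU map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only, not the JDK patch: a cache whose size is capped,
// evicting the least-recently-used entry once the cap is exceeded.
public class BoundedCertCache extends LinkedHashMap<String, byte[]> {
    private final int maxEntries;

    public BoundedCertCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder=true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries;   // evict once the cap is exceeded
    }

    public static void main(String[] args) {
        BoundedCertCache cache = new BoundedCertCache(20);   // cap analogous to the 750 -> 20 change
        for (int i = 0; i < 100; i++) {
            cache.put("cert-" + i, new byte[0]);
        }
        System.out.println(cache.size());   // stays at the cap
    }
}
```

Shrinking such a cap trades a higher cache-miss rate for a much smaller worst-case retained heap, which is the trade-off the 750-to-20 change makes.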
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-09-29 at 10.38.38 PM.png > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 1.1.0, 1.1.1 > Reporter: Yu Yang > Priority: Critical > Fix For: 1.1.2, 2.0.1, 2.1.0 > > Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 2018-08-29 at 10.50.47 AM.png, Screen Shot 2018-09-29 at 10.38.12 PM.png, Screen Shot 2018-09-29 at 10.38.38 PM.png, Screen Shot 2018-09-29 at 8.34.50 PM.png > > > We are testing secured writing to kafka through ssl. Testing at a small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writing concurrently), the kafka brokers soon hit an OutOfMemory issue with a 4G heap setting. We tried increasing the heap size to 10 GB, but encountered the same issue. > We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel map fields in Selector. It seems that the objects are somehow not deleted from these maps in a timely manner. > One observation is that the memory leak seems related to kafka partition leader changes. If a broker restart etc. in the cluster causes partition leadership changes, the brokers may hit the OOM issue faster.
> {code} > private final Map<String, KafkaChannel> channels; > private final Map<String, KafkaChannel> closingChannels; > {code} > Please see the attached images and the following link for a sample gc analysis: > http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 > The command line for running kafka: > {code} > java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m > -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC > -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 > -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 > -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M > -Djava.awt.headless=true > -Dlog4j.configuration=file:/etc/kafka/log4j.properties > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.port= > -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* > kafka.Kafka /etc/kafka/server.properties > {code} > We use Java 1.8.0_102 and have applied a TLS patch that reduces the X509Factory.certCache map size from 750 to 20. > {code} > java -version > java version "1.8.0_102" > Java(TM) SE Runtime Environment (build 1.8.0_102-b14) > Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
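The bookkeeping pattern behind the two maps can be illustrated with a hypothetical sketch (this is not Kafka's actual Selector code; names and types are simplified): a channel that begins closing moves from `channels` to `closingChannels` and stays strongly referenced until its close completes. If close completion is delayed or never happens, entries accumulate, which matches the heap-dump observation above.

```java
import java.util.HashMap;
import java.util.Map;

public class ClosingChannelsSketch {
    // Simplified stand-ins for Selector's channels / closingChannels maps.
    private final Map<String, Object> channels = new HashMap<>();
    private final Map<String, Object> closingChannels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new Object());
    }

    // Begin closing: the channel is parked in closingChannels, still referenced.
    void beginClose(String id) {
        closingChannels.put(id, channels.remove(id));
    }

    // Only completing the close drops the last reference.
    void completeClose(String id) {
        closingChannels.remove(id);
    }

    int retained() {
        return channels.size() + closingChannels.size();
    }

    public static void main(String[] args) {
        ClosingChannelsSketch s = new ClosingChannelsSketch();
        for (int i = 0; i < 1000; i++) {
            String id = "conn-" + i;
            s.connect(id);
            s.beginClose(id); // close initiated but never completed
        }
        // Every channel object is still reachable via closingChannels.
        System.out.println(s.retained());
    }
}
```

If leadership changes cause many connections to enter (but not finish) the closing state at once, this retention pattern alone would explain heap growth proportional to client count.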
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-09-29 at 10.38.12 PM.png
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-09-29 at 8.34.50 PM.png
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633137#comment-16633137 ] Yu Yang commented on KAFKA-7304: Thanks [~rsivaram]! We will try these fixes and let you know the result.
[jira] [Created] (KAFKA-7450) kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers
Yu Yang created KAFKA-7450: -- Summary: kafka controller RequestSendThread stuck in infinite loop after SSL handshake failure with peer brokers Key: KAFKA-7450 URL: https://issues.apache.org/jira/browse/KAFKA-7450 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 2.0.0 Reporter: Yu Yang After updating security.inter.broker.protocol to SSL for our cluster, we observed that the controller can get to almost 100% cpu usage. {code} listeners=PLAINTEXT://:9092,SSL://:9093 security.inter.broker.protocol=SSL {code} There is no obvious error in server.log. But in controller.log, there are repetitive SSL handshake failure errors as below: {code} [2018-09-28 05:53:10,821] WARN [RequestSendThread controllerId=6042] Controller 6042's connection to broker datakafka06176.ec2.pin220.com:9093 (id: 6176 rack: null) was unsuccessful (kafka.controller.RequestSendThread) org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2 at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1487) at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813) at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468) at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331) at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258) at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487) at org.apache.kafka.common.network.Selector.poll(Selector.java:425) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) at 
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73) at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) Caused by: javax.net.ssl.SSLProtocolException: Handshake message sequence violation, 2 at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:196) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) at java.security.AccessController.doPrivileged(Native Method) at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393) at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473) ... 10 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
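The tight retry loop described above has a conventional mitigation: exponential backoff between failed connection attempts, so a persistently failing handshake costs bounded CPU. This is a hypothetical sketch, not Kafka's RequestSendThread code; the method name and constants are illustrative.

```java
public class BackoffSketch {
    // Delay grows exponentially with the attempt number, capped at maxMs.
    static long backoffMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs * (1L << Math.min(attempt, 20)); // cap shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        // First six retries: 100, 200, 400, 800, 1600, 3200 ms
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(backoffMs(attempt, 100, 30_000));
        }
    }
}
```

Sleeping for `backoffMs(attempt, …)` before each reconnection attempt would keep the controller thread responsive to shutdown while preventing the 100% CPU spin.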
[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with high concurrent ssl writing
[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7364: --- Summary: kafka periodically run into high cpu usage with high concurrent ssl writing (was: kafka periodically run into high cpu usage with high concurent ssl writing) > kafka periodically run into high cpu usage with high concurrent ssl writing > --- > > Key: KAFKA-7364 > URL: https://issues.apache.org/jira/browse/KAFKA-7364 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 2.0.0 > Reporter: Yu Yang > Priority: Major > Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png > > > While testing ssl writing to kafka, we found that kafka often runs into high cpu usage due to inefficiency in the jdk ssl implementation. > In detail, we use a test cluster of 12 d2.8xlarge instances that runs kafka 2.0.0 and jdk-10.0.2, and hosts only one topic that ~20k producers write to through an ssl channel. We observed that the network threads often hit 100% cpu usage after enabling ssl writing to kafka. To improve kafka's throughput, we set "num.network.threads=32" for the broker. Even with 32 network threads, we see the broker cpu usage jump right after ssl writing is enabled. The broker's cpu usage drops immediately when we disable ssl writing. > !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! > When the broker's cpu usage is high, 'perf top' shows that kafka is busy executing code in libsunec.so. The following is a sample stack trace that we got when the broker's cpu usage was high. This seems to be related to inefficiency in the jdk ssl implementation. Switching to https://github.com/netty/netty-tcnative to handle the ssl handshake could be helpful. 
> {code} > Thread 77562: (state = IN_NATIVE) > - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], > byte[]) @bci=0 (Compiled frame; information may be imprecise) > - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 > (Compiled frame) > - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 > (Compiled frame) > - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame) > - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, > java.lang.String) @bci=136, line=444 (Compiled frame) > - > sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) > @bci=48, line=166 (Compiled frame) > - > sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, > java.util.Collection) @bci=24, line=147 (Compiled frame) > - > sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, > java.util.List, java.util.List) @bci=316, line=125 (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, > sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 > (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) > @bci=217, line=141 (Compiled frame) > - > sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, > java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame) > - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, > java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame) > - > sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], > java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame) > - > sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], > 
java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) > @bci=232, line=259 (Compiled frame) > - > sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], > java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) > @bci=6, line=260 (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, > java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, > java.lang.String) @bci=10, line=324 (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], > java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 > (Compiled frame) > - > sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], > java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame) > - >
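The stack trace above bottoms out in ECDSA certificate-signature verification. A small stdlib check (not part of the original report) shows which JCA provider supplies the `SHA256withECDSA` algorithm seen in that path; on stock Oracle/OpenJDK builds this is SunEC, whose hot paths live in the native libsunec.so that `perf top` flagged.

```java
import java.security.Signature;

public class EcProviderCheck {
    public static void main(String[] args) throws Exception {
        // Ask the JCA which provider will actually perform ECDSA verification.
        Signature sig = Signature.getInstance("SHA256withECDSA");
        System.out.println(sig.getProvider().getName());
    }
}
```

If the printed provider is SunEC, the signature verification dominating the CPU profile is going through the JDK's built-in EC implementation rather than an OpenSSL-backed one such as netty-tcnative.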
[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with high concurent ssl writing
[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7364: --- Summary: kafka periodically run into high cpu usage with high concurent ssl writing (was: kafka periodically run into high cpu usage with ssl writing)
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602415#comment-16602415 ] Yu Yang commented on KAFKA-7304: [~rsivaram] Thanks for looking into the issue! We are still evaluating whether Ted's patch makes a difference. I am testing ssl writing at a smaller scale now. gceasy reports that some brokers running jdk 10.0.2 + kafka 2.0 with Ted's patch have memory leakage: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMy8tLWdjLmxvZy5nei0tMTktMS01. Meanwhile, brokers running jdk 1.8u172 seem fine: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDkvMy8tLWdjLmxvZy4xLmN1cnJlbnQuZ3otLTE5LTktMjc= . We used the default value (10 minutes) for `connections.max.idle.ms`. I also tried setting `connections.max.idle.ms` to 1 minute and to 30 seconds; a shorter connections.max.idle.ms did not help. When we did experiments with broker restarts, all brokers that were not restarted had been up for longer than `connections.max.idle.ms`, yet the heap memory usage for those brokers did not drop. The failed authentications were not expected; it is not clear to me how they happened. 
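The `connections.max.idle.ms` overrides tried in the comment above can be expressed as client configuration. A minimal sketch, with a placeholder bootstrap host and illustrative values; `connections.max.idle.ms` is a real Kafka client setting, but the specific numbers here are only examples:

```java
import java.util.Properties;

public class IdleConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093"); // placeholder host
        props.put("security.protocol", "SSL");
        // Reap idle connections after 1 minute instead of the default.
        props.put("connections.max.idle.ms", "60000");
        System.out.println(props.getProperty("connections.max.idle.ms"));
    }
}
```

Note that the broker has its own `connections.max.idle.ms`; as the comment observes, shortening the idle timeout on either side did not release the leaked channel objects, which points at the close path rather than idle detection.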
[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing
[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7364: --- Description: While testing SSL writing to Kafka, we found that brokers often run into high CPU usage, apparently due to inefficiency in the JDK SSL implementation. In detail, we use a test cluster of 12 d2.8xlarge instances running Kafka 2.0.0 on jdk-10.0.2, hosting a single topic that ~20k producers write to over an SSL channel. We observed that the network threads often reach 100% CPU usage after enabling SSL writes. To improve throughput, we set "num.network.threads=32" on the brokers. Even with 32 network threads, broker CPU usage jumps right after SSL writing is enabled, and drops immediately when we disable it. !Screen Shot 2018-08-30 at 10.57.32 PM.png|height=360! When broker CPU usage is high, 'perf top' shows that Kafka is busy executing code in libsunec.so. The following is a sample stack trace captured while broker CPU usage was high. This seems to be related to inefficiency in the JDK SSL implementation; switching to https://github.com/netty/netty-tcnative to handle the SSL handshake could be helpful.
{code}
Thread 77562: (state = IN_NATIVE)
 - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], byte[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 (Compiled frame)
 - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 (Compiled frame)
 - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
 - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, java.lang.String) @bci=136, line=444 (Compiled frame)
 - sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) @bci=48, line=166 (Compiled frame)
 - sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, java.util.Collection) @bci=24, line=147 (Compiled frame)
 - sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
 - sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 (Compiled frame)
 - sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) @bci=217, line=141 (Compiled frame)
 - sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
 - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
 - sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
 - sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=232, line=259 (Compiled frame)
 - sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=6, line=260 (Compiled frame)
 - sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, java.lang.String) @bci=10, line=324 (Compiled frame)
 - sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 (Compiled frame)
 - sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg) @bci=190, line=1966 (Compiled frame)
 - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, line=237 (Compiled frame)
 - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
 - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
 - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() @bci=13, line=393 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) @bci=88, line=473 (Compiled frame)
 - org.apache.kafka.common.network.SslTransportLayer.doHandshake() @bci=570, line=331 (Compiled frame)
{code}
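The hot frames above sit in ECDSA certificate-signature verification (sun.security.ec.ECDSASignature.engineVerify), which runs for every certificate in the chain on every handshake. As a minimal, self-contained sketch of that same verify path using only the JDK (the class name EcdsaVerifyDemo and the P-256/SHA256withECDSA choices are illustrative assumptions, not taken from the issue):

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class EcdsaVerifyDemo {
    // One sign/verify round trip with the kind of algorithm the sampled
    // frames are executing; returns whether verification passed.
    static boolean verifyRoundTrip() throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");
        kpg.initialize(256); // NIST P-256, a common TLS client-cert curve
        KeyPair kp = kpg.generateKeyPair();

        byte[] msg = "handshake".getBytes(StandardCharsets.UTF_8);
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(kp.getPrivate());
        signer.update(msg);
        byte[] sig = signer.sign();

        // This verify() call is where the sampled thread spends its time
        // (java.security.Signature.verify -> ECDSASignature.engineVerify).
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(kp.getPublic());
        verifier.update(msg);
        return verifier.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(verifyRoundTrip());
    }
}
```

Running 'perf top' against a tight loop over verifyRoundTrip() should surface the same libsunec.so hotspot on JDKs whose SunEC provider implements EC operations in native code.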
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598306#comment-16598306 ] Yu Yang commented on KAFKA-7304: We experimented with 1.8.0_171 and did not see an obvious improvement in reducing SSL-related memory usage. We also experimented with jdk 10.0.2. Currently we see two issues with SSL writing to Kafka: 1) there is some potential resource leakage in Kafka; the leakage might already have been fixed by [~yuzhih...@gmail.com]'s patch. 2) Kafka may hit high CPU usage when a large number of clients write to Kafka through SSL channels; see https://issues.apache.org/jira/browse/KAFKA-7364 for details. This seems to be related to inefficiency in the JDK SSL implementation. Switching to https://github.com/netty/netty-tcnative to handle the SSL handshake might be helpful.
> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Yu Yang
> Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to Kafka through SSL. At small scale, SSL writing to Kafka was fine. However, when we enabled SSL writing at a larger scale (>40k clients writing concurrently), the Kafka brokers soon hit OutOfMemory with a 4G heap setting. We tried increasing the heap size to 10GB, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel map fields in Selector, and it seems the objects are somehow not removed from the maps in a timely manner.
> One observation is that the memory leak seems related to Kafka partition leadership changes. If a broker restart or similar event in the cluster causes partition leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample GC analysis: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running Kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch that reduces the X509Factory.certCache map size from 750 to 20.
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
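The retention pattern described in KAFKA-7304 can be shown with a deliberately simplified sketch. This is hypothetical illustration code, not Kafka's actual Selector (the name ToySelector and all method names are invented): entries move from the live map to a closing map on a graceful close, so a connection that dies before the close completes leaves its channel, and everything it references, pinned on the heap.

```java
import java.util.HashMap;
import java.util.Map;

public class ToySelector {
    // Mirrors the two-map shape the heap dumps pointed at: live channels
    // plus channels that are in the middle of a graceful close.
    private final Map<String, byte[]> channels = new HashMap<>();
    private final Map<String, byte[]> closingChannels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new byte[1 << 20]); // per-connection buffers pin heap
    }

    // Graceful close: entry migrates to closingChannels until fully drained.
    void beginClose(String id) {
        byte[] ch = channels.remove(id);
        if (ch != null) closingChannels.put(id, ch);
    }

    void finishClose(String id) {
        closingChannels.remove(id);
    }

    // If an abrupt disconnect (e.g. during a leadership change) never reaches
    // finishClose(), the entry stays referenced and the heap grows.
    int retained() {
        return channels.size() + closingChannels.size();
    }

    public static void main(String[] args) {
        ToySelector s = new ToySelector();
        s.connect("client-1");
        s.beginClose("client-1");
        // finishClose("client-1") is never called: connection died mid-close
        System.out.println(s.retained()); // 1: the channel is still retained
    }
}
```

Many clients reconnecting at once, as happens on partition leadership changes, multiplies this retained set, which is consistent with the observation that leadership changes accelerate the OOM.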
[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing
[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7364: --- Attachment: Screen Shot 2018-08-30 at 10.57.32 PM.png > kafka periodically run into high cpu usage with ssl writing > --- > > Key: KAFKA-7364 > URL: https://issues.apache.org/jira/browse/KAFKA-7364 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.0.0 >Reporter: Yu Yang >Priority: Major > Attachments: Screen Shot 2018-08-30 at 10.57.32 PM.png > > > while testing ssl writing to kafka, we found that kafka often run into high > cpu usage due to inefficiency in jdk ssl implementation. > In detail, we use a test cluster that have 12 d2.8xlarge instances, > jdk-10.0.2, and hosts only one topic that have ~20k producers write to > through ssl channel. We observed that the network threads often get 100% cpu > usage after enabling ssl writing to kafka. To improve kafka's throughput, > we have "num.network.threads=32" for the broker. Even with 32 network > threads, we see the broker cpu usage jump right after ssl writing is enabled. > !Screen Shot 2018-08-30 at 10.57.32 PM.png! > When the broker's cpu usage is high, 'perf top' shows that kafka is busy with > executing code in libsunec.so. The following is a sample stack track that > we get when the broker's cpu usage was high. 
> {code}
> Thread 77562: (state = IN_NATIVE)
>  - sun.security.ec.ECDSASignature.verifySignedDigest(byte[], byte[], byte[], byte[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.security.ec.ECDSASignature.engineVerify(byte[]) @bci=70, line=321 (Compiled frame)
>  - java.security.Signature$Delegate.engineVerify(byte[]) @bci=9, line=1222 (Compiled frame)
>  - java.security.Signature.verify(byte[]) @bci=10, line=655 (Compiled frame)
>  - sun.security.x509.X509CertImpl.verify(java.security.PublicKey, java.lang.String) @bci=136, line=444 (Compiled frame)
>  - sun.security.provider.certpath.BasicChecker.verifySignature(java.security.cert.X509Certificate) @bci=48, line=166 (Compiled frame)
>  - sun.security.provider.certpath.BasicChecker.check(java.security.cert.Certificate, java.util.Collection) @bci=24, line=147 (Compiled frame)
>  - sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(java.security.cert.CertPath, java.util.List, java.util.List) @bci=316, line=125 (Compiled frame)
>  - sun.security.provider.certpath.PKIXCertPathValidator.validate(java.security.cert.TrustAnchor, sun.security.provider.certpath.PKIX$ValidatorParams) @bci=390, line=233 (Compiled frame)
>  - sun.security.provider.certpath.PKIXCertPathValidator.validate(sun.security.provider.certpath.PKIX$ValidatorParams) @bci=217, line=141 (Compiled frame)
>  - sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=7, line=80 (Compiled frame)
>  - java.security.cert.CertPathValidator.validate(java.security.cert.CertPath, java.security.cert.CertPathParameters) @bci=6, line=292 (Compiled frame)
>  - sun.security.validator.PKIXValidator.doValidate(java.security.cert.X509Certificate[], java.security.cert.PKIXBuilderParameters) @bci=34, line=357 (Compiled frame)
>  - sun.security.validator.PKIXValidator.engineValidate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=232, line=259 (Compiled frame)
>  - sun.security.validator.Validator.validate(java.security.cert.X509Certificate[], java.util.Collection, java.security.AlgorithmConstraints, java.lang.Object) @bci=6, line=260 (Compiled frame)
>  - sun.security.ssl.X509TrustManagerImpl.validate(sun.security.validator.Validator, java.security.cert.X509Certificate[], java.security.AlgorithmConstraints, java.lang.String) @bci=10, line=324 (Compiled frame)
>  - sun.security.ssl.X509TrustManagerImpl.checkTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine, boolean) @bci=179, line=279 (Compiled frame)
>  - sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(java.security.cert.X509Certificate[], java.lang.String, javax.net.ssl.SSLEngine) @bci=5, line=130 (Compiled frame)
>  - sun.security.ssl.ServerHandshaker.clientCertificate(sun.security.ssl.HandshakeMessage$CertificateMsg) @bci=190, line=1966 (Compiled frame)
>  - sun.security.ssl.ServerHandshaker.processMessage(byte, int) @bci=160, line=237 (Compiled frame)
>  - sun.security.ssl.Handshaker.processLoop() @bci=96, line=1052 (Compiled frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=4, line=992 (Compiled frame)
>  - sun.security.ssl.Handshaker$1.run() @bci=1, line=989 (Compiled frame)
>  - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
>  - sun.security.ssl.Handshaker$DelegatedTask.run() @bci=24, line=1467 (Compiled frame)
>  - org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks() @bci=13, line=393 (Compiled frame)
>  - org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(boolean) @bci=88, line=473 (Compiled frame)
> {code}
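The hottest frames in the trace above sit in ECDSA signature verification inside the JDK's certificate-path validation, which runs for every full TLS handshake with client certificates. A rough micro-benchmark of raw SHA256withECDSA verification (a standalone sketch, not Kafka code) gives a feel for the CPU floor that each handshake pays per certificate in the chain:

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class EcdsaVerifyCost {
    public static void main(String[] args) throws Exception {
        // Generate a P-256 EC key pair, comparable to keys in TLS client certs
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");
        kpg.initialize(256);
        KeyPair kp = kpg.generateKeyPair();

        // Sign a fixed payload once
        byte[] payload = new byte[64];
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(kp.getPrivate());
        signer.update(payload);
        byte[] sig = signer.sign();

        // Each full handshake with client auth performs at least one such
        // verification per chain certificate; time N verifications.
        int n = 5_000;
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        boolean allValid = true;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            verifier.initVerify(kp.getPublic());
            verifier.update(payload);
            allValid &= verifier.verify(sig);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("all valid: " + allValid);
        System.out.println(n + " verifications took " + elapsedMs + " ms");
    }
}
```

With ~20k producers reconnecting through full handshakes, multiplying this per-verification cost by the handshake rate suggests why a few network threads can be pinned at 100% even before any record encryption happens.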
[jira] [Updated] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing
[ https://issues.apache.org/jira/browse/KAFKA-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7364:
---------------------------
    Attachment:     (was: Screen Shot 2018-08-30 at 10.57.32 PM.png)
[jira] [Created] (KAFKA-7364) kafka periodically run into high cpu usage with ssl writing
Yu Yang created KAFKA-7364:
------------------------------

             Summary: kafka periodically run into high cpu usage with ssl writing
                 Key: KAFKA-7364
                 URL: https://issues.apache.org/jira/browse/KAFKA-7364
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.0.0
            Reporter: Yu Yang
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596663#comment-16596663 ]

Yu Yang commented on KAFKA-7304:
--------------------------------

[~yuzhih...@gmail.com] I applied https://issues.apache.org/jira/secure/attachment/12937151/7304.v7.txt to our test cluster and did more experiments yesterday. We did not observe any of the channel closing/removal log messages that the patch adds. I took another memory dump on a test host. This time the memory analyzer reports a suspected leak in `sun.security.ssl.SSLSessionImpl`.

> memory leakage in org.apache.kafka.common.network.Selector
> ----------------------------------------------------------
>
>                 Key: KAFKA-7304
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7304
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.1.0, 1.1.1
>            Reporter: Yu Yang
>            Priority: Critical
>             Fix For: 1.1.2, 2.0.1, 2.1.0
>
>         Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. At small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writing concurrently), the kafka brokers soon hit OutOfMemory with a 4GB heap setting. We tried increasing the heap size to 10GB, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel map fields in Selector. It seems that somehow the objects are not deleted from the maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition leader changes. If a broker restart etc. in the cluster causes partition leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample gc analysis.
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a TLS patch reducing the X509Factory.certCache map size from 750 to 20.
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
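The channels/closingChannels pattern quoted in the issue can be illustrated with a deliberately simplified sketch (this is not Kafka's actual Selector code): a registry that removes entries only on a clean close keeps per-channel state strongly reachable for every connection that disappears abruptly, which matches the observation that broker restarts and leadership changes accelerate the OOM.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of a map-based connection registry. Entries are
// removed only on a clean close; connections that vanish without one (peer
// crash, broker termination) leave their entry -- and whatever it references,
// such as SSL session buffers -- strongly reachable forever.
public class ChannelRegistry {
    private final Map<String, byte[]> channels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new byte[16 * 1024]); // stand-in for per-channel SSL buffers
    }

    void closeCleanly(String id) {
        channels.remove(id);
    }

    int live() {
        return channels.size();
    }

    public static void main(String[] args) {
        ChannelRegistry r = new ChannelRegistry();
        for (int i = 0; i < 1000; i++) r.connect("client-" + i);
        // Only half the clients disconnect cleanly; the rest vanish abruptly,
        // but their map entries (and buffers) remain.
        for (int i = 0; i < 500; i++) r.closeCleanly("client-" + i);
        System.out.println("entries still held: " + r.live()); // prints: entries still held: 500
    }
}
```

In a real broker the fix is to also reap entries on disconnect detection or idle timeout, which is what the attached patches on this ticket probe for.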
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7304:
---------------------------
    Attachment: Screen Shot 2018-08-29 at 10.49.03 AM.png
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7304:
---------------------------
    Attachment: Screen Shot 2018-08-29 at 10.50.47 AM.png
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7304:
---------------------------
    Attachment: Screen Shot 2018-08-28 at 11.09.45 AM.png
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ]

Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:28 AM:
---------------------------------------------------------

After more experiments, we currently think the issue is caused by too many idle ssl connections that are not closed on time. I set up a test cluster of 24 brokers using d2.8xlarge instances, allocated 32GB of heap for the kafka process, and had ~40k clients write to a test topic on this cluster. The following reports show the jvm heap usage and gc activity over roughly the past 24 hours. The cluster ran fine with low heap usage and low cpu usage. However, broker heap and cpu usage increased sharply when we added or terminated brokers in the cluster (for broker termination, no topic partitions were allocated on the terminated nodes).

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS03LTYtMzU=

Sometimes the cluster can be recovered by turning off the ssl writing traffic, letting the brokers garbage collect the objects in the old gen, and resuming the ssl writing traffic. Sometimes the cluster still could not fully recover, due to a dramatic increase in heap usage and high cpu usage once the ssl writing traffic was turned back on.
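For the idle-connection hypothesis above, one relevant broker setting is Kafka's `connections.max.idle.ms`, which closes connections that have been idle longer than the configured value (the default is 600000 ms, i.e. 10 minutes). Lowering it is a possible mitigation sketch for capping how long abandoned SSL connections hold heap; the specific value below is illustrative and was not validated in this ticket:

```
# server.properties -- close idle connections after 2 minutes instead of 10
connections.max.idle.ms=120000
```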
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ] Yu Yang edited comment on KAFKA-7304 at 8/28/18 7:26 AM:

After more experiments, we currently think that the issue is caused by too many idle SSL connections that are not closed on time. I set up a test cluster of 24 brokers on d2.8xlarge instances, allocated 32GB of heap space to the Kafka process, and had ~40k clients writing to a test topic on the cluster. The GC report below shows the JVM heap usage and GC activity over roughly the past 24 hours. The cluster ran fine with low heap usage and low CPU usage. However, the heap usage and CPU usage of the brokers increased sharply when we added or terminated brokers in the cluster (for the broker terminations, no topic partitions were allocated on the terminated nodes).

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjgvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS01LTQzLTU=

Sometimes the cluster can be recovered by turning off the SSL write traffic to the cluster, letting the brokers garbage collect the objects in the old generation, and then resuming the SSL write traffic. Sometimes the cluster still could not recover fully, because heap usage and CPU usage rose sharply again when the SSL write traffic was turned back on.

> memory leakage in org.apache.kafka.common.network.Selector
> --
>
> Key: KAFKA-7304
> URL: https://issues.apache.org/jira/browse/KAFKA-7304
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Yu Yang
> Priority: Critical
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
> Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
> We are testing secured writes to Kafka through SSL. At small scale, SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger scale (>40k clients writing concurrently), the Kafka brokers soon hit OutOfMemory errors with a 4GB heap setting. We tried increasing the heap size to 10GB, but encountered the same issue.
> We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel maps in Selector; it seems that the objects are somehow not removed from these maps in a timely manner.
> One observation is that the memory leak seems related to Kafka partition leader changes: if a broker restart or similar event in the cluster causes partition leadership changes, the brokers may hit the OOM issue faster.
> {code}
> private final Map<String, KafkaChannel> channels;
> private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample GC analysis.
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running Kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties
> {code}
> We use Java 1.8.0_102, and have applied a TLS patch that reduces the X509Factory.certCache map size from 750 to 20.
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
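The leak pattern described in the quoted issue — two channel maps whose entries are only removed by an explicit close — can be sketched as follows. This is an illustrative toy class, not the actual Kafka Selector code: the class and method names are hypothetical, and a plain Object stands in for a KafkaChannel with its SSL buffers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two-map bookkeeping described in the issue. If nothing ever
// calls close() for an idle connection, its entry (and everything it references)
// stays strongly reachable from the map forever.
public class SelectorLeakSketch {
    private final Map<String, Object> channels = new HashMap<>();
    private final Map<String, Object> closingChannels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new Object()); // stands in for a KafkaChannel + SSL buffers
    }

    void close(String id) {
        Object ch = channels.remove(id);
        if (ch != null) {
            closingChannels.put(id, ch); // staged until outstanding work drains
        }
    }

    void finishClose(String id) {
        closingChannels.remove(id);
    }

    int live() {
        return channels.size() + closingChannels.size();
    }

    public static void main(String[] args) {
        SelectorLeakSketch selector = new SelectorLeakSketch();
        // ~40k clients connect; with no idle-expiry pass, nothing is ever freed.
        for (int i = 0; i < 40_000; i++) {
            selector.connect("client-" + i);
        }
        System.out.println(selector.live()); // prints 40000
    }
}
```

With per-connection SSL buffers attached to each retained entry, unbounded growth of these maps matches the heap dumps dominated by Selector objects.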
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594320#comment-16594320 ] Yu Yang commented on KAFKA-7304:

After more experiments, we currently think that the issue is caused by too many idle SSL connections that are not closed on time. I set up a test cluster of 24 brokers on d2.8xlarge instances, allocated 32GB of heap space to the Kafka process, and had ~40k clients writing to a test topic on the cluster. The GC report below shows the JVM heap usage and GC activity over roughly the past 24 hours. The cluster ran fine with low heap usage and low CPU usage. However, the heap usage and CPU usage of the brokers increased sharply when we added or terminated brokers in the cluster (for the broker terminations, no topic partitions were allocated on the terminated nodes).

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster could be recovered after we turned off the SSL write traffic to the cluster, let the brokers garbage collect the objects in the old generation, and resumed the SSL write traffic.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
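One broker setting that bears directly on idle connections not being closed on time is `connections.max.idle.ms`: the broker closes server-side connections that have been idle longer than this many milliseconds (the default is 600000, i.e. 10 minutes). A sketch of lowering it to reap idle SSL connections more aggressively — the value below is illustrative, not a recommendation:

```properties
# server.properties (broker) -- illustrative value, tune for your workload.
# Close connections that have been idle longer than this many milliseconds.
connections.max.idle.ms=120000
```

Lowering this trades faster reclamation of per-connection memory against more frequent reconnects/handshakes from clients that write intermittently.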
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592762#comment-16592762 ] Yu Yang commented on KAFKA-7304:

Thanks for looking into this, [~yuzhih...@gmail.com]! We also ran more experiments on our side with various settings. Based on the initial experiments, it seems that your earlier patch has fixed the resource leakage in the closing channels. Meanwhile, with >40k concurrent SSL connections, the brokers may still run into OOM issues at times because connections are not closed on time. I am currently experimenting with an increased heap size, and will report back to this thread if we have any findings.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304:
---
Affects Version/s: 1.1.1
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: We are testing secured writes to Kafka over SSL. At small scale, SSL writes to Kafka were fine. However, when we enabled SSL writes at a larger scale (>40k clients writing concurrently), the Kafka brokers soon hit OutOfMemory errors with a 4 GB heap setting. We tried increasing the heap size to 10 GB, but encountered the same issue. We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. There are two channel map fields in Selector; it seems the entries are somehow not removed from these maps in a timely manner. One observation is that the memory leak seems related to Kafka partition leadership changes: if a broker restart or similar event in the cluster causes partition leadership to change, the brokers may hit the OOM issue faster.

{code}
private final Map<String, KafkaChannel> channels;
private final Map<String, KafkaChannel> closingChannels;
{code}

Please see the attached images and the following link for a sample GC analysis: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0

The command line for running Kafka:

{code}
java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties
{code}

We use Java 1.8.0_102, and have applied a TLS patch reducing the X509Factory.certCache map size from 750 to 20.

{code}
java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
{code}
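The leak pattern described above can be illustrated with a hypothetical, simplified model (this is not Kafka's actual Selector; class and method names here are invented for illustration). A channel that enters the closing map but never finishes closing stays strongly reachable, so its per-connection buffers can never be garbage-collected:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the two channel maps named in the report.
// Illustrates the failure mode only: entries that are never removed
// keep their buffers reachable and accumulate on the heap.
public class ChannelTracker {
    private final Map<String, byte[]> channels = new HashMap<>();
    private final Map<String, byte[]> closingChannels = new HashMap<>();

    public void register(String id) {
        // Stand-in for the per-connection SSL buffers a real channel holds.
        channels.put(id, new byte[16 * 1024]);
    }

    public void beginClose(String id) {
        // A closing channel is moved, not dropped: it must drain in-flight
        // work first, so it parks in closingChannels until finishClose runs.
        byte[] ch = channels.remove(id);
        if (ch != null) {
            closingChannels.put(id, ch);
        }
    }

    public void finishClose(String id) {
        // The step that must eventually run for every closing channel;
        // if it is skipped (e.g. during a leadership-change storm with
        // tens of thousands of clients reconnecting), the entry leaks.
        closingChannels.remove(id);
    }

    public int retainedChannels() {
        return channels.size() + closingChannels.size();
    }
}
```

With >40k concurrent SSL clients, even a small fraction of channels stuck in the closing map would explain why a 10 GB heap fills at roughly the same rate as a 4 GB one.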
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584156#comment-16584156 ] Yu Yang edited comment on KAFKA-7304 at 8/17/18 4:51 PM: - [~ijuma] We have an internal build that cherry-picks the 1.1.1 changes; I might have missed some fixes. https://github.com/apache/kafka/commits/1.1/clients/src/main/java/org/apache/kafka/common/network/Selector.java shows only two Selector.java-related changes after the 1.1.0 release date of March 23rd. Do you mean the fix for https://issues.apache.org/jira/browse/KAFKA-6529 ? Kafka 1.1.0 has included that change. > memory leakage in org.apache.kafka.common.network.Selector > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
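The per-file history check discussed in the comment can be reproduced with `git log` limited to a revision range and a pathspec. This assumes a local clone of apache/kafka with release tags fetched; the tag names `1.1.0` and `1.1.1` are assumptions about the repository's tagging scheme:

```shell
# List commits that touched Selector.java between the 1.1.0 and 1.1.1 tags.
# Run from the root of a local apache/kafka clone.
git log --oneline 1.1.0..1.1.1 -- \
    clients/src/main/java/org/apache/kafka/common/network/Selector.java
```

The pathspec after `--` restricts the output to commits that modified that one file, which is a quicker check than browsing the commit listing on GitHub.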
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Affects Version/s: (was: 1.1.1) > memory leakage in org.apache.kafka.common.network.Selector > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 1.1.0 > Reporter: Yu Yang > Priority: Major > Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: We are testing secured writing to kafka through ssl. Testing at small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writes concurrently), the kafka brokers soon hit OutOfMemory issue with 4G memory setting. We have tried with increasing the heap size to 10Gb, but encountered the same issue. We took a few heap dump , and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector object. There are two Channel maps field in Selector. It seems that somehow the objects is not deleted from the map in a timely manner. One observation is that the memory leak seems related to kafka partition leader changes. If there is broker restart etc. in the cluster that caused partition leadership change, the brokers may hit the OOM issue faster. {code} private final Map channels; private final Map closingChannels; {code} Please see the attached images and the following link for sample gc analysis. 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 the command line for running kafka: {code} java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties {code} We use java 1.8.0_102, and has applied a TLS patch on reducing X509Factory.certCache map size from 750 to 20. {code} java -version java version "1.8.0_102" Java(TM) SE Runtime Environment (build 1.8.0_102-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) {code} was: We are testing secured writing to kafka through ssl. Testing at small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writes concurrently), the kafka brokers soon hit OutOfMemory issue with 4G memory setting. We have tried with increasing the heap size to 10Gb, but encountered the same issue. We took a few heap dump , and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector object. There are two Channel maps field in Selector. It seems that somehow the objects is not deleted from the map in a timely manner. 
{code} private final Map channels; private final Map closingChannels; {code} Please see the attached images and the following link for sample gc analysis. http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 the command line for running kafka: {code} java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties {code} We use java 1.8.0_102, and has applied a TLS patch on reducing X509Factory.certCache map size from 750 to 20. {code} java -version java version "1.8.0_102" Java(TM) SE Runtime Environment (build 1.8.0_102-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) {code} > memory leakage in org.apache.kafka.common.network.Selector > -- > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0, 1.1.1 >Reporter: Yu Yang >Priority: Major > Attachments: Screen Shot
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: We are testing secured writing to kafka through ssl. Testing at small scale, ssl writing to kafka was fine. However, when we enabled ssl writing at a larger scale (>40k clients writes concurrently), the kafka brokers soon hit OutOfMemory issue with 4G memory setting. We have tried with increasing the heap size to 10Gb, but encountered the same issue. We took a few heap dump , and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector object. There are two Channel maps field in Selector. It seems that somehow the objects is not deleted from the map in a timely manner. One observation is that the memory leak seems relate to kafka partition leader changes. If there is broker restart etc. in the cluster that caused partition leadership change, the brokers may hit the OOM issue faster. {code} private final Map channels; private final Map closingChannels; {code} Please see the attached images and the following link for sample gc analysis. 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0 The command line for running Kafka: {code} java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.rmi.port= -cp /usr/local/libs/* kafka.Kafka /etc/kafka/server.properties {code} We use Java 1.8.0_102 and have applied a TLS patch that reduces the X509Factory.certCache map size from 750 to 20. {code} java -version java version "1.8.0_102" Java(TM) SE Runtime Environment (build 1.8.0_102-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode) {code}
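The retention pattern described above can be illustrated with a much-simplified, hypothetical model of the two-map bookkeeping (the class and method names below are invented for illustration; the real Selector is far more involved). A channel moves from `channels` to `closingChannels` when a close begins and is only dropped once the close completes, so if close completion is delayed or skipped (for example, during a reconnect storm after a partition-leadership change), entries accumulate and keep their channels reachable:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suspected leak shape, not Kafka's actual code.
class ChannelRegistry {
    final Map<String, Object> channels = new HashMap<>();
    final Map<String, Object> closingChannels = new HashMap<>();

    void connect(String id) {
        channels.put(id, new Object());              // register a live channel
    }

    void beginClose(String id) {
        Object ch = channels.remove(id);
        if (ch != null) closingChannels.put(id, ch); // parked until close completes
    }

    void finishClose(String id) {
        closingChannels.remove(id);                  // only here is memory released
    }

    int retained() {
        return channels.size() + closingChannels.size();
    }
}

public class LeakSketch {
    public static void main(String[] args) {
        ChannelRegistry registry = new ChannelRegistry();
        // 40k clients connect and start disconnecting, but no close ever completes,
        // so every channel object stays reachable through closingChannels.
        for (int i = 0; i < 40_000; i++) {
            String id = "client-" + i;
            registry.connect(id);
            registry.beginClose(id);
        }
        System.out.println(registry.retained()); // prints 40000
    }
}
```

In a heap dump of such a process, all 40k parked objects would show up as retained via the registry, matching the symptom of heap memory being referenced through the Selector's maps.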
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-08-17 at 1.05.30 AM.png > memory leakage in org.apache.kafka.common.network.Selector > > > Key: KAFKA-7304 > URL: https://issues.apache.org/jira/browse/KAFKA-7304 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 1.1.0, 1.1.1 > Reporter: Yu Yang > Priority: Major > Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-08-17 at 1.04.32 AM.png
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Attachment: Screen Shot 2018-08-17 at 1.03.35 AM.png
[jira] [Commented] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565 ] Yu Yang commented on KAFKA-7304: [~yuzhih...@gmail.com] There were no exceptions in server.log before we hit frequent full GCs. There were various errors in the log after the broker ran into full GC, but I think those exceptions are not relevant to the root cause.
[jira] [Comment Edited] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583565#comment-16583565 ] Yu Yang edited comment on KAFKA-7304 at 8/17/18 8:01 AM: - [~yuzhih...@gmail.com] There were no exceptions in server.log before the broker hit frequent full GCs. There were various errors in the log after the broker ran into full GC, but I think those exceptions are not relevant to the root cause.
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: (see the full description above)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: (see the full description above)
[jira] [Updated] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated KAFKA-7304: --- Description: (see the full description above)
[jira] [Created] (KAFKA-7304) memory leakage in org.apache.kafka.common.network.Selector
Yu Yang created KAFKA-7304: -- Summary: memory leakage in org.apache.kafka.common.network.Selector Key: KAFKA-7304 URL: https://issues.apache.org/jira/browse/KAFKA-7304 Project: Kafka Issue Type: Bug Components: core Affects Versions: 1.1.1, 1.1.0 Reporter: Yu Yang Attachments: Screen Shot 2018-08-16 at 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png (original description: see the full description above)
[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0
[ https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7229:
---
    Priority: Major  (was: Critical)

> Failed to dynamically update kafka certificate in kafka 2.0.0
> --------------------------------------------------------------
>
>                 Key: KAFKA-7229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7229
>             Project: Kafka
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 14.04.5 LTS
>            Reporter: Yu Yang
>            Priority: Major
[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0
[ https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7229:
---
    Priority: Critical  (was: Major)
[jira] [Updated] (KAFKA-7229) Failed to dynamically update kafka certificate in kafka 2.0.0
[ https://issues.apache.org/jira/browse/KAFKA-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-7229:
---
    Description:

In kafka 1.1, we used the following command in a cron job to dynamically update the certificate that kafka uses:

{code}
kafka-configs.sh --bootstrap-server localhost:9093 \
  --command-config /var/pinterest/kafka/client.properties \
  --alter --add-config listener.name.ssl.ssl.keystore.location=/var/certs/kafka/kafka.keystore.jks.1533141082.38 \
  --entity-type brokers --entity-name 9
{code}

In kafka 2.0.0, the same command fails with the following exception:

{code}
[2018-08-01 16:38:01,480] ERROR [AdminClient clientId=adminclient-1] Connection to node -1 failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
Error while executing config command with args '--bootstrap-server localhost:9093 --command-config /var/pinterest/kafka/client.properties --alter --add-config listener.name.ssl.ssl.keystore.location=/var/pinterest/kafka/kafka.keystore.jks.1533141082.38 --entity-type brokers --entity-name 9'
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
    at kafka.admin.ConfigCommand$.brokerConfig(ConfigCommand.scala:346)
    at kafka.admin.ConfigCommand$.alterBrokerConfig(ConfigCommand.scala:304)
    at kafka.admin.ConfigCommand$.processBrokerConfig(ConfigCommand.scala:290)
    at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:83)
    at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
    at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478)
    at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
    at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
    at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
    at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
    at org.apache.kafka.common.network.SslTransportLayer.handshakeWrap(SslTransportLayer.java:439)
    at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:304)
    at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:258)
    at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:487)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1116)
    at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:966)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:963)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416)
    at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:393)
    at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:473)
    at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:331)
    ... 7 more
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching localhost found.
    at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:204)
    at sun.security.util.HostnameChecker.match(HostnameChecker.java:95)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
{code}
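The root cause in the trace ("No subject alternative DNS name matching localhost found") stems from Kafka 2.0.0 enabling server hostname verification on clients by default (`ssl.endpoint.identification.algorithm` now defaults to `https`). A sketch of a `client.properties` for kafka-configs.sh that restores the pre-2.0 behavior is below; the truststore path and password are illustrative placeholders, not values from the report:

{code}
# client.properties sketch for kafka-configs.sh --command-config (illustrative)
security.protocol=SSL
ssl.truststore.location=/var/certs/kafka/kafka.truststore.jks
ssl.truststore.password=<truststore-password>
# Kafka 2.0.0 verifies the broker hostname against the certificate SANs by
# default. Setting this to an empty value disables hostname verification,
# matching the 1.1 behavior when the cert has no SAN entry for "localhost".
ssl.endpoint.identification.algorithm=
{code}

Adding a SAN entry for the hostname used in --bootstrap-server is the safer long-term fix; disabling verification only works around the handshake failure.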
[jira] [Commented] (KAFKA-5886) Introduce delivery.timeout.ms producer config (KIP-91)
[ https://issues.apache.org/jira/browse/KAFKA-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540741#comment-16540741 ]

Yu Yang commented on KAFKA-5886:
---

[~ashsskum] The pull request [https://github.com/apache/kafka/pull/5270] is currently under review. [~becket_qin], [~guozhang], can you help assign the ticket to me?

> Introduce delivery.timeout.ms producer config (KIP-91)
> ------------------------------------------------------
>
>                 Key: KAFKA-5886
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5886
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer
>            Reporter: Sumant Tambe
>            Assignee: Sumant Tambe
>            Priority: Major
>
> We propose adding a new timeout, delivery.timeout.ms. The window of enforcement includes batching in the accumulator, retries, and the in-flight segments of the batch. With this config, the user has a guaranteed upper bound on when a record will either get sent, fail, or expire, measured from the point when send returns. In other words, we no longer overload request.timeout.ms to act as a weak proxy for an accumulator timeout, and instead introduce an explicit timeout that users can rely on without exposing producer internals such as the accumulator.
> See [KIP-91|https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer] for more details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
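The timeout relationship KIP-91 describes can be sketched as a producer configuration; the values below are illustrative, not recommendations:

{code}
# Producer configuration sketch (illustrative values).
# delivery.timeout.ms bounds the total time from send() returning until the
# record is acknowledged or fails: time spent batching in the accumulator
# (at least linger.ms), plus retries and in-flight requests (each bounded
# by request.timeout.ms).
linger.ms=5
request.timeout.ms=30000
# Kafka requires delivery.timeout.ms >= linger.ms + request.timeout.ms.
delivery.timeout.ms=120000
{code}

With this in place, retries can effectively be left at the default (very large) value, since delivery.timeout.ms becomes the real bound on record expiry.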
[jira] [Updated] (KAFKA-6544) kafka process should exit when it encounters "java.io.IOException: Too many open files"
[ https://issues.apache.org/jira/browse/KAFKA-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Yang updated KAFKA-6544:
---
    Description:

Our kafka cluster encountered a few disk/xfs failures in the cloud vm instances. When a disk/xfs failure happens, the kafka process did not exit gracefully. Instead, it ran into "" status, with port 9092 still reachable. When failures like this happen, kafka should shut down all threads and exit.

The following is the kafka log when the failure happens:

{code:java}
[2018-02-08 12:52:31,764] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at kafka.network.Acceptor.accept(SocketServer.scala:340)
    at kafka.network.Acceptor.run(SocketServer.scala:283)
    at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at kafka.network.Acceptor.accept(SocketServer.scala:340)
    at kafka.network.Acceptor.run(SocketServer.scala:283)
    at java.lang.Thread.run(Thread.java:748)
[2018-02-08 12:52:31,772] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at kafka.network.Acceptor.accept(SocketServer.scala:340)
    at kafka.network.Acceptor.run(SocketServer.scala:283)
    at java.lang.Thread.run(Thread.java:748)
{code}

> kafka process should exit when it encounters "java.io.IOException: Too many open files"
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6544
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6544
>             Project: Kafka
>          Issue Type: Bug
>          Components: admin, network
>    Affects Versions: 0.10.2.1
>            Reporter: Yu Yang
>            Priority: Major
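The behavior the report asks for ("kafka should shut down all threads and exit") amounts to classifying certain accept-loop IOExceptions as unrecoverable. A minimal sketch of that idea follows; it is illustrative and not Kafka's actual implementation, and the class and method names are hypothetical:

```java
import java.io.IOException;

// Sketch: classify accept-loop IOExceptions so that unrecoverable ones
// (file-descriptor exhaustion) trigger a process exit instead of an
// endless loop of ERROR log lines with the port still bound.
public class FatalIoCheck {
    // "Too many open files" is the EMFILE errno message the JDK surfaces.
    static boolean isFatal(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Too many open files");
    }

    static void handleAcceptError(IOException e) {
        if (isFatal(e)) {
            System.err.println("FATAL: " + e.getMessage() + ", shutting down");
            // A real broker would attempt an orderly shutdown first;
            // halt(1) guarantees the process actually exits afterwards.
            Runtime.getRuntime().halt(1);
        }
        // otherwise: log and keep accepting connections
    }

    public static void main(String[] args) {
        System.out.println(isFatal(new IOException("Too many open files")));
        System.out.println(isFatal(new IOException("Connection reset by peer")));
    }
}
```

The key design choice is halting rather than calling System.exit(), since shutdown hooks themselves can deadlock or fail when the process can no longer open files.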
[jira] [Commented] (KAFKA-6544) kafka process should exit when it encounters "java.io.IOException: Too many open files"
[ https://issues.apache.org/jira/browse/KAFKA-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357483#comment-16357483 ]

Yu Yang commented on KAFKA-6544:
---

[~cmccabe] The kafka process is in `` status. `sudo ls -l /proc/$kafka_pid/fd` returns 0. I am also including "netstat -pnt" output here. Connections are either in ESTABLISHED or CLOSE_WAIT status.

{code}
[proc/30413/fd]# sudo ls -l /proc/30413/fd
total 0
{code}

{code}
netstat -pnt | grep "10.1.160.124:9092" | wc
    116     812   11252
{code}

{code}
netstat -pnt | grep "10.1.160.124:9092"
tcp   29   0  10.1.160.124:9092   10.1.25.241:55616    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:58624    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.9.121:33894     CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:53886    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:43122    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:50766    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.26.165:34282    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.79.149:47682    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.163.135:44008   CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.66.116:52398    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.64.116:36656    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.207.247:51904   CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.9.16:45942      CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.131.15:57118    CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:55974    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.214.5:33040     CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:33494    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.201.139:60230   CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.207.247:51792   CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:42858    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:44246    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.194.26:42406    CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:32902    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.169.94:35532    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.193.101:48832   CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.204.225:60946   CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:35772    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:46972    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:56226    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:46432    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:44436    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:4        ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:47364    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:44908    ESTABLISHED  -
tcp   29   0  10.1.160.124:9092   10.1.25.241:43060    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.10.15:39282     CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.181.86:55500    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.17.191:32812    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.141.30:52024    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.76.141:51366    CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:50940    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092   10.1.11.196:44064    CLOSE_WAIT   -
tcp   65   0  10.1.160.124:9092   10.1.143.107:37116   CLOSE_WAIT   -
tcp   29   0  10.1.160.124:9092   10.1.25.241:37416    ESTABLISHED  -
tcp   65   0  10.1.160.124:9092
{code}
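A quick way to quantify the CLOSE_WAIT pile-up shown above is to tally connections by TCP state. The sketch below is self-contained (it uses sample lines taken from the output above); on a live broker you would feed it `netstat -pnt | grep ':9092'` instead:

```shell
# Count connections by TCP state. Sample data from the report; replace the
# variable with live output: netstat_output=$(netstat -pnt | grep ':9092')
netstat_output='tcp 29 0 10.1.160.124:9092 10.1.25.241:55616 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.9.121:33894 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.26.165:34282 CLOSE_WAIT -'

# Field 6 of netstat -pnt output is the connection state.
# Prints each state with its count (CLOSE_WAIT: 2, ESTABLISHED: 1 here).
printf '%s\n' "$netstat_output" | awk '{print $6}' | sort | uniq -c | sort -rn
```

A large, growing CLOSE_WAIT count like the one in this comment means the application has stopped calling close() on its sockets, which is consistent with the broker's threads being wedged after fd exhaustion.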