[jira] [Created] (KAFKA-1628) [New Java Producer] Topic name containing "." is not mapped to a correct metric name
Bhavesh Mistry created KAFKA-1628:
-------------------------------------

             Summary: [New Java Producer] Topic name containing "." is not mapped to a correct metric name
                 Key: KAFKA-1628
                 URL: https://issues.apache.org/jira/browse/KAFKA-1628
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 0.8.2
         Environment: ALL
            Reporter: Bhavesh Mistry
            Priority: Minor

Hmm, it seems that we do allow "." in the topic name; the topic name just can't be "." or "..". So, if there is a topic "test.1", we end up with the following JMX metric name:

kafka.producer.console-producer.topic.test:type=1

It should instead be:

kafka.producer.console-producer.topic:type=test.1

Could you file a jira to follow up on this?

Thanks,
Jun
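For illustration, here is a minimal sketch of why a "." in the topic name makes a dot-separated JMX name ambiguous. The name layout below is an assumption for demonstration purposes, not the producer's actual metrics code:

{code:title=MetricNameDemo.java}
// Illustrative sketch only -- the name layout here is an assumption, not the
// producer's actual metric code. It shows why a topic containing "." becomes
// ambiguous once embedded in a dot-separated JMX name.
public class MetricNameDemo {

    static String jmxName(String clientId, String topic) {
        return "kafka.producer." + clientId + ".topic." + topic;
    }

    public static void main(String[] args) {
        // Prints kafka.producer.console-producer.topic.test.1 -- a reader that
        // splits on "." sees topic "test" with suffix "1", which is how the bad
        // name kafka.producer.console-producer.topic.test:type=1 arises.
        System.out.println(jmxName("console-producer", "test.1"));
    }
}
{code}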
[jira] [Created] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
Bhavesh Mistry created KAFKA-1642:
-------------------------------------

             Summary: [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
                 Key: KAFKA-1642
                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
             Project: Kafka
          Issue Type: Bug
          Components: producer
    Affects Versions: 0.8.2
            Reporter: Bhavesh Mistry
            Assignee: Jun Rao

I see my CPU spike to 100% when the network connection is lost for a while. It seems the network I/O threads are very busy logging the following error message. Is this expected behavior?

2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
java.lang.IllegalStateException: No entry found for node -2
	at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
	at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
	at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
	at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
	at java.lang.Thread.run(Thread.java:744)

Thanks,
Bhavesh
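The usual way to keep such a retry loop from pinning a core is to back off between failed attempts. Below is a minimal sketch of that idea, with an assumed doubling backoff capped at 10 seconds; it is illustrative only, not the actual NetworkClient fix:

{code:title=BackoffDemo.java}
// Illustrative backoff loop; not the actual Kafka fix. Sleeping between
// failed attempts turns a tight error loop into periodic retries.
public class BackoffDemo {

    public static void main(String[] args) throws InterruptedException {
        long backoffMs = 10;             // assumed starting backoff
        final long maxBackoffMs = 10000; // assumed cap
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                // Stand-in for the failing connection attempt seen above.
                throw new IllegalStateException("No entry found for node -2");
            } catch (IllegalStateException e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(backoffMs); // without this, the loop spins at 100% CPU
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }
}
{code}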
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148121#comment-14148121 ]

Bhavesh Mistry edited comment on KAFKA-1642 at 9/25/14 6:42 PM:
----------------------------------------------------------------

Hi [~jkreps],

I will work on the sample program. We are not setting the reconnect.backoff.ms and retry.backoff.ms configurations, so the defaults are in effect. The only other thing I can tell you is that we have 4 producer instances per JVM, which might amplify the issue.

Thanks,
Bhavesh
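For reference, both settings mentioned above can be raised explicitly rather than left at their defaults; a minimal sketch follows, where the broker address and backoff values are placeholders:

{code:title=BackoffConfigDemo.java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class BackoffConfigDemo {

    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.put("bootstrap.servers", "broker1:9092"); // placeholder
        // Larger backoffs slow down the reconnect/retry loops
        // when the brokers are unreachable.
        prop.put("reconnect.backoff.ms", "1000");      // placeholder value
        prop.put("retry.backoff.ms", "500");           // placeholder value
        Producer producer = new KafkaProducer(prop);
        producer.close();
    }
}
{code}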
[jira] [Created] (KAFKA-1692) [Java New Producer] IO Thread Name Must include Client ID
Bhavesh Mistry created KAFKA-1692:
-------------------------------------

             Summary: [Java New Producer] IO Thread Name Must include Client ID
                 Key: KAFKA-1692
                 URL: https://issues.apache.org/jira/browse/KAFKA-1692
             Project: Kafka
          Issue Type: Improvement
          Components: producer
    Affects Versions: 0.8.2
            Reporter: Bhavesh Mistry
            Assignee: Jun Rao
            Priority: Trivial

Please add the client id to the I/O thread name so people looking at JConsole or a profiling tool can identify threads by client id, since a single JVM can have multiple producer instances. For example, in org.apache.kafka.clients.producer.KafkaProducer:

{code}
String ioThreadName = "kafka-producer-network-thread";
if (clientId != null) {
    ioThreadName = ioThreadName + " | " + clientId;
}
this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
{code}
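With the suggested change in place, setting client.id in the producer configuration would make each I/O thread identifiable; a minimal sketch follows, where the broker address and client id are placeholders:

{code:title=ClientIdDemo.java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class ClientIdDemo {

    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.put("bootstrap.servers", "broker1:9092"); // placeholder
        prop.put("client.id", "orders-producer-1");    // placeholder id
        Producer producer = new KafkaProducer(prop);
        // With the snippet above applied, JConsole would show the I/O thread as
        // "kafka-producer-network-thread | orders-producer-1" instead of the
        // indistinguishable default name.
        producer.close();
    }
}
{code}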
[jira] [Commented] (KAFKA-1692) [Java New Producer] IO Thread Name Must include Client ID
[ https://issues.apache.org/jira/browse/KAFKA-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164275#comment-14164275 ]

Bhavesh Mistry commented on KAFKA-1692:
---------------------------------------

The description is just a suggestion. Sorry, I could not submit a patch since I have a release next week.

Thanks,
Bhavesh
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169415#comment-14169415 ]

Bhavesh Mistry commented on KAFKA-1642:
---------------------------------------

{code:title=TestNetworkDownProducer.java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestNetworkDownProducer {

    public static void main(String[] args) throws IOException {
        Properties prop = new Properties();
        InputStream propFile = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("kafkaproducer.properties");
        String topic = "test";
        prop.load(propFile);
        System.out.println("Property: " + prop.toString());

        // Build a 256-byte message payload.
        StringBuilder builder = new StringBuilder(1024);
        int msgLength = 256;
        for (int i = 0; i < msgLength; i++)
            builder.append("a");

        int numberOfProducer = 4;
        Producer[] producer = new Producer[numberOfProducer];
        for (int i = 0; i < producer.length; i++) {
            producer[i] = new KafkaProducer(prop);
        }

        Callback callback = new Callback() {
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception != null) {
                    System.err.println("Msg dropped..!");
                    exception.printStackTrace();
                }
            }
        };

        // Send the same record on all producers forever, pausing 10 ms per round.
        ProducerRecord record = new ProducerRecord(topic, builder.toString().getBytes());
        while (true) {
            try {
                for (int i = 0; i < producer.length; i++) {
                    producer[i].send(record, callback);
                }
                Thread.sleep(10);
            } catch (Throwable th) {
                System.err.println("FATAL ");
                th.printStackTrace();
            }
        }
    }
}
{code}

{code:title=kafkaproducer.properties}
# THIS IS FOR NEW PRODUCER API TRUNK. Please see the configuration at
# https://kafka.apache.org/documentation.html#newproducerconfigs

# Broker list
bootstrap.servers=BROKERS HERE...

# Data acks
acks=1

# 128MB of buffer for log lines (including all messages).
buffer.memory=134217728

compression.type=snappy
retries=3

# batch size = buffer.memory / number of partitions, so an in-progress batch
# can be created for each partition.
batch.size=1048576

# 1MiB
max.request.size=1048576

# 2MiB
send.buffer.bytes=2097152

# We do not want to block when the buffer is full, so the application thread
# will not be blocked, but log lines will be dropped...
block.on.buffer.full=false

# Wait...
linger.ms=5000
{code}
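The batch.size comment above encodes a simple division. As a sanity check, assuming a 128-partition topic (the partition count is not stated in this thread):

{code:title=BatchSizeCheck.java}
public class BatchSizeCheck {

    public static void main(String[] args) {
        long bufferMemory = 134217728L; // buffer.memory from the file above (128 MiB)
        int partitions = 128;           // assumed partition count; not stated in the report
        // 134217728 / 128 = 1048576, matching batch.size in the file above.
        System.out.println(bufferMemory / partitions);
    }
}
{code}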
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169453#comment-14169453 ]

Bhavesh Mistry commented on KAFKA-1642:
---------------------------------------

[~jkreps] Let me know if you need any other help!

Thanks,
Bhavesh
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169415#comment-14169415 ]

Bhavesh Mistry edited comment on KAFKA-1642 at 10/13/14 5:05 PM:
-----------------------------------------------------------------

{code:title=TestNetworkDownProducer.java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestNetworkDownProducer {

    static int numberTh = 200;
    static CountDownLatch latch = new CountDownLatch(numberTh);

    public static void main(String[] args) throws IOException, InterruptedException {
        Properties prop = new Properties();
        InputStream propFile = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("kafkaproducer.properties");
        String topic = "test";
        prop.load(propFile);
        System.out.println("Property: " + prop.toString());

        // Build a 256-byte message payload.
        StringBuilder builder = new StringBuilder(1024);
        int msgLength = 256;
        for (int i = 0; i < msgLength; i++)
            builder.append("a");

        int numberOfProducer = 4;
        Producer[] producer = new Producer[numberOfProducer];
        for (int i = 0; i < producer.length; i++) {
            producer[i] = new KafkaProducer(prop);
        }

        // 200 worker threads share the same 4 producer instances.
        ExecutorService service = new ThreadPoolExecutor(numberTh, numberTh,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(numberTh * 2));
        for (int i = 0; i < numberTh; i++) {
            service.execute(new MyProducer(producer, 10, builder.toString(), topic));
        }

        latch.await();
        System.out.println("All Producers done...!");
        for (int i = 0; i < producer.length; i++) {
            producer[i].close();
        }
        service.shutdownNow();
        System.out.println("All done...!");
    }

    static class MyProducer implements Runnable {
        Producer[] producer;
        long maxloops;
        String msg;
        String topic;

        MyProducer(Producer[] list, long maxloops, String msg, String topic) {
            this.producer = list;
            this.maxloops = maxloops;
            this.msg = msg;
            this.topic = topic;
        }

        public void run() {
            ProducerRecord record = new ProducerRecord(topic, msg.getBytes());
            Callback callBack = new MyCallback();
            try {
                for (long j = 0; j < maxloops; j++) {
                    try {
                        for (int i = 0; i < producer.length; i++) {
                            producer[i].send(record, callBack);
                        }
                        Thread.sleep(10);
                    } catch (Throwable th) {
                        System.err.println("FATAL ");
                        th.printStackTrace();
                    }
                }
            } finally {
                latch.countDown();
            }
        }
    }

    static class MyCallback implements Callback {
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                System.err.println("Msg dropped..!");
                exception.printStackTrace();
            }
        }
    }
}
{code}

This is the property file used (same as the one posted earlier in this thread):

{code:title=kafkaproducer.properties}
# THIS IS FOR NEW PRODUCER API TRUNK. Please see the configuration at
# https://kafka.apache.org/documentation.html#newproducerconfigs

# Broker list
bootstrap.servers=BROKERS HERE...

# Data acks
acks=1

# 128MB of buffer for log lines (including all messages).
buffer.memory=134217728

compression.type=snappy
retries=3

# batch size = buffer.memory / number of partitions, so an in-progress batch
# can be created for each partition.
batch.size=1048576

# 1MiB
max.request.size=1048576

# 2MiB
send.buffer.bytes=2097152

# We do not want to block when the buffer is full, so the application thread
# will not be blocked, but log lines will be dropped...
block.on.buffer.full=false

# Wait...
linger.ms=5000
{code}
[jira] [Created] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
Bhavesh Mistry created KAFKA-1710:
-------------------------------------

             Summary: [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
                 Key: KAFKA-1710
                 URL: https://issues.apache.org/jira/browse/KAFKA-1710
             Project: Kafka
          Issue Type: Bug
          Components: producer
         Environment: Development
            Reporter: Bhavesh Mistry
            Assignee: Jun Rao
            Priority: Critical

Hi Kafka Dev Team,

When I run a test that sends messages to a single partition for about 3 minutes, I encounter a potential deadlock (please see the attached screenshots) and thread contention in YourKit profiling.

Use case:
1) Aggregating messages into the same partition for metric counting.
2) Replicating the old producer's behavior of sticking to a partition for 3 minutes.

Here is the output:

Frozen threads found (potential deadlock)

It seems that the following threads have not changed their stack for more than 10 seconds.
These threads are possibly (but not necessarily!) in a deadlock or hung.

pool-1-thread-128 <--- Frozen for at least 2m
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-159 <--- Frozen for at least 2m 1 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-55 <--- Frozen for at least 2m
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

Thanks,
Bhavesh
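A simplified illustration of the contention pattern these stacks suggest: every sender thread targeting the same partition funnels through one per-partition lock, so a single hot partition serializes all of them. The sketch below models that pattern only and is not the actual RecordAccumulator implementation:

{code:title=SinglePartitionContention.java}
// Illustrative only -- not the actual RecordAccumulator code. Models the
// contention pattern: one lock per partition, all senders hitting one partition.
public class SinglePartitionContention {

    private final Object partitionLock = new Object();
    private long appended = 0;

    void append(byte[] record) {
        synchronized (partitionLock) { // all 200 threads queue up here
            appended++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final SinglePartitionContention acc = new SinglePartitionContention();
        Thread[] threads = new Thread[200]; // mirrors the 200-thread repro above
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 10000; j++) {
                        acc.append(new byte[256]);
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("appended " + acc.appended + " records");
    }
}
{code}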
[jira] [Updated] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavesh Mistry updated KAFKA-1710:
----------------------------------
    Attachment: TestNetworkDownProducer.java

Java test program to reproduce this issue.
[jira] [Updated] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavesh Mistry updated KAFKA-1710:
----------------------------------
    Attachment: Screen Shot 2014-10-15 at 9.09.06 PM.png
                Screen Shot 2014-10-13 at 10.19.04 AM.png

YourKit thread view showing thread contention.
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173330#comment-14173330 ]

Bhavesh Mistry commented on KAFKA-1710:
---------------------------------------

Here is the output of YourKit:

{code}
Frozen threads found (potential deadlock)

It seems that the following threads have not changed their stack for more than 10 seconds.
These threads are possibly (but not necessarily!) in a deadlock or hung.

kafka-producer-network-thread <--- Frozen for at least 14 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.ready(Cluster, long) RecordAccumulator.java:214
org.apache.kafka.clients.producer.internals.Sender.run(long) Sender.java:147
org.apache.kafka.clients.producer.internals.Sender.run() Sender.java:115
java.lang.Thread.run() Thread.java:744

pool-1-thread-106 <--- Frozen for at least 20 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-15 <--- Frozen for at least 13 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-161 <--- Frozen for at least 13 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-165 <--- Frozen for at least 17 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-172 <--- Frozen for at least 20 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-184 <--- Frozen for at least 11 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-26 <--- Frozen for at least 11 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744
{code}
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages are being sent to a single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173332#comment-14173332 ]

Bhavesh Mistry commented on KAFKA-1710:
---------------------------------------

More output:

{code}
Frozen threads found (potential deadlock)

It seems that the following threads have not changed their stack for more than 10 seconds.
These threads are possibly (but not necessarily!) in a deadlock or hung.

pool-1-thread-108 <--- Frozen for at least 12 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-113 <--- Frozen for at least 13 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-118 <--- Frozen for at least 16 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-138 <--- Frozen for at least 12 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-151 <--- Frozen for at least 22 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-155 <--- Frozen for at least 13 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-160 <--- Frozen for at least 13 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744

pool-1-thread-163 <--- Frozen for at least 12 sec
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:238
org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:85
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
java.lang.Thread.run() Thread.java:744
{code}
[jira] [Updated] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry updated KAFKA-1710:
--
Attachment: Screen Shot 2014-10-15 at 9.14.15 PM.png
YourKit monitor screenshot.

> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is
> being sent to single partition
>
> Key: KAFKA-1710
> URL: https://issues.apache.org/jira/browse/KAFKA-1710
> Project: Kafka
> Issue Type: Bug
> Components: producer
> Environment: Development
> Reporter: Bhavesh Mistry
> Assignee: Jun Rao
> Priority: Critical
> Labels: performance
> Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, TestNetworkDownProducer.java
>
> Hi Kafka Dev Team,
> When I run the test that sends messages to a single partition for 3 minutes or so, I encounter a deadlock (please see the attached screenshots) and thread contention in YourKit profiling.
> Use case:
> 1) Aggregating messages into the same partition for metric counting.
> 2) Replicating the old producer's behavior of sticking to a partition for 3 minutes.
> Here is the output:
> Frozen threads found (potential deadlock)
> It seems that the following threads have not changed their stack for more than 10 seconds.
> These threads are possibly (but not necessarily!) in a deadlock or hung.
> pool-1-thread-128 <--- Frozen for at least 2m
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
> pool-1-thread-159 <--- Frozen for at least 2m 1 sec (same stack)
> pool-1-thread-55 <--- Frozen for at least 2m (same stack)
> Thanks,
> Bhavesh

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
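Use case 2 in the quoted description (pinning all traffic to one partition for a time window, like the old producer) is exactly what funnels every sender thread onto a single partition's lock. A hypothetical partitioner along those lines, for illustration only:
{code}
// Hypothetical time-window "sticky" partitioner: every record maps to the
// same partition for WINDOW_MS, then hops to the next one. Within a window,
// all producer threads therefore contend on one partition's deque.
public class StickyWindowPartitioner {
    private static final long WINDOW_MS = 3 * 60 * 1000L; // 3 minutes, as in the use case

    public int partition(int numPartitions) {
        long window = System.currentTimeMillis() / WINDOW_MS;
        return (int) (window % numPartitions);
    }
}
{code}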
[jira] [Updated] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry updated KAFKA-1710:
--
Attachment: th15.dump th14.dump th13.dump th12.dump th11.dump th10.dump th9.dump th8.dump th7.dump th6.dump th5.dump th4.dump th3.dump th2.dump th1.dump
JStack thread dumps.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173344#comment-14173344 ] Bhavesh Mistry commented on KAFKA-1710:
---
I am not able to attach the YourKit profiler snapshot. I get the following error:
TestNetworkDownProducer-2014-10-15-2.snapshot is too large to attach. Attachment is 28.19 MB but the largest allowed attachment is 10.00 MB.
Thanks,
Bhavesh

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173344#comment-14173344 ] Bhavesh Mistry edited comment on KAFKA-1710 at 10/16/14 4:40 AM:
---
[~jkreps] and [~junrao], I am not able to attach the YourKit profiler snapshot, so I have uploaded it to GitHub: https://github.com/bmistry13/kafka-trunk-producer/blob/master/TestNetworkDownProducer-2014-10-15-3.snapshot
Let me know if you need more details.
Thanks,
Bhavesh

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173368#comment-14173368 ] Bhavesh Mistry edited comment on KAFKA-1710 at 10/16/14 4:54 AM:
---
Here is the property file used for testing:
{code}
# THIS IS FOR THE NEW PRODUCER API (TRUNK). Please see the configuration at
# https://kafka.apache.org/documentation.html#newproducerconfigs

# Broker list
bootstrap.servers=[list here]

# Data acks
acks=0

# 128MB of buffer for log lines (including all messages).
buffer.memory=134217728

compression.type=snappy
retries=3

# DEFAULT FROM THE KAFKA...
# batch size = (buffer.memory) / (number of partitions), so we can have an
# in-progress batch created for each partition.
batch.size=1048576

max.request.size=1048576

# 2MiB
send.buffer.bytes=2097152

# We do not want to block when the buffer is full, so the application thread
# will not be blocked, but log lines will be dropped...
block.on.buffer.full=false

# wait...
linger.ms=360
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
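For reference, a config file like the one above would be loaded straight into the new producer. A minimal sketch (file name and topic are assumed; on newer clients, key/value serializer classes must be configured as well):
{code}
import java.io.FileInputStream;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerFromProps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream("producer.properties"); // assumed path
        try {
            props.load(in);
        } finally {
            in.close();
        }
        // raw byte[] payloads, matching the trunk API under test
        KafkaProducer producer = new KafkaProducer(props);
        producer.send(new ProducerRecord("test-topic", "hello".getBytes()));
        producer.close();
    }
}
{code}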
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174403#comment-14174403 ] Bhavesh Mistry commented on KAFKA-1710:
---
[~ewencp],

Thanks for looking into this issue. We consume as fast as we can and re-publish the messages to another aggregated topic based on some keys in the message. We saw the thread contention in a profiling tool, and I separated out the code to amplify the problem. We run with about 75 threads. [~ewencp], can you please discuss this issue with the Kafka community as well? The deadlock will occur sometimes, depending on thread scheduling and how long the threads stay blocked. All I am asking is whether there is a better way to enqueue incoming messages. I just proposed the simple solution above: it does not impact application threads, only the drain threads will be blocked, and with the buffer, as you mentioned, we might get better throughput (of course at the expense of buffered memory (an unbounded concurrent queue) and thread context switching). If you feel this is a known performance issue when sending to a single partition, then please close this, and you may start a discussion in the Kafka community about it. Thanks for your help and suggestions!!

According to the thread dumps, the threads block in the synchronization block:

{code}
"pool-1-thread-200" prio=5 tid=0x7f92451c2000 nid=0x20103 waiting for monitor entry [0x00012d228000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.kafka.clients.producer.internals.RecordAccumulator.append(RecordAccumulator.java:139)
        - waiting to lock <0x000703ce39f0> (a java.util.ArrayDeque)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:238)
        at org.kafka.test.TestNetworkDownProducer$MyProducer.run(TestNetworkDownProducer.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"pool-1-thread-199" prio=5 tid=0x7f92451c1800 nid=0x1ff03 waiting for monitor entry [0x00012d0e5000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.kafka.clients.producer.internals.RecordAccumulator.append(RecordAccumulator.java:139)
        - waiting to lock <0x000703ce39f0> (a java.util.ArrayDeque)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:238)
        at org.kafka.test.TestNetworkDownProducer$MyProducer.run(TestNetworkDownProducer.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{code}

> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is > being sent to single partition > > > Key: KAFKA-1710 > URL: https://issues.apache.org/jira/browse/KAFKA-1710 > Project: Kafka > Issue Type: Bug > Components: producer > Environment: Development >Reporter: Bhavesh Mistry >Priority: Critical > Labels: performance > Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot > 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, > TestNetworkDownProducer.java, th1.dump, th10.dump, th11.dump, th12.dump, > th13.dump, th14.dump, th15.dump, th2.dump, th3.dump, th4.dump, th5.dump, > th6.dump, th7.dump, th8.dump, th9.dump > > > Hi Kafka Dev Team, > When I run the test to send message to single partition for 3 minutes or so > on, I have encounter deadlock (please see the screen attached) and thread > 
contention from YourKit profiling. > Use Case: > 1) Aggregating messages into same partition for metric counting. > 2) Replicate Old Producer behavior for sticking to partition for 3 minutes. > Here is output: > Frozen threads found (potential deadlock) > > It seems that the following threads have not changed their stack for more > than 10 seconds. > These threads are possibly (but not necessarily!) in a deadlock or hung. > > pool-1-thread-128 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurren
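To make the contention shape described above concrete, here is a standalone sketch (not the attached TestNetworkDownProducer; all sizes and thread counts are arbitrary) in which 75 threads funnel through one monitor, the same way RecordAccumulator.append() serializes senders on a single partition's ArrayDeque:

{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SinglePartitionContentionDemo {

    // stand-in for the one ArrayDeque behind the single target partition
    private static final Deque<byte[]> dq = new ArrayDeque<byte[]>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(75);
        long start = System.nanoTime();
        for (int t = 0; t < 75; t++) {
            pool.execute(new Runnable() {
                public void run() {
                    byte[] payload = new byte[256];
                    for (int i = 0; i < 5000; i++) {
                        // analogous to synchronized (dq) in RecordAccumulator.append():
                        // with one partition there is one lock, so all 75 threads queue here
                        synchronized (dq) {
                            dq.addLast(payload);
                            if (dq.size() > 10000) {
                                dq.clear(); // stand-in for the sender draining a batch
                            }
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1000000);
    }
}
{code}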
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174462#comment-14174462 ] Bhavesh Mistry commented on KAFKA-1642:
---
[~jkreps],

Did you get a chance to reproduce the problem? Has anyone else reported this or a similar issue?

Thanks,

Bhavesh

> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Jun Rao > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
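For reference, the usual guard against this kind of hot loop is to throttle reconnect attempts per node with a capped exponential backoff, so a dead network costs a bounded number of attempts instead of a spinning I/O thread. This is only an illustrative sketch with hypothetical names, not the change that fixed KAFKA-1642 (the producer already had a reconnect.backoff.ms setting; the spin here came from the uncaught IllegalStateException):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-node reconnect throttle with capped exponential backoff. */
public class ReconnectBackoff {
    private final long baseMs;
    private final long maxMs;
    private final Map<Integer, Long> nextAttemptAt = new ConcurrentHashMap<Integer, Long>();
    private final Map<Integer, Long> currentDelay = new ConcurrentHashMap<Integer, Long>();

    public ReconnectBackoff(long baseMs, long maxMs) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    /** Called before initiating a connection; false means "too soon, skip this node". */
    public boolean mayConnect(int nodeId, long nowMs) {
        Long next = nextAttemptAt.get(nodeId);
        return next == null || nowMs >= next;
    }

    /** Called when a connect attempt fails: double the delay, up to the cap. */
    public void onFailure(int nodeId, long nowMs) {
        long delay = currentDelay.containsKey(nodeId)
                ? Math.min(currentDelay.get(nodeId) * 2, maxMs)
                : baseMs;
        currentDelay.put(nodeId, delay);
        nextAttemptAt.put(nodeId, nowMs + delay);
    }

    /** Called on success: reset the node's delay. */
    public void onSuccess(int nodeId) {
        currentDelay.remove(nodeId);
        nextAttemptAt.remove(nodeId);
    }
}
{code}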
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175548#comment-14175548 ] Bhavesh Mistry commented on KAFKA-1710:
---
[~ewencp],

Thank you for entertaining this issue; you may close it. I do agree with you that if I increase the number of producers, the thread contention on the critical block will be alleviated, at the expense of TCP connections, memory, etc. Do you think it would be good to open another JIRA issue or story for improving performance when sending to a single partition for some time, to avoid the thread contention? Please let me know if I should open one for the performance aspect of the new producer.

Thanks,

Bhavesh

> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is > being sent to single partition > > > Key: KAFKA-1710 > URL: https://issues.apache.org/jira/browse/KAFKA-1710 > Project: Kafka > Issue Type: Bug > Components: producer > Environment: Development >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Critical > Labels: performance > Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot > 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, > TestNetworkDownProducer.java, th1.dump, th10.dump, th11.dump, th12.dump, > th13.dump, th14.dump, th15.dump, th2.dump, th3.dump, th4.dump, th5.dump, > th6.dump, th7.dump, th8.dump, th9.dump > > > Hi Kafka Dev Team, > When I run the test to send message to single partition for 3 minutes or so > on, I have encounter deadlock (please see the screen attached) and thread > contention from YourKit profiling. > Use Case: > 1) Aggregating messages into same partition for metric counting. > 2) Replicate Old Producer behavior for sticking to partition for 3 minutes. > Here is output: > Frozen threads found (potential deadlock) > > It seems that the following threads have not changed their stack for more > than 10 seconds. > These threads are possibly (but not necessarily!) in a deadlock or hung. 
> > pool-1-thread-128 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-159 <--- Frozen for at least 2m 1 sec > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-55 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
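To illustrate the trade-off discussed above (more producer instances alleviate the contention at the cost of extra TCP connections and memory), here is a small hypothetical pool that shards application threads across several producers. The thread-ID routing is an assumption made for this sketch, not something proposed in the ticket:

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Hypothetical pool that shards application threads over N producers. */
public class ProducerPool {
    private final Producer[] producers;

    public ProducerPool(int size, Properties config) {
        producers = new Producer[size];
        for (int i = 0; i < size; i++) {
            producers[i] = new KafkaProducer(config);
        }
    }

    /** Each thread consistently lands on one producer, so only roughly
     *  (threads / size) of them contend on any one accumulator lock. */
    public void send(ProducerRecord record) {
        int idx = (int) (Thread.currentThread().getId() % producers.length);
        producers[idx].send(record);
    }

    public void close() {
        for (Producer p : producers) {
            p.close();
        }
    }
}
{code}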
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175574#comment-14175574 ] Bhavesh Mistry commented on KAFKA-1710:
---
[~jkreps],

My only request is to make the new producer truly async: enqueue the message regardless of the message key's hashcode or the partition number. The new producer is far better than the old Scala producer (I have worked with both the new and old producers/consumers and the entire LinkedIn pipeline), but the new producer inherits the same problem the old one had: thread contention when queuing messages into the buffer. I think the Kafka dev team can do better here, because this use case of aggregating events into a single partition is widely used. My plan is to replace our stream processing framework with Kafka if possible (for aggregation, counting metrics, etc.). We currently use the following stream processor, but it has a lot of downsides and only distributes load, which Kafka brokers already provide. Anyway, this is our use case:

https://github.com/walmartlabs/mupd8
http://vldb.org/pvldb/vol5/p1814_wanglam_vldb2012.pdf

Thanks,

Bhavesh

> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is > being sent to single partition > > > Key: KAFKA-1710 > URL: https://issues.apache.org/jira/browse/KAFKA-1710 > Project: Kafka > Issue Type: Bug > Components: producer > Environment: Development >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Critical > Labels: performance > Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot > 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, > TestNetworkDownProducer.java, th1.dump, th10.dump, th11.dump, th12.dump, > th13.dump, th14.dump, th15.dump, th2.dump, th3.dump, th4.dump, th5.dump, > th6.dump, th7.dump, th8.dump, th9.dump > > > Hi Kafka Dev Team, > When I run the test to send message to single partition for 3 minutes or so > on, I have encounter deadlock (please see the screen attached) and thread > contention from YourKit profiling. > Use Case: > 1) Aggregating messages into same partition for metric counting. > 2) Replicate Old Producer behavior for sticking to partition for 3 minutes. > Here is output: > Frozen threads found (potential deadlock) > > It seems that the following threads have not changed their stack for more > than 10 seconds. > These threads are possibly (but not necessarily!) in a deadlock or hung. 
> > pool-1-thread-128 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-159 <--- Frozen for at least 2m 1 sec > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-55 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1721) Snappy compressor is not thread safe
[ https://issues.apache.org/jira/browse/KAFKA-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178856#comment-14178856 ] Bhavesh Mistry commented on KAFKA-1721:
---
I have filed https://github.com/xerial/snappy-java/issues/88 to track this on the Snappy side. There is a patch provided there, and thanks to [~ewencp] for testing it. Please see the link above for more details.

Thanks,

Bhavesh

> Snappy compressor is not thread safe > > > Key: KAFKA-1721 > URL: https://issues.apache.org/jira/browse/KAFKA-1721 > Project: Kafka > Issue Type: Bug > Components: compression >Reporter: Ewen Cheslack-Postava >Assignee: Ewen Cheslack-Postava > > From the mailing list, it can generate this exception: > 2014-10-20 18:55:21.841 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in > kafka producer I/O thread: > *java.lang.NullPointerException* > at > org.xerial.snappy.BufferRecycler.releaseInputBuffer(BufferRecycler.java:153) > at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:317) > at java.io.FilterOutputStream.close(FilterOutputStream.java:160) > at org.apache.kafka.common.record.Compressor.close(Compressor.java:94) > at > org.apache.kafka.common.record.MemoryRecords.close(MemoryRecords.java:119) > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:285) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > This appears to be an issue with the snappy-java library using ThreadLocal > for an internal buffer recycling object which results in that object being > shared unsafely across threads if one thread sends to multiple producers: > {quote} > I think the issue is that you're > using all your producers across a thread pool and the snappy library > uses ThreadLocal BufferRecyclers. When new Snappy streams are allocated, > they may be allocated from the same thread (e.g. one of your MyProducer > classes calls Producer.send() on multiple producers from the same > thread) and therefore use the same BufferRecycler. Eventually you hit > the code in the stacktrace, and if two producer send threads hit it > concurrently they improperly share the unsynchronized BufferRecycler. > This seems like a pain to fix -- it's really a deficiency of the snappy > library and as far as I can see there's no external control over > BufferRecycler in their API. One possibility is to record the thread ID > when we generate a new stream in Compressor and use that to synchronize > access to ensure no concurrent BufferRecycler access. That could be made > specific to snappy so it wouldn't impact other codecs. Not exactly > ideal, but it would work. Unfortunately I can't think of any way for you > to protect against this in your own code since the problem arises in the > producer send thread, which your code should never know about. > Another option would be to setup your producers differently to avoid the > possibility of unsynchronized access from multiple threads (i.e. don't > use the same thread pool approach), but whether you can do that will > depend on your use case. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
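To picture the synchronization idea from the quoted analysis (record the ID of the thread that created a stream, and make all streams created by that thread share one lock, so the creator's ThreadLocal BufferRecycler is never touched by two threads at once), here is an illustrative wrapper. It is a sketch of that idea only, not the fix that went into Kafka or snappy-java:

{code}
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative wrapper: streams remember which thread created them, and all
 * streams created by the same thread share one lock, so two sender threads
 * can never run concurrently inside streams that share that creator's
 * ThreadLocal BufferRecycler.
 */
public class CreatorLockedOutputStream extends OutputStream {
    private static final ConcurrentHashMap<Long, Object> LOCKS =
            new ConcurrentHashMap<Long, Object>();

    private final OutputStream delegate;
    private final Object lock;

    public CreatorLockedOutputStream(OutputStream delegate) {
        this.delegate = delegate;
        long creator = Thread.currentThread().getId();
        Object l = LOCKS.get(creator);
        if (l == null) {
            Object fresh = new Object();
            l = LOCKS.putIfAbsent(creator, fresh);
            if (l == null) l = fresh;
        }
        this.lock = l;
    }

    @Override public void write(int b) throws IOException {
        synchronized (lock) { delegate.write(b); }
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        synchronized (lock) { delegate.write(b, off, len); }
    }

    @Override public void flush() throws IOException {
        synchronized (lock) { delegate.flush(); }
    }

    @Override public void close() throws IOException {
        synchronized (lock) { delegate.close(); }
    }
}
{code}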
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179014#comment-14179014 ] Bhavesh Mistry commented on KAFKA-1710:
---
[~jkreps],

I am sorry I did not get back to you sooner. Enqueueing messages into a single partition cuts per-thread throughput by roughly 54% compared to round-robin over all partitions (tested with 32 partitions on a single topic and a 3-broker cluster). The throughput below measures only the cost of putting data into the buffer, not the cost of sending data to the brokers. Here is the test I have done:

To *single* partition: Throughput per Thread=2666.5 byte(s)/microsecond All done...!

To *all* partitions: Throughput per Thread=5818.181818181818 byte(s)/microsecond All done...!

Here is the test program:

{code}
package org.kafka.test;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestNetworkDownProducer {

    static int numberTh = 75;
    static CountDownLatch latch = new CountDownLatch(numberTh);

    public static void main(String[] args) throws IOException, InterruptedException {
        //Thread.sleep(6);
        Properties prop = new Properties();
        InputStream propFile = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("kafkaproducer.properties");
        String topic = "logmon.test";
        prop.load(propFile);
        System.out.println("Property: " + prop.toString());

        StringBuilder builder = new StringBuilder(1024);
        int msgLenth = 256;
        int numberOfLoop = 5000;
        for (int i = 0; i < msgLenth; i++)
            builder.append("a");

        int numberOfProducer = 1;
        Producer[] producer = new Producer[numberOfProducer];
        for (int i = 0; i < producer.length; i++) {
            producer[i] = new KafkaProducer(prop);
        }

        ExecutorService service = new ThreadPoolExecutor(numberTh, numberTh, 0L,
                TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(numberTh * 2));
        MyProducer[] producerThResult = new MyProducer[numberTh];
        for (int i = 0; i < numberTh; i++) {
            producerThResult[i] = new MyProducer(producer, numberOfLoop, builder.toString(), topic);
            service.execute(producerThResult[i]);
        }
        latch.await();
        for (int i = 0; i < producer.length; i++) {
            producer[i].close();
        }
        service.shutdownNow();
        System.out.println("All Producers done...!");

        // now interpret the result... of this...
        long lowestTime = 0;
        for (int i = 0; i < producerThResult.length; i++) {
            if (i == 1) {
                lowestTime = producerThResult[i].totalTimeinNano;
            } else if (producerThResult[i].totalTimeinNano < lowestTime) {
                lowestTime = producerThResult[i].totalTimeinNano;
            }
        }
        long bytesSend = msgLenth * numberOfLoop;
        long durationInMs = TimeUnit.MILLISECONDS.convert(lowestTime, TimeUnit.NANOSECONDS);
        double throughput = (bytesSend * 1.0) / (durationInMs);
        System.out.println("Throughput per Thread=" + throughput + " byte(s)/microsecond");
        System.out.println("All done...!");
    }

    static class MyProducer implements Callable, Runnable {

        Producer[] producer;
        long maxloops;
        String msg;
        String topic;
        long totalTimeinNano = 0;

        MyProducer(Producer[] list, long maxloops, String msg, String topic) {
            this.producer = list;
            this.maxloops = maxloops;
            this.msg = msg;
            this.topic = topic;
        }

        public void run() {
            // ALWAYS SEND DATA TO PARTITION 1 only...
            //ProducerRecord record = new ProducerRecord(topic, 1, null, msg.toString().getBytes());
            ProducerRecord recor
{code}
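As a quick sanity check of the ~54% figure from the two runs above:

{code}
2666.5 / 5818.181818181818 ≈ 0.458
1 - 0.458 ≈ 0.542
{code}

So the single-partition run reaches about 46% of the all-partition per-thread throughput, a drop of roughly 54%.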
[jira] [Commented] (KAFKA-1481) Stop using dashes AND underscores as separators in MBean names
[ https://issues.apache.org/jira/browse/KAFKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179092#comment-14179092 ] Bhavesh Mistry commented on KAFKA-1481:
---
Hi [~junrao],

Can you please let me know if this will also address the new Java producer's metrics() method when the client.id or topic contains special characters, so that we have consistent naming across all JMX bean names and the metrics() method? Here is the background on this:

{code}
Bhavesh,

Yes, allowing dot in clientId and topic makes it a bit harder to define the JMX bean names. I see a couple of solutions here.

1. Disable dot in clientId and topic names. The issue is that dot may already be used in existing deployment.

2. We can represent the JMX bean name differently in the new producer. Instead of

    kafka.producer.myclientid:type=mytopic

we could change it to

    kafka.producer:clientId=myclientid,topic=mytopic

I felt that option 2 is probably better since it doesn't affect existing users.

Otis,

We probably can also use option 2 to address KAFKA-1481. For topic/clientid specific metrics, we could explicitly specify the metric name so that it contains "topic=mytopic,clientid=myclientid". That seems to be a much cleaner way than having all parts included in a single string separated by '|'.

Thanks,

Jun
{code}

Thanks,

Bhavesh

> Stop using dashes AND underscores as separators in MBean names > -- > > Key: KAFKA-1481 > URL: https://issues.apache.org/jira/browse/KAFKA-1481 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.8.1.1 >Reporter: Otis Gospodnetic >Priority: Critical > Labels: patch > Fix For: 0.8.3 > > Attachments: KAFKA-1481_2014-06-06_13-06-35.patch, > KAFKA-1481_2014-10-13_18-23-35.patch, KAFKA-1481_2014-10-14_21-53-35.patch, > KAFKA-1481_2014-10-15_10-23-35.patch, KAFKA-1481_2014-10-20_23-14-35.patch, > KAFKA-1481_2014-10-21_09-14-35.patch, > KAFKA-1481_IDEA_IDE_2014-10-14_21-53-35.patch, > KAFKA-1481_IDEA_IDE_2014-10-15_10-23-35.patch, > KAFKA-1481_IDEA_IDE_2014-10-20_20-14-35.patch, > KAFKA-1481_IDEA_IDE_2014-10-20_23-14-35.patch > > > MBeans should not use dashes or underscores as separators because these > characters are allowed in hostnames, topics, group and consumer IDs, etc., > and these are embedded in MBeans names making it impossible to parse out > individual bits from MBeans. > Perhaps a pipe character should be used to avoid the conflict. > This looks like a major blocker because it means nobody can write Kafka 0.8.x > monitoring tools unless they are doing it for themselves AND do not use > dashes AND do not use underscores. > See: http://search-hadoop.com/m/4TaT4lonIW -- This message was sent by Atlassian JIRA (v6.3.4#6332)
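Option 2 in the quoted mail works because a JMX ObjectName is a domain plus key=value properties, so a dot inside a topic name stays within its value rather than being read as a separator. A minimal illustration (the client ID and topic values here are just examples):

{code}
import javax.management.ObjectName;

public class JmxNamingDemo {
    public static void main(String[] args) throws Exception {
        // option 2 style: kafka.producer:clientId=...,topic=...
        ObjectName name = new ObjectName(
                "kafka.producer:clientId=console-producer,topic=test.1");
        System.out.println(name.getDomain());             // kafka.producer
        System.out.println(name.getKeyProperty("topic")); // test.1, dot intact
    }
}
{code}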
[jira] [Commented] (KAFKA-1721) Snappy compressor is not thread safe
[ https://issues.apache.org/jira/browse/KAFKA-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181730#comment-14181730 ] Bhavesh Mistry commented on KAFKA-1721:
---
[~ewencp],

Thanks for fixing this issue. The Snappy developers have released a new version with the fix: https://oss.sonatype.org/content/repositories/releases/org/xerial/snappy/snappy-java/1.1.1.4/

Thanks,

Bhavesh

> Snappy compressor is not thread safe > > > Key: KAFKA-1721 > URL: https://issues.apache.org/jira/browse/KAFKA-1721 > Project: Kafka > Issue Type: Bug > Components: compression >Reporter: Ewen Cheslack-Postava >Assignee: Ewen Cheslack-Postava > > From the mailing list, it can generate this exception: > 2014-10-20 18:55:21.841 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in > kafka producer I/O thread: > *java.lang.NullPointerException* > at > org.xerial.snappy.BufferRecycler.releaseInputBuffer(BufferRecycler.java:153) > at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:317) > at java.io.FilterOutputStream.close(FilterOutputStream.java:160) > at org.apache.kafka.common.record.Compressor.close(Compressor.java:94) > at > org.apache.kafka.common.record.MemoryRecords.close(MemoryRecords.java:119) > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:285) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > This appears to be an issue with the snappy-java library using ThreadLocal > for an internal buffer recycling object which results in that object being > shared unsafely across threads if one thread sends to multiple producers: > {quote} > I think the issue is that you're > using all your producers across a thread pool and the snappy library > uses ThreadLocal BufferRecyclers. When new Snappy streams are allocated, > they may be allocated from the same thread (e.g. one of your MyProducer > classes calls Producer.send() on multiple producers from the same > thread) and therefore use the same BufferRecycler. Eventually you hit > the code in the stacktrace, and if two producer send threads hit it > concurrently they improperly share the unsynchronized BufferRecycler. > This seems like a pain to fix -- it's really a deficiency of the snappy > library and as far as I can see there's no external control over > BufferRecycler in their API. One possibility is to record the thread ID > when we generate a new stream in Compressor and use that to synchronize > access to ensure no concurrent BufferRecycler access. That could be made > specific to snappy so it wouldn't impact other codecs. Not exactly > ideal, but it would work. Unfortunately I can't think of any way for you > to protect against this in your own code since the problem arises in the > producer send thread, which your code should never know about. > Another option would be to setup your producers differently to avoid the > possibility of unsynchronized access from multiple threads (i.e. don't > use the same thread pool approach), but whether you can do that will > depend on your use case. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182198#comment-14182198 ] Bhavesh Mistry commented on KAFKA-1710: --- [~jkreps], Sorry to bug you again. Did you get a chance to review the performance numbers above, comparing the per-thread synchronization cost when no partition is set versus when everything goes to a single partition? Thanks, Bhavesh > [New Java Producer Potential Deadlock] Producer Deadlock when all messages is > being sent to single partition > > > Key: KAFKA-1710 > URL: https://issues.apache.org/jira/browse/KAFKA-1710 > Project: Kafka > Issue Type: Bug > Components: producer > Environment: Development >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Critical > Labels: performance > Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot > 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, > TestNetworkDownProducer.java, th1.dump, th10.dump, th11.dump, th12.dump, > th13.dump, th14.dump, th15.dump, th2.dump, th3.dump, th4.dump, th5.dump, > th6.dump, th7.dump, th8.dump, th9.dump > > > Hi Kafka Dev Team, > When I run the test to send message to single partition for 3 minutes or so > on, I have encounter deadlock (please see the screen attached) and thread > contention from YourKit profiling. > Use Case: > 1) Aggregating messages into same partition for metric counting. > 2) Replicate Old Producer behavior for sticking to partition for 3 minutes. > Here is output: > Frozen threads found (potential deadlock) > > It seems that the following threads have not changed their stack for more > than 10 seconds. > These threads are possibly (but not necessarily!) in a deadlock or hung. > > pool-1-thread-128 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-159 <--- Frozen for at least 2m 1 sec > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > pool-1-thread-55 <--- Frozen for at least 2m > org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, > byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139 > org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, > Callback) KafkaProducer.java:237 > org.kafka.test.TestNetworkDownProducer$MyProducer.run() > TestNetworkDownProducer.java:84 > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > ThreadPoolExecutor.java:1145 > 
java.util.concurrent.ThreadPoolExecutor$Worker.run() > ThreadPoolExecutor.java:615 > java.lang.Thread.run() Thread.java:744 > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
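For reference, the two enqueue modes being compared can be sketched as below, assuming a producer and a byte[] payload as in the attached test program (topic name and partition count taken from that test): pinning every record to one partition funnels all application threads through the lock on a single RecordAccumulator deque, while leaving the partition unset lets the default partitioner spread appends across partitions.
{code}
// All 75 sender threads append to partition 0 of the topic: one deque,
// one lock, so the append path serializes across every thread.
producer.send(new ProducerRecord("logmon.test", 0, null, payload));

// Partition left unset: the default partitioner spreads records across
// the 32 partitions, so contention on any single deque is much lower.
producer.send(new ProducerRecord("logmon.test", payload));
{code}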
[jira] [Comment Edited] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179014#comment-14179014 ] Bhavesh Mistry edited comment on KAFKA-1710 at 10/24/14 6:21 PM: - [~jkreps], I am sorry I did not get back to you sooner. Enqueueing messages into a single partition drops throughput by roughly 54% compared to round-robin (tested with 32 partitions on a single topic and a 3-broker cluster). The throughput measures only the cost of putting data into the buffer, not the cost of sending data to the brokers. Here is the test I have done:

To a *single* partition:
Throughput per Thread=2666.5 byte(s)/millisecond
All done...!

To *all* partitions:
Throughput per Thread=5818.181818181818 byte(s)/millisecond
All done...!

Here is the test program for this:
{code}
package org.kafka.test;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestNetworkDownProducer {

    static int numberTh = 75;
    static CountDownLatch latch = new CountDownLatch(numberTh);

    public static void main(String[] args) throws IOException, InterruptedException {
        //Thread.sleep(6);
        Properties prop = new Properties();
        InputStream propFile = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("kafkaproducer.properties");
        String topic = "logmon.test";
        prop.load(propFile);
        System.out.println("Property: " + prop.toString());

        StringBuilder builder = new StringBuilder(1024);
        int msgLength = 256;
        int numberOfLoop = 5000;
        for (int i = 0; i < msgLength; i++)
            builder.append("a");

        int numberOfProducer = 1;
        Producer[] producer = new Producer[numberOfProducer];
        for (int i = 0; i < producer.length; i++) {
            producer[i] = new KafkaProducer(prop);
        }

        ExecutorService service = new ThreadPoolExecutor(numberTh, numberTh,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(numberTh * 2));
        MyProducer[] producerThResult = new MyProducer[numberTh];
        for (int i = 0; i < numberTh; i++) {
            producerThResult[i] = new MyProducer(producer, numberOfLoop, builder.toString(), topic);
            service.execute(producerThResult[i]);
        }
        latch.await();
        for (int i = 0; i < producer.length; i++) {
            producer[i].close();
        }
        service.shutdownNow();
        System.out.println("All Producers done...!");

        // Interpret the result: find the shortest per-thread time
        // (i == 0 initializes the minimum).
        long lowestTime = 0;
        for (int i = 0; i < producerThResult.length; i++) {
            if (i == 0) {
                lowestTime = producerThResult[i].totalTimeinNano;
            } else if (producerThResult[i].totalTimeinNano < lowestTime) {
                lowestTime = producerThResult[i].totalTimeinNano;
            }
        }
        long bytesSend = msgLength * numberOfLoop;
        long durationInMs = TimeUnit.MILLISECONDS.convert(lowestTime, TimeUnit.NANOSECONDS);
        double throughput = (bytesSend * 1.0) / (durationInMs);
        System.out.println("Throughput per Thread=" + throughput + " byte(s)/millisecond");
        System.out.println("All done...!");
    }

    static class MyProducer implements Callable, Runnable {

        Producer[] producer;
        long maxloops;
        String msg;
        String topic;
        long totalTimeinNano = 0;

        MyProducer(Producer[] list, long maxloops, String msg, String topic) {
            this.producer = list;
            this.maxloops = maxloops;
            this.msg = msg;
            this.topic = topic;
        }

        public void run() {
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183648#comment-14183648 ] Bhavesh Mistry commented on KAFKA-1710: --- [~jkreps], Yes, I did this test with 75 threads on my Mac Pro (8 cores) with Snappy compression ON. Do you have any idea how we can improve this enqueue path for a single partition? Maybe have x (the number of CPUs) active buffers? Here is the info about the box:
{code}
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
machdep.cpu.family: 6
machdep.cpu.model: 58
machdep.cpu.extmodel: 3
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 9
machdep.cpu.feature_bits: 3219913727 2142954495
machdep.cpu.leaf7_feature_bits: 641
machdep.cpu.extfeature_bits: 672139520 1
machdep.cpu.signature: 198313
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC POPCNT AES PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ENFSTRG RDWRFSGS
machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
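One way to read the "x active buffers" idea is lock striping. The sketch below is purely illustrative (all names are hypothetical; this is not the producer's actual RecordAccumulator): several independent queues back one partition, application threads pick a stripe by thread ID, and the sender merges the stripes when it drains.
{code}
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch: N independent queues for one partition, picked by thread. */
public class StripedPartitionBuffer {
    private final ConcurrentLinkedQueue<byte[]>[] stripes;

    @SuppressWarnings("unchecked")
    public StripedPartitionBuffer(int nStripes) {
        stripes = new ConcurrentLinkedQueue[nStripes];
        for (int i = 0; i < nStripes; i++)
            stripes[i] = new ConcurrentLinkedQueue<byte[]>();
    }

    /** Threads hash to different stripes, so appends rarely contend. */
    public void append(byte[] record) {
        int idx = (int) (Thread.currentThread().getId() % stripes.length);
        stripes[idx].add(record);
    }

    /** The sender merges all stripes into one batch for the partition. */
    public int drain(List<byte[]> out) {
        int n = 0;
        for (ConcurrentLinkedQueue<byte[]> q : stripes) {
            for (byte[] r = q.poll(); r != null; r = q.poll()) {
                out.add(r);
                n++;
            }
        }
        return n;
    }
}
{code}
The trade-off is that batching and strict ordering within the partition become weaker, since records are merged from several stripes at drain time.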
[jira] [Commented] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186350#comment-14186350 ] Bhavesh Mistry commented on KAFKA-1710: --- [~jkreps], I understand that the current code base adds bytes to shared memory and does compression on the application thread. The old producer seems to do all of this in a background thread. So what changed to move this work into the foreground? Also, if you had to re-engineer this code, how would you remove the synchronization and move everything into the background, so that application threads get more runnable time and the cost of enqueue becomes very low? I am really interested in solving this problem for my application, so I just wanted to know your suggestions/ideas: how would you solve this? Thanks for all your help so far!! Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
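For what it's worth, the direction this question points at (an *AsynKafkaProducer*, as it is called later in this thread) can be sketched as a thin hand-off wrapper. Everything below is hypothetical (names, capacities, raw producer types as in this era's API), not the actual producer internals: application threads only enqueue into a bounded queue, and a single worker thread pays the accumulator synchronization and compression cost.
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Hypothetical wrapper: callers enqueue; one worker does the real send(). */
public class AsyncProducerWrapper {
    private final Producer delegate;
    private final BlockingQueue<ProducerRecord> queue;
    private final Thread worker;
    private volatile boolean running = true;

    public AsyncProducerWrapper(Producer delegate, int capacity) {
        this.delegate = delegate;
        this.queue = new ArrayBlockingQueue<ProducerRecord>(capacity);
        this.worker = new Thread(new Runnable() {
            public void run() { drainLoop(); }
        }, "async-producer-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Application threads only enqueue; they block only when the queue is full. */
    public void send(ProducerRecord record) throws InterruptedException {
        queue.put(record);
    }

    private void drainLoop() {
        try {
            while (running || !queue.isEmpty()) {
                ProducerRecord r = queue.poll(100, TimeUnit.MILLISECONDS);
                if (r != null)
                    delegate.send(r); // accumulator locking and compression happen here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Best-effort close: stop the worker, then close the underlying producer. */
    public void close() throws InterruptedException {
        running = false;
        worker.join();
        delegate.close();
    }
}
{code}
The trade-off is the one already noted in this thread: extra memory for the hand-off queue and an extra thread context switch per record.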
[jira] [Comment Edited] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186350#comment-14186350 ] Bhavesh Mistry edited comment on KAFKA-1710 at 10/28/14 4:40 AM: - [~jkreps], I understand that the current code base adds bytes to shared memory and does compression on the application thread. The old producer seems to do all of this in a background thread. So what changed to move this work into the foreground? Also, if you had to re-engineer this code, how would you remove the synchronization and move everything into the background, so that application threads get more runnable time and the cost of enqueue becomes very low (of course, at the cost of memory)? I am really interested in solving this problem for my application, so I just wanted to know your suggestions/ideas: how would you solve this? Thanks for all your help so far!! Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1710) [New Java Producer Potential Deadlock] Producer Deadlock when all messages is being sent to single partition
[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186350#comment-14186350 ] Bhavesh Mistry edited comment on KAFKA-1710 at 10/28/14 4:58 AM: - [~jkreps], I understand that the current code base adds bytes to shared memory and does compression on the application thread. The old producer seems to do all of this in a background thread. So what changed to move this work into the foreground? Also, if you had to re-engineer this code, how would you remove the synchronization and move everything into the background, so that application threads get more runnable time and the cost of enqueue becomes very low (of course, at the cost of memory)? I am really interested in solving this problem for my application, so I just wanted to know your suggestions/ideas: how would you solve this? Thanks for all your help so far!! The only thing I can think of is an *AsynKafkaProducer*, as mentioned in the previous comments, where [~ewencp] noted that the cost would then fall on the threads enqueueing messages: memory, thread context switching, etc... Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222571#comment-14222571 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 1:31 AM: - The patch provided does not solve the problem. When you have one or more producer instances, the effect is amplified. org.apache.kafka.clients.producer.internals.Sender.run() takes 100% CPU due to an infinite loop when there are no brokers (there is no work to be done to send data). Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava > Fix For: 0.8.2 > > Attachments: KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry reopened KAFKA-1642: --- The patch provided does not solve the problem. When you have more than one producer instance, the effect is amplified. org.apache.kafka.clients.producer.internals.Sender.run() takes 100% CPU due to an infinite loop when there are no brokers. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry updated KAFKA-1642: -- Attachment: 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch Please take a look at the experimental patch, which solves this problem by capturing the correct node state and, less elegantly, by adding an exponential backoff to the run() method via sleeping (many of the values are hard-coded, but it is just experimental). Also, there is another problem: the close() method on the producer does not exit, and the JVM does not shut down gracefully, because the I/O thread is spinning in a while loop during a network outage. That is another edge case. I hope this is helpful and solves the problem. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
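To illustrate the "exponential backoff by sleeping" part of the description (the constants and the runOnce() helper below are hypothetical, not the patch's actual code): if a loop iteration makes no progress, sleep for a doubling interval up to a cap instead of spinning.
{code}
/** Hypothetical sketch of an idle backoff for a sender-style loop. */
public class BackoffLoop {
    private volatile boolean running = true;

    /** One poll/send cycle; assumed to return true when it did useful work. */
    boolean runOnce() { return false; }

    public void loop() throws InterruptedException {
        long backoffMs = 0;
        while (running) {
            if (runOnce()) {
                backoffMs = 0; // progress: reset the backoff
            } else {
                // idle: double the sleep up to 1 second so we stop pegging a core
                backoffMs = Math.min(backoffMs == 0 ? 1 : backoffMs * 2, 1000);
                Thread.sleep(backoffMs);
            }
        }
    }
}
{code}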
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223161#comment-14223161 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 5:26 PM: - [~ewencp], The way to reproduce this is to simulate network instability by turning the network service on and off (or unplugging and replugging the physical cable): it connects, recovers, disconnects, connects again, and so on, and you will see the behavior again and again. The issue is also with connection state management:
{code}
private void initiateConnect(Node node, long now) {
    try {
        log.debug("Initiating connection to node {} at {}:{}.", node.id(), node.host(), node.port());
        // TODO FIX java.lang.IllegalStateException: No entry found for node -3
        // (we need to put before we remove it). This line is the problem: it
        // loses the state of the previous attempt, gets the exception above,
        // and then tries to connect to that node forever, exception after exception.
        this.connectionStates.connecting(node.id(), now);
        selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()),
                         this.socketSendBuffer, this.socketReceiveBuffer);
    } catch (IOException e) {
        /* attempt failed, we'll try again after the backoff */
        connectionStates.disconnectedWhenConnectting(node.id());
        /* maybe the problem is our metadata, update it */
        metadata.requestUpdate();
        log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
    }
}
{code}
In my opinion, regardless of the node's status, the run() method needs to be safeguarded against stealing CPU cycles when there is no state for a node. (Hence I added an exponential sleep as a temporary solution so as not to steal CPU cycles; I think we must protect it somehow and check the execution time...) Please let me know if you need more info; I will be more than happy to reproduce the bug, and we can have a conference call where I can show you the problem. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223161#comment-14223161 ] Bhavesh Mistry commented on KAFKA-1642: --- [~ewencp], The way to reproduce this is to simulate network instability by turning the network service on and off (or unplugging and replugging the physical cable): it connects, recovers, disconnects, connects again, and so on, and you will see the behavior again and again. The issue is also with connection state management:
{code}
private void initiateConnect(Node node, long now) {
    try {
        log.debug("Initiating connection to node {} at {}:{}.", node.id(), node.host(), node.port());
        // TODO FIX java.lang.IllegalStateException: No entry found for node -3
        // (we need to put before we remove it). This line is the problem: it
        // loses the state of the previous attempt, gets the exception above,
        // and then tries to connect to that node forever, exception after exception.
        this.connectionStates.connecting(node.id(), now);
        selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()),
                         this.socketSendBuffer, this.socketReceiveBuffer);
    } catch (IOException e) {
        /* attempt failed, we'll try again after the backoff */
        connectionStates.disconnectedWhenConnectting(node.id());
        /* maybe the problem is our metadata, update it */
        metadata.requestUpdate();
        log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
    }
}
{code}
In my opinion, regardless of the node's status, the run() method needs to be safeguarded against stealing CPU cycles when there is no state for a node. (Hence I added an exponential sleep as a temporary solution so as not to steal CPU cycles; I think we must protect it somehow and check the execution time...) Please let me know if you need more info; I will be more than happy to reproduce the bug, and we can have a conference call where I can show you the problem. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
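The "put before remove" point above could take roughly this shape in the catch block; the isConnecting() check is a hypothetical helper, not an existing ClusterConnectionStates method, so this is only a sketch of the idea: transition a node to disconnected only if an entry for it still exists, so a failed connect cannot throw IllegalStateException and wedge the reconnect path.
{code}
} catch (IOException e) {
    /* attempt failed, we'll try again after the backoff */
    // Hypothetical guard: only mark the node disconnected if we still have
    // an entry for it, so this path cannot throw "No entry found for node".
    if (connectionStates.isConnecting(node.id()))
        connectionStates.disconnected(node.id());
    /* maybe the problem is our metadata, update it */
    metadata.requestUpdate();
    log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
}
{code}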
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223161#comment-14223161 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 5:27 PM: - [~ewencp], The way to reproduce this is to simulate network instability by turning the network service on and off (or unplugging and replugging the physical cable): it connects, recovers, disconnects, connects again, and so on, and you will see the behavior again and again. The issue is also with connection state management:
{code}
private void initiateConnect(Node node, long now) {
    try {
        log.debug("Initiating connection to node {} at {}:{}.", node.id(), node.host(), node.port());
        // TODO FIX java.lang.IllegalStateException: No entry found for node -3
        // (we need to put before we remove it). This line is the problem: it
        // loses the state of the previous attempt, gets the exception above,
        // and then tries to connect to that node forever, exception after exception.
        this.connectionStates.connecting(node.id(), now);
        selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()),
                         this.socketSendBuffer, this.socketReceiveBuffer);
    } catch (IOException e) {
        /* attempt failed, we'll try again after the backoff */
        connectionStates.disconnectedWhenConnectting(node.id());
        /* maybe the problem is our metadata, update it */
        metadata.requestUpdate();
        log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
    }
}
{code}
In my opinion, regardless of the node's status, the run() method needs to be safeguarded against stealing CPU cycles when there is no state for a node. (Hence I added an exponential sleep as a temporary solution so as not to steal CPU cycles; I think we must protect it somehow and check the execution time...) Please let me know if you need more info; I will be more than happy to reproduce the bug, and we can have a conference call where I can show you the problem. Based on the code diff I have done between the 0.8.1.1 tag and this, I think this issue also occurs in 0.8.1.1 as well. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223161#comment-14223161 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 6:57 PM: - [~ewencp], The way to reproduce this is to simulate network instability by turning the network service on and off (or unplugging and replugging the physical cable): it connects, recovers, disconnects, connects again, and so on, and you will see the behavior again and again. The issue is also with connection state management:
{code}
private void initiateConnect(Node node, long now) {
    try {
        log.debug("Initiating connection to node {} at {}:{}.", node.id(), node.host(), node.port());
        // TODO FIX java.lang.IllegalStateException: No entry found for node -3
        // (we need to put before we remove it). This line is the problem: it
        // loses the state of the previous attempt, gets the exception above,
        // and then tries to connect to that node forever, exception after exception.
        this.connectionStates.connecting(node.id(), now);
        selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()),
                         this.socketSendBuffer, this.socketReceiveBuffer);
    } catch (IOException e) {
        /* attempt failed, we'll try again after the backoff */
        connectionStates.disconnectedWhenConnectting(node.id());
        /* maybe the problem is our metadata, update it */
        metadata.requestUpdate();
        log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
    }
}
{code}
In my opinion, regardless of the node's status, the run() method needs to be safeguarded against stealing CPU cycles when there is no state for a node. (Hence I added an exponential sleep as a temporary solution so as not to steal CPU cycles; I think we must protect it somehow and check the execution time...) Please let me know if you need more info; I will be more than happy to reproduce the bug, and we can have a conference call where I can show you the problem. Based on the code diff I have done between the 0.8.1.1 tag and this, I think this issue also occurs in 0.8.1.1 as well. Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223532#comment-14223532 ] Bhavesh Mistry commented on KAFKA-1642: --- Also, regarding KafkaProducer.close(): the method hangs forever because of the following loop.
{code}
Sender.java:

// okay we stopped accepting requests but there may still be
// requests in the accumulator or waiting for acknowledgment,
// wait until these are completed.
while (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0) {
    try {
        run(time.milliseconds());
    } catch (Exception e) {
        log.error("Uncaught error in kafka producer I/O thread: ", e);
    }
}

KafkaProducer.java:

/**
 * Close this producer. This method blocks until all in-flight requests complete.
 */
@Override
public void close() {
    log.trace("Closing the Kafka producer.");
    this.sender.initiateClose();
    try {
        this.ioThread.join(); // THIS IS BLOCKED since the ioThread does not give up.
    } catch (InterruptedException e) {
        throw new KafkaException(e);
    }
    this.metrics.close();
    log.debug("The Kafka producer has closed.");
}
{code}
The issue described in KAFKA-1788 is likely related, but if you look at the close() call stack, the calling thread that initiated close() will hang until the I/O thread dies (and it never dies while there is unsent data and the network is down). Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
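As an application-side mitigation for this kind of hang, one option is to run close() on a helper thread and give up after a bound, since close() in this version takes no timeout. A minimal sketch (class name, raw producer type, and timeout handling are hypothetical):
{code}
import org.apache.kafka.clients.producer.Producer;

public class BoundedClose {
    /** Returns true if close() finished within timeoutMs, false if it hung. */
    public static boolean close(final Producer producer, long timeoutMs)
            throws InterruptedException {
        Thread closer = new Thread(new Runnable() {
            public void run() {
                producer.close();
            }
        }, "producer-closer");
        closer.setDaemon(true); // a hung close() will not keep the JVM alive
        closer.start();
        closer.join(timeoutMs);
        return !closer.isAlive();
    }
}
{code}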
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223532#comment-14223532 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 9:21 PM: - [~ewencp], Also, regarding KafkaProducer.close(): the method hangs forever because of the following loop.
{code}
Sender.java:

// okay we stopped accepting requests but there may still be
// requests in the accumulator or waiting for acknowledgment,
// wait until these are completed.
while (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0) {
    try {
        run(time.milliseconds());
    } catch (Exception e) {
        log.error("Uncaught error in kafka producer I/O thread: ", e);
    }
}

KafkaProducer.java:

/**
 * Close this producer. This method blocks until all in-flight requests complete.
 */
@Override
public void close() {
    log.trace("Closing the Kafka producer.");
    this.sender.initiateClose();
    try {
        this.ioThread.join(); // THIS IS BLOCKED since the ioThread does not give up.
    } catch (InterruptedException e) {
        throw new KafkaException(e);
    }
    this.metrics.close();
    log.debug("The Kafka producer has closed.");
}
{code}
The issue described in KAFKA-1788 is likely related, but if you look at the close() call stack, the calling thread that initiated close() will hang until the I/O thread dies (and it never dies while there is unsent data and the network is down). Thanks, Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223532#comment-14223532 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 9:22 PM: - [~ewencp], Also, regarding KafkaProducer.close(): the method hangs forever because of the following loop:
{code}
// Sender.java
// okay we stopped accepting requests but there may still be
// requests in the accumulator or waiting for acknowledgment,
// wait until these are completed.
while (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0) {
    try {
        run(time.milliseconds());
    } catch (Exception e) {
        log.error("Uncaught error in kafka producer I/O thread: ", e);
    }
}

// KafkaProducer.java
/**
 * Close this producer. This method blocks until all in-flight requests complete.
 */
@Override
public void close() {
    log.trace("Closing the Kafka producer.");
    this.sender.initiateClose();
    try {
        this.ioThread.join(); // THIS IS BLOCKED since ioThread does not give up, so it is all related in my opinion.
    } catch (InterruptedException e) {
        throw new KafkaException(e);
    }
    this.metrics.close();
    log.debug("The Kafka producer has closed.");
}
{code}
The issue described in KAFKA-1788 is likely related, but if you look at the close() call stack, the calling thread that initiated close() will hang until the I/O thread dies (and it never dies while there is unsent data and the network is down). Thanks, Bhavesh was (Author: bmis13): [~ewencp], Also Regarding KafkaProder.close() method hangs for ever because of following loop, and {code} Sender.java // okay we stopped accepting requests but there may still be // requests in the accumulator or waiting for acknowledgment, // wait until these are completed. while (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0) { try { run(time.milliseconds()); } catch (Exception e) { log.error("Uncaught error in kafka producer I/O thread: ", e); } } KafkaProducer.java /** * Close this producer. This method blocks until all in-flight requests complete. */ @Override public void close() { log.trace("Closing the Kafka producer."); this.sender.initiateClose(); try { this.ioThread.join(); // THIS IS BLOCKED since ioThread does not give up. } catch (InterruptedException e) { throw new KafkaException(e); } this.metrics.close(); log.debug("The Kafka producer has closed."); } {code} The issue describe in KAFKA-1788 is likelihood, but if you look the close call stack then calling thread that initiated the close() will hang till io thread dies (which it never dies when data is there and network is down). Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? 
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
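Since close() simply joins the I/O thread, a caller that cannot risk hanging can bound the wait itself. Below is a minimal sketch of such a caller-side guard (a hypothetical wrapper, not part of the producer API):
{code}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.kafka.clients.producer.Producer;

public final class BoundedClose {
    /**
     * Attempt producer.close() but give up after timeoutMs, since close()
     * joins the I/O thread and can block forever while there is unsent
     * data and the network is down.
     */
    public static boolean closeWithTimeout(final Producer<byte[], byte[]> producer, long timeoutMs)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> pendingClose = executor.submit(new Runnable() {
            @Override
            public void run() {
                producer.close(); // may block in ioThread.join()
            }
        });
        try {
            pendingClose.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;          // closed cleanly within the bound
        } catch (TimeoutException e) {
            return false;         // still hung; the caller decides what to do next
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
{code}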
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223571#comment-14223571 ] Bhavesh Mistry commented on KAFKA-1642: --- Here are the exact steps to reproduce the bug (you must have a daemon program continuously running): 1) Start from the happy situation where all brokers are up and everything is running fine. Verify with top -pid JAVA_PID and YourKit that the Kafka network threads are taking less than 4% CPU. 2) Shut down the network (turn it off or pull the eth0 cable), wait a while, and you will see CPU spike to 325% under top (if you have 4 producers); YourKit shows 25% CPU consumption for each Kafka I/O thread. 3) Reconnect the network (the spike will still be there, but CPU comes down to 100% or so after a while) and remain connected for a while. 4) Simulate network failure again (to simulate network instability): repeat steps 1 to 4, waiting 10 or so minutes in between, and you will see the trend of CPU spikes along with the above exception: java.lang.IllegalStateException: No entry found for node -2 Also, I see that Kafka logs excessively when the network is down (YourKit shows it taking more CPU cycles compared to normal). Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
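For reference, the kind of continuously running driver these steps assume might look like the sketch below (broker addresses, topic name, payload size, and serializer settings are illustrative; adjust to the client version under test):
{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerLoad {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b1:9092,b2:9092,b3:9092"); // illustrative hosts
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<byte[], byte[]>(props);
        byte[] payload = new byte[1024];
        // Keep pumping data; cut the network while this runs and watch the
        // kafka-producer-network-thread CPU usage in top or a profiler.
        while (true) {
            producer.send(new ProducerRecord<byte[], byte[]>("test-topic", payload));
        }
    }
}
{code}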
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223626#comment-14223626 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/24/14 10:16 PM: -- Also, there is an issue in my experimental patch: I did not update lastConnectAttemptMs in the connecting() state method, which was meant to solve the IllegalStateException:
{code}
/**
 * Enter the connecting state for the given node.
 * @param node The id of the node we are connecting to
 * @param now The current time.
 */
public void connecting(int node, long now) {
    NodeConnectionState nodeConn = nodeState.get(node);
    if (nodeConn == null) {
        nodeState.put(node, new NodeConnectionState(ConnectionState.CONNECTING, now));
    } else {
        nodeConn.state = ConnectionState.CONNECTING;
        nodeConn.lastConnectAttemptMs = now; // this will capture and update the last connection attempt
    }
}
{code}
was (Author: bmis13): Also, there is issue in my last patch. I did not update the lastConnectAttemptMs...in connecting. {code} /** * Enter the connecting state for the given node. * @param node The id of the node we are connecting to * @param now The current time. */ public void connecting(int node, long now) { NodeConnectionState nodeConn = nodeState.get(node); if(nodeConn == null){ nodeState.put(node, new NodeConnectionState(ConnectionState.CONNECTING, now)); }else{ nodeConn.state = ConnectionState.CONNECTING; nodeConn.lastConnectAttemptMs = now; (This will capture and update last connection attempt) } } {code} > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
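For context, lastConnectAttemptMs is what the reconnect back-off is computed from, so failing to refresh it makes a node look immediately reconnectable. Roughly, the decision works like the following paraphrase (hypothetical names, not the exact client source):
{code}
// Sketch of the back-off decision that lastConnectAttemptMs feeds into.
long connectionDelay(NodeConnectionState state, long now, long reconnectBackoffMs) {
    if (state == null)
        return 0; // never attempted: free to connect right away
    long timeWaited = now - state.lastConnectAttemptMs;
    if (state.state == ConnectionState.DISCONNECTED)
        return Math.max(reconnectBackoffMs - timeWaited, 0); // honor the back-off
    // connecting or connected: no reconnect is needed, so wait "forever"
    return Long.MAX_VALUE;
}
{code}
If connecting() never refreshes lastConnectAttemptMs, timeWaited stays large and the computed delay is always 0, which is one way the tight reconnect loop can arise.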
[jira] [Updated] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry updated KAFKA-1642: -- Affects Version/s: 0.8.1.1 > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223779#comment-14223779 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/25/14 1:31 AM: - [~ewencp], Thanks for looking into this; I really appreciate your response. Also, do you think the rapid connect and disconnect is likewise due to incorrect node state management (in the connecting() method and initiateConnect() as well)? Also, can we take a defensive-coding approach and protect this tight infinite loop by throttling CPU cycles when the start-to-end duration of an iteration falls below some xx ms? That would actually prevent this issue. We had this problem in production, so I just wanted to highlight the impact of 325% CPU and excessive logging. Thanks, Bhavesh was (Author: bmis13): [~ewencp], Thanks for looking into this really appreciate your response. Also, do you think rapid connect and disconnect is also due to incorrect Node state management ? connecting method and initiateConnection also ? Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
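A sketch of the throttling idea suggested here (a hypothetical guard, not an existing producer option; running and run(now) stand in for the Sender's loop flag and its per-iteration work):
{code}
// Hypothetical CPU guard around the I/O loop: if iterations keep completing
// faster than minIterationMs, briefly sleep instead of busy-spinning.
void guardedLoop(org.apache.kafka.common.utils.Time time) throws InterruptedException {
    final long minIterationMs = 5; // illustrative threshold
    int hotIterations = 0;
    while (running) {
        long start = time.milliseconds();
        run(start);                            // one pass of the Sender loop
        long elapsed = time.milliseconds() - start;
        if (elapsed < minIterationMs && ++hotIterations > 100) {
            Thread.sleep(minIterationMs);      // sustained spin: throttle the thread
        } else if (elapsed >= minIterationMs) {
            hotIterations = 0;                 // loop is doing real work again
        }
    }
}
{code}
The counter avoids penalizing a single fast iteration; only a sustained spin (many consecutive near-instant passes) triggers the sleep.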
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224039#comment-14224039 ] Bhavesh Mistry commented on KAFKA-1642: --- Here are some more cases that reproduce this by simulating a network connection issue with only one of the brokers; the problem still persists: Case 1: one broker's connection is down (note: according to ZK, the partition leader is still b1). Have three brokers: b1, b2, b3. 1) Start your daemon program, keep sending data to all the brokers, and continue sending some data. 2) Observe that you have data flowing via netstat -a | grep "b1|b2|b3" (keep pumping data for 5 minutes and observe normal behavior using top -pid or top -p java_pid). 3) Simulate a problem establishing a new TCP connection, while the Java program continues to pump data aggressively (note that the existing TCP connection to b1 is still active and connected): a) sudo vi /etc/hosts b) add the entry "b1 127.0.0.1" c) /etc/init.d/network restart. After a while (5 to 7 minutes) you will see the issue; keep pumping data, and also repeat this for b2 for even more CPU consumption. 4) Under heavy data pumping, the producer will now try to establish a new TCP connection to b1 and will get connection refused (note that CPU spikes up again and remains in that state) just because it could not establish the connection. Case 2: simulate a firewall rule such that only a limited number of TCP connections to each broker are allowed (e.g., 4). Do steps 1, 2 and 3 above. 4) Use an iptables rule to reject traffic. To start an "enforcing firewall": iptables -A OUTPUT -p tcp -m tcp -d b1 --dport 9092 -j REJECT 5) Keep pumping data while iptables rejects (you will see CPU spike to 200% or more depending on the number of producers). To "recover": iptables -D OUTPUT -p tcp -m tcp -d b1 --dport 9092 -j REJECT > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224046#comment-14224046 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/25/14 4:43 AM: - Also, Are you going to port back the patch to 0.8.1.1 version as well ? Please let me know also. Thanks, Bhavesh was (Author: bmis13): Also, Are you going to port back the back to 0.8.1.1 version as well ? Please let me know also. Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224041#comment-14224041 ] Bhavesh Mistry edited comment on KAFKA-1642 at 11/25/14 5:37 AM: - [~ewencp], I hope the above gives you comprehensive steps to reproduce the problems with the run() method. It would be really great if we could make the client more resilient and robust, so that network and broker instability does not cause CPU spikes and degrade application performance. Hence, I would strongly suggest at least measuring the time run(time) takes and keeping some stats; based on some configuration, we could do CPU throttling (if needed) just to be more defensive, or at least detect that the I/O thread is consuming CPU cycles. By the way, the experimental patch still works for the steps described above as well, due to the hard-coded back-off. Any time you have a patch or anything else, please let me know and I will test it out (you have my email id). Once again, thanks for your detailed analysis and for looking at this on short notice. Please look into ClusterConnectionStates and how it manages the state of a node when disconnecting immediately; in particular look at connecting(int node, long now) and this (I feel connecting needs to come before, not after): selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()), this.socketSendBuffer, this.socketReceiveBuffer); this.connectionStates.connecting(node.id(), now); Also, I still feel that producer.close() needs to be looked at (the join() should get a configurable timeout so the calling thread does not hang). Thanks, Bhavesh was (Author: bmis13): [~ewencp], I hope above steps will give you comprehensive steps to reproduce problems with run() method. It would be really great if we can make the client more resilient and robust so network and brokers instability does not cause CPU spikes and degrade application performance. Hence, I would strongly at least detect the time run(time) is taking and do some stats based on some configuration, we can do CPU Throttling (if need) just to be more defensive or at lest detect that io thread is taking CPU cycle. By the way the experimental patch still works for steps describe above as well due to hard coded back-off. Any time you have patch or any thing, please let me know I will test it out ( you have my email id) . Once again thanks for your detail analysis and looking at this at short notice. Please look into to ClusterConnectionStates and how it manage the state of node when disconnecting immediately . please look into connecting(int node, long now) and this (I feel connecting needs to come before not after). 
selector.connect(node.id(), new InetSocketAddress(node.host(), node.port()), this.socketSendBuffer, this.socketReceiveBuffer); this.connectionStates.connecting(node.id(), now); Also, I still feel that produce.close() is also needs to be looked at (join() method with come configuration time out) Thanks, Bhavesh > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
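The reordering proposed above would look roughly like the following (a simplified sketch of initiateConnect under that assumption, not the committed fix): record the CONNECTING state, which captures lastConnectAttemptMs, before the socket call, and fall back to DISCONNECTED if the attempt throws.
{code}
// Sketch: mark the node CONNECTING (capturing lastConnectAttemptMs = now)
// before initiating the socket connection, so even a failed attempt leaves
// a node-state entry behind for the back-off check.
private void initiateConnect(Node node, long now) {
    try {
        this.connectionStates.connecting(node.id(), now);
        selector.connect(node.id(),
                         new InetSocketAddress(node.host(), node.port()),
                         this.socketSendBuffer,
                         this.socketReceiveBuffer);
    } catch (IOException e) {
        // The connection failed immediately; mark it disconnected so the
        // back-off timer applies instead of an IllegalStateException later.
        connectionStates.disconnected(node.id());
        log.debug("Error connecting to node {}", node.id(), e);
    }
}
{code}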
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226751#comment-14226751 ] Bhavesh Mistry commented on KAFKA-1642: --- [~ewencp], Even with the following parameters set to long values, the state of the system still gets into this condition; it does not matter what reconnect.backoff.ms and retry.backoff.ms are set to. Once the node state entry is removed, the timeout becomes 0. Please see the following logs. # 15 minutes reconnect.backoff.ms=90 retry.backoff.ms=90
{code}
2014-11-26 11:01:27.898 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:02:27.903 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:03:27.903 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:04:27.903 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:05:27.904 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:06:27.905 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:07:27.906 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:08:27.908 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:09:27.908 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:10:27.909 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:11:27.909 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:12:27.910 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:13:27.911 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:14:27.912 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:15:27.914 Kafka Drop message topic=.rawlog org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 6 ms.
2014-11-26 11:00:27.613 [kafka-producer-network-thread | heartbeat] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
2014-11-26 11:00:27.613 [kafka-producer-network-thread | rawlog] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
java.lang.IllegalStateException: No entry found for node -1
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:131)
    at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:120)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:407)
    at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:393)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:187)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
    at java.lang.Thread.run(Thread.java:744)
java.lang.IllegalStateException: No entry found for node -3
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:131)
    at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:120)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:407)
    at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:393)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:187)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
    at java.lang.Thread.run(Thread.java:744)
2014-11-26 11:00:27.613 [kafka-producer-network-thread | heartbeat] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
2014-11-26 11:00:27.613 [kafka-producer-network-thread | error] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught
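For context on the two settings discussed above, this is roughly how they are applied to the new producer. A minimal sketch, assuming the 0.8.2 client; the hosts and values here are placeholders (the exact figures in the comment above appear truncated in the archive):
{code}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

// Sketch: raising the backoff settings discussed in the comment above.
// All hosts and values are placeholders, not taken from the report.
public class ProducerConfigSketch {
    static KafkaProducer<byte[], byte[]> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b1.ip:9092,b2.ip:9092"); // placeholder hosts
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("reconnect.backoff.ms", "900000"); // hypothetical 15-minute value
        props.put("retry.backoff.ms", "900000");     // hypothetical value
        return new KafkaProducer<byte[], byte[]>(props);
    }
}
{code}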
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228040#comment-14228040 ] Bhavesh Mistry commented on KAFKA-1642: ---
[~soumen.sarkar], Timeout is one thing, but the IO thread also needs to be safeguarded so that, however aggressively it runs given the network and the data to be sent, it does not consume so many CPU cycles.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.1.1, 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229233#comment-14229233 ] Bhavesh Mistry commented on KAFKA-1642: ---
I just discovered yesterday that the 0.8.1.1 release does not include an officially released jar for the new producer code base (kafka-clients), although the code is there in the 0.8.1.1 branch. That created confusion about porting to 0.8.1.1.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavesh Mistry updated KAFKA-1642: -- Affects Version/s: (was: 0.8.1.1) > [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232061#comment-14232061 ] Bhavesh Mistry commented on KAFKA-1642: ---
Hi [~ewencp], I will not have time to validate this patch till next week. Here are my comments:
1) You still have not addressed the Producer.close() method issue: in the event of a lost network connection or other such events, the IO thread is not killed and the close method hangs. In the patch that I provided, I had a timeout for the join method and interrupted the IO thread. I think we need something similar here.
2) Also, can we please add JMX monitoring for the IO thread, to know how quickly it is running? It would be great to add this so the run() method reports its duration to a metric.
{code}
try {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()) {
        this.ioThreadCpuTime = metrics.sensor("iothread-cpu");
        this.ioThreadCpuTime.add("iothread-cpu-ms", "The rate of CPU cycles used by the IO thread in NANOSECONDS",
            new Rate(TimeUnit.NANOSECONDS) {
                public double measure(MetricConfig config, long now) {
                    return (now - metadata.lastUpdate()) / 1000.0;
                }
            });
    }
} catch (Throwable th) {
    log.warn("Not able to set the CPU time... etc");
}
{code}
3) Please check the final value of *pollTimeout*: if it is constantly zero, then we need to slow the IO thread down.
4) A defensive check is needed for back-off in the run() method when the IO thread is aggressive.
5) When all nodes are disconnected, do you still want to spin the IO thread?
6) When you have a firewall rule that says "you can only have 2 concurrent TCP connections from client to brokers", and the client still has a live TCP connection to the same node (broker) but a new TCP connection is rejected, the node state will be marked as Disconnected in initiateConnect? Are you handling that gracefully?
By the way, thank you very much for the quick reply and the new patch. I appreciate your help.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ?
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
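Point 1 in the comment above describes a bounded close. A minimal sketch of that idea follows, assuming a hypothetical ioThread field for the producer's network thread and an initiateClose() hook on the Sender; neither name is taken from the actual client code:
{code}
// Sketch of a bounded close: wait for the IO thread for a limited time and
// interrupt it if it has not exited. Field and method names are hypothetical.
public void close(long timeoutMs) {
    this.sender.initiateClose();          // assumed hook that asks the run() loop to stop
    try {
        this.ioThread.join(timeoutMs);    // bounded wait instead of an unbounded join()
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    if (this.ioThread.isAlive()) {
        this.ioThread.interrupt();        // last resort: force the thread out of a blocking poll()
    }
}
{code}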
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232061#comment-14232061 ] Bhavesh Mistry edited comment on KAFKA-1642 at 12/2/14 8:04 PM:
Hi [~ewencp], I will not have time to validate this patch till next week. Here are my comments:
1) The Producer.close() method issue is not addressed by the patch. In the event of a lost network connection or other such events, the IO thread is not killed and the close method hangs. In the patch that I provided, I had a timeout for the join method and interrupted the IO thread. I think we need a similar solution.
2) Also, can we please add JMX monitoring for the IO thread, to know how quickly it is running? It would be great to add this so the run() method reports its duration to a metric in nanoseconds.
{code}
try {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()) {
        this.ioThreadCpuTime = metrics.sensor("iothread-cpu");
        this.ioThreadCpuTime.add("iothread-cpu-ms", "The rate of CPU cycles used by the IO thread in NANOSECONDS",
            new Rate(TimeUnit.NANOSECONDS) {
                public double measure(MetricConfig config, long now) {
                    return (now - metadata.lastUpdate()) / 1000.0;
                }
            });
    }
} catch (Throwable th) {
    log.warn("Not able to set the CPU time... etc");
}
{code}
3) Please check the final value of *pollTimeout*: if it is constantly zero, then we need to slow the IO thread down.
4) A defensive check is needed for back-off in the run() method when the IO thread is aggressive:
{code}
long sleepInMs;
int continuousRetry = 0;
while (running) {
    long start = time.milliseconds();
    try {
        run(time.milliseconds());
    } catch (Exception e) {
        log.error("Uncaught error in kafka producer I/O thread: ", e);
    } finally {
        long durationInMs = time.milliseconds() - start;
        // TODO: do an exponential back-off sleep here to avoid stealing CPU cycles.
        // How much to back off for the edge case?
        if (durationInMs < 200) {
            if (client.isAllRegisteredNodesDown()) {
                continuousRetry++;
                // TODO: make this constant a configuration. When do we reset this
                // interval so we can try aggressively again?
                sleepInMs = ((long) Math.pow(2, continuousRetry)) * 500;
            } else {
                sleepInMs = 500;
                continuousRetry = 0;
            }
            // Wait until the desired next time arrives using a nanosecond-accuracy
            // timer (wait(time) isn't accurate enough on most platforms).
            try {
                // TODO: sleep is not a good solution.
                Thread.sleep(sleepInMs);
            } catch (InterruptedException e) {
                log.error("While sleeping, someone interrupted this thread, probably the close() method on the producer");
            }
        }
    }
}
{code}
5) When all nodes are disconnected, do you still want to spin the IO thread?
6) When you have a firewall rule that says "you can only have 2 concurrent TCP connections from client to brokers", and the client still has a live TCP connection to the same node (broker) but new TCP connections are rejected, the node state will be marked as Disconnected in initiateConnect? Is this case handled gracefully?
By the way, thank you very much for the quick reply and the new patch. I appreciate your help.
Thanks,
Bhavesh

was (Author: bmis13):
Hi [~ewencp], I will not have time to validate this patch till next week. Here are my comments:
1) You still have not addressed the Producer.close() method issue: in the event of a lost network connection or other such events, the IO thread is not killed and the close method hangs. In the patch that I provided, I had a timeout for the join method and interrupted the IO thread. I think we need something similar here.
2) Also, can we please add JMX monitoring for the IO thread, to know how quickly it is running.
It would be great to add this so the run() method reports its duration to a metric. {code} try{ ThreadMXBean bean = ManagementFactory.getThreadMXBean( ); if(bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()){ this.ioTheadCPUTime = me
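The TODO in point 4 notes that Thread.sleep is not a good mechanism for the back-off. One alternative, purely illustrative and not from any posted patch, is to park the IO thread instead, so that a closing thread can end the wait immediately:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Illustrative alternative to Thread.sleep in the back-off loop above: park
// the IO thread for the computed interval. A thread running close() can call
// LockSupport.unpark(ioThread) to wake it without raising InterruptedException.
public class BackOffSketch {
    static void backOff(long sleepInMs) {
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(sleepInMs));
    }
}
{code}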
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234297#comment-14234297 ] Bhavesh Mistry commented on KAFKA-1642: ---
[~ewencp],
1) I will post toward KAFKA-1788 and perhaps link the issue.
2) True, some sort of measure would be great: the 5, 10, 25, 50, 95, and 99 percentiles of execution time. The point is just to measure the duration and report the rate of execution.
3) I agree with what you are saying, and I have observed the same behavior. My only recommendation is to add some intelligence to *timeouts*: if the timeout is zero consecutively for a long period, then there is a problem. (A little more defensive; see the sketch after this message.)
4) Again I agree with your point, but in your previous comments you had mentioned that you may consider having back-off logic further up the chain. So I was just checking whether run() is the best place to do that check. Again, maybe add intelligence here: if you get consecutive "Exception"s, then the likelihood of high CPU is high.
5) Ok. I agree with what you are saying: data needs to be de-queued so more data can be en-queued, even in the event of a lost network. Is my understanding correct?
6) All I am saying is that a network firewall rule (such as only 2 TCP connections per source host), or brokers running out of file descriptors, can mean a new connection to a broker is not established while the client still has a live and active TCP connection to the same broker. But based on what I see, the method *initiateConnect* will mark the entire broker or node status as disconnected. Is this expected behavior? So the question is: will the client continue to send data?
Thank you very much for entertaining my questions so far; I will test out the patch next week.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
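Point 3 above amounts to watching for a long run of zero poll timeouts. A small illustrative sketch; the threshold, cap, and method name are invented for illustration and are not from any patch in this thread:
{code}
// Illustrative guard: if poll() keeps being asked to wait 0 ms, something is
// wrong upstream, so impose a growing floor on the timeout. The threshold
// (100) and the 1-second cap are arbitrary assumptions.
public class PollTimeoutGuard {
    private int consecutiveZeroTimeouts = 0;

    long guardedPollTimeout(long pollTimeout) {
        if (pollTimeout == 0) {
            consecutiveZeroTimeouts++;
            if (consecutiveZeroTimeouts > 100) {
                return Math.min(10L * consecutiveZeroTimeouts, 1000L); // back off, max 1 s
            }
        } else {
            consecutiveZeroTimeouts = 0; // healthy again; allow aggressive polling
        }
        return pollTimeout;
    }
}
{code}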
[jira] [Commented] (KAFKA-1788) producer record can stay in RecordAccumulator forever if leader is no available
[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234595#comment-14234595 ] Bhavesh Mistry commented on KAFKA-1788: ---
We also need to fix the producer close, which hangs the JVM because the join() on the io thread does not return. Please refer to KAFKA-1642 for more details. The Kafka core devs need to give guidance on how to solve this problem. Please see the comments below from that linked issue.
1) The Producer.close() method issue is not addressed by the patch. In the event of a lost network connection or other such events, the IO thread is not killed and the close method hangs. In the patch that I provided, I had a timeout for the join method and interrupted the IO thread. I think we need a similar solution.
[~ewencp], 1. I'm specifically trying to address the CPU usage here. I realize from your perspective they are closely related, since both can be triggered by a loss of network connectivity, but internally they're really separate issues; the CPU usage has to do with incorrect timeouts, and the join() issue is due to the lack of timeouts on produce operations. That's why I pointed you toward KAFKA-1788. If a timeout is added for data in the producer, that would resolve the close issue as well, since any data waiting in the producer would eventually time out and the IO thread could exit. I think that's the cleanest solution since it solves both problems with a single setting (the amount of time you're willing to wait before discarding data). If you think a separate timeout specifically for Producer.close() is worthwhile, I'd suggest filing a separate JIRA for that.
> producer record can stay in RecordAccumulator forever if leader is no > available > --- > > Key: KAFKA-1788 > URL: https://issues.apache.org/jira/browse/KAFKA-1788 > Project: Kafka > Issue Type: Bug > Components: core, producer >Affects Versions: 0.8.2 >Reporter: Jun Rao >Assignee: Jun Rao > Labels: newbie++ > Fix For: 0.8.3 > > > In the new producer, when a partition has no leader for a long time (e.g., > all replicas are down), the records for that partition will stay in the > RecordAccumulator until the leader is available. This may cause the > bufferpool to be full and the callback for the produced message to block for > a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
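The record timeout described above would amount to expiring batches that have sat in the accumulator too long, so their callbacks fire and the IO thread can drain and exit. A simplified sketch of that idea; RecordBatchStub and its fields are hypothetical stand-ins, not the actual RecordAccumulator API:
{code}
import java.util.Iterator;
import java.util.List;
import org.apache.kafka.common.errors.TimeoutException;

// Hypothetical sketch of batch expiry: fail any batch older than maxWaitMs so
// its per-record callbacks run and buffered memory can be released.
public class BatchExpirer {
    static class RecordBatchStub {
        final long createdMs;
        RecordBatchStub(long createdMs) { this.createdMs = createdMs; }
        void fail(RuntimeException e) { /* complete per-record callbacks exceptionally */ }
    }

    static void expireOldBatches(List<RecordBatchStub> batches, long nowMs, long maxWaitMs) {
        Iterator<RecordBatchStub> it = batches.iterator();
        while (it.hasNext()) {
            RecordBatchStub batch = it.next();
            if (nowMs - batch.createdMs > maxWaitMs) {
                batch.fail(new TimeoutException("Batch expired after " + maxWaitMs + " ms"));
                it.remove();
            }
        }
    }
}
{code}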
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239063#comment-14239063 ] Bhavesh Mistry commented on KAFKA-1642: ---
[~stevenz3wu], 0.8.2 is very well tested and has worked well under heavy load. This bug is rare and only happens when a broker or the network has issues. We have been producing about 7 to 10 TB per day using this new producer, so 0.8.2 is very safe to use in production. It has survived the peak traffic of the year on a large e-commerce site. So I am fairly confident that the new Java API indeed does true round-robin and is much faster than the Scala-based API.
[~ewencp], I will verify the patch by the end of this Friday, but do let me know your understanding based on my last comment.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239063#comment-14239063 ] Bhavesh Mistry edited comment on KAFKA-1642 at 12/9/14 6:53 AM:
[~stevenz3wu], 0.8.2 is very well tested and has worked well under heavy load. This bug is rare and only happens when a broker or the network has issues. We have been producing about 7 to 10 TB per day using this new producer, so 0.8.2 is very safe to use in production. It has survived the peak traffic of the year on a large e-commerce site. So I am fairly confident that the new Java API indeed does true round-robin and is much faster than the Scala-based API.
[~ewencp], I will verify the patch by the end of this Friday, but do let me know your understanding based on my last comment. The goal is to put this issue to rest and cover all the use cases.
Thanks,
Bhavesh

was (Author: bmis13):
[~stevenz3wu], 0.8.2 is very well tested and has worked well under heavy load. This bug is rare and only happens when a broker or the network has issues. We have been producing about 7 to 10 TB per day using this new producer, so 0.8.2 is very safe to use in production. It has survived the peak traffic of the year on a large e-commerce site. So I am fairly confident that the new Java API indeed does true round-robin and is much faster than the Scala-based API.
[~ewencp], I will verify the patch by the end of this Friday, but do let me know your understanding based on my last comment.
Thanks,
Bhavesh
> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network > connection is lost > --- > > Key: KAFKA-1642 > URL: https://issues.apache.org/jira/browse/KAFKA-1642 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 0.8.2 >Reporter: Bhavesh Mistry >Assignee: Ewen Cheslack-Postava >Priority: Blocker > Fix For: 0.8.2 > > Attachments: > 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, > KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, > KAFKA-1642_2014-10-23_16:19:41.patch > > > I see my CPU spike to 100% when network connection is lost for while. It > seems network IO thread are very busy logging following error message. Is > this expected behavior ? > 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka > producer I/O thread: > java.lang.IllegalStateException: No entry found for node -2 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110) > at > org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394) > at > org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175) > at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) > at java.lang.Thread.run(Thread.java:744) > Thanks, > Bhavesh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1788) producer record can stay in RecordAccumulator forever if leader is no available
[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249266#comment-14249266 ] Bhavesh Mistry commented on KAFKA-1788: ---
[~jkreps], Can we just take a quick look at the NodeConnectionState? If all registered nodes are down, then exit quickly, or attempt to connect? This would keep an accurate status of all registered nodes... (maybe we can do a TCP ping to all nodes). I am not sure, if the producer key is fixed to only one broker, whether it still has the status of all nodes. Here is the reference code:
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NodeConnectionState.java
I did this in an experimental patch for KAFKA-1642 (but used a hard-coded timeout for the join method).
Thanks,
Bhavesh
> producer record can stay in RecordAccumulator forever if leader is no > available > --- > > Key: KAFKA-1788 > URL: https://issues.apache.org/jira/browse/KAFKA-1788 > Project: Kafka > Issue Type: Bug > Components: core, producer >Affects Versions: 0.8.2 >Reporter: Jun Rao >Assignee: Jun Rao > Labels: newbie++ > Fix For: 0.8.3 > > > In the new producer, when a partition has no leader for a long time (e.g., > all replicas are down), the records for that partition will stay in the > RecordAccumulator until the leader is available. This may cause the > bufferpool to be full and the callback for the produced message to block for > a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
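The "TCP ping" suggested above could be as simple as a bounded connect attempt. A minimal sketch; the method is purely illustrative and not part of the client code:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal "TCP ping": returns true if the broker accepts a TCP connection
// within the given timeout. Host, port, and timeout are caller-supplied.
public class TcpPing {
    static boolean tcpPing(String host, int port, int timeoutMs) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }
}
{code}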
[jira] [Comment Edited] (KAFKA-1788) producer record can stay in RecordAccumulator forever if leader is no available
[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249266#comment-14249266 ] Bhavesh Mistry edited comment on KAFKA-1788 at 12/17/14 1:26 AM: -
[~jkreps], Can we just take a quick look at the NodeConnectionState? If all registered nodes are down, then exit quickly, or attempt to connect? This would keep an accurate status of all registered nodes... (maybe we can do a TCP ping to all nodes). I am not sure, if the producer key is fixed to only one broker, whether it still has the status of all nodes. Here is the reference code:
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NodeConnectionState.java
I did this in an experimental patch for KAFKA-1642 (but used a hard-coded timeout for the join method on the IO thread, and interrupted it if it did not get killed).
Thanks,
Bhavesh

was (Author: bmis13):
[~jkreps], Can we just take a quick look at the NodeConnectionState? If all registered nodes are down, then exit quickly, or attempt to connect? This would keep an accurate status of all registered nodes... (maybe we can do a TCP ping to all nodes). I am not sure, if the producer key is fixed to only one broker, whether it still has the status of all nodes. Here is the reference code:
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
https://github.com/apache/kafka/blob/0.8.2/clients/src/main/java/org/apache/kafka/clients/NodeConnectionState.java
I did this in an experimental patch for KAFKA-1642 (but used a hard-coded timeout for the join method).
Thanks,
Bhavesh
> producer record can stay in RecordAccumulator forever if leader is no > available > --- > > Key: KAFKA-1788 > URL: https://issues.apache.org/jira/browse/KAFKA-1788 > Project: Kafka > Issue Type: Bug > Components: core, producer >Affects Versions: 0.8.2 >Reporter: Jun Rao >Assignee: Jun Rao > Labels: newbie++ > Fix For: 0.8.3 > > > In the new producer, when a partition has no leader for a long time (e.g., > all replicas are down), the records for that partition will stay in the > RecordAccumulator until the leader is available. This may cause the > bufferpool to be full and the callback for the produced message to block for > a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257686#comment-14257686 ] Bhavesh Mistry commented on KAFKA-1642: ---
[~ewencp], The patch does indeed solve the high CPU problem reported by this bug. I have tested with all brokers down, one broker down, and two brokers down. Here are some interesting observations from YourKit:
0) Overall, the patch has also brought down consumption in the normal healthy (happy) case where everything is up and running. In the old code (without the patch), I used to see about 10% of the process's overall CPU used by the io threads (4 in my case); it is reduced to 5% or less now with the patch.
1) When two brokers are down, I occasionally see the IO thread blocked. (I did not see this when one broker is down.)
{code}
kafka-producer-network-thread | rawlog [BLOCKED] [DAEMON]
org.apache.kafka.clients.producer.internals.Metadata.fetch() Metadata.java:70
java.lang.Thread.run() Thread.java:744
{code}
2) The record-error-rate metric remains zero despite the following firewall rules. In my opinion, it should have invoked org.apache.kafka.clients.producer.Callback, but I did not see that happening with either one or two brokers down. Should I file another issue for this? Please confirm.
{code}
00100 reject tcp from me to b1.ip dst-port 9092
00200 reject tcp from me to b2.ip dst-port 9092
{code}
{code}
class LoggingCallBaHandler implements Callback {
    /**
     * A callback method the user can implement to provide asynchronous
     * handling of request completion. This method will be called when the
     * record sent to the server has been acknowledged. Exactly one of the
     * arguments will be non-null.
     *
     * @param metadata
     *            The metadata for the record that was sent (i.e. the
     *            partition and offset). Null if an error occurred.
     * @param exception
     *            The exception thrown during processing of this record.
     *            Null if no error occurred.
     */
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            exception.printStackTrace();
        }
    }
}
{code}
I do not see any exception at all on the console... not sure why?
3) The application does NOT gracefully shut down when one or more brokers are down.
(The io thread never exits; this is a known issue.)
{code}
"SIGTERM handler" daemon prio=5 tid=0x7f8bd79e4000 nid=0x17907 waiting for monitor entry [0x00011e906000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bd5159000 nid=0x1cb0b waiting for monitor entry [0x00011e803000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdd147800 nid=0x15d0b waiting for monitor entry [0x00011e30a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdf82 nid=0x770b waiting for monitor entry [0x00011e207000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdc393800 nid=0x1c30f waiting for monitor entry [0x00011e104000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
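As a side note on point 2 above, the callback only runs if it is supplied on each send call. A usage sketch, assuming the generic 0.8.2 client API; the topic name and payload are placeholders:
{code}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Usage sketch: attach the LoggingCallBaHandler defined above on each send so
// per-record errors surface. Topic name and payload are placeholders.
public class SendSketch {
    static void sendWithCallback(KafkaProducer<byte[], byte[]> producer) {
        byte[] payload = "sample".getBytes();
        producer.send(new ProducerRecord<byte[], byte[]>("rawlog", payload),
                new LoggingCallBaHandler());
    }
}
{code}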
[jira] [Comment Edited] (KAFKA-1788) producer record can stay in RecordAccumulator forever if leader is no available
[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257691#comment-14257691 ] Bhavesh Mistry edited comment on KAFKA-1788 at 12/23/14 11:44 PM: --
Hi All, I did NOT try this patch, but when one, two, or all brokers are down, I see that the application will not shut down due to the close() method: the application does not gracefully shut down when one or more brokers are down. (The io thread never exits; this is a known issue.)
{code}
"SIGTERM handler" daemon prio=5 tid=0x7f8bd79e4000 nid=0x17907 waiting for monitor entry [0x00011e906000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bd5159000 nid=0x1cb0b waiting for monitor entry [0x00011e803000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdd147800 nid=0x15d0b waiting for monitor entry [0x00011e30a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdf82 nid=0x770b waiting for monitor entry [0x00011e207000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdc393800 nid=0x1c30f waiting for monitor entry [0x00011e104000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Shutdown.exit(Shutdown.java:212)
    - waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(Terminator.java:52)
    at sun.misc.Signal$1.run(Signal.java:212)
    at java.lang.Thread.run(Thread.java:744)

"Thread-4" prio=5 tid=0x7f8bdb39f000 nid=0xa107 in Object.wait() [0x00011ea89000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.$$YJP$$wait(Native Method)
    at java.lang.Object.wait(Object.java)
    at java.lang.Thread.join(Thread.java:1280)
    - locked <0x000705c2f650> (a org.apache.kafka.common.utils.KafkaThread)
    at java.lang.Thread.join(Thread.java:1354)
    at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:322)
    at

"kafka-producer-network-thread | error" daemon prio=5 tid=0x7f8bd814e000 nid=0x7403 runnable [0x00011e6c]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.$$YJP$$kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.kevent0(KQueueArrayWrapper.java)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:200)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:103)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x000705c109f8> (a sun.nio.ch.Util$2)
    - locked <0x000705c109e8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000705c105c8> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.kafka.common.network.Selector.select(Selector.java:322)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:212)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
    at java.lang.Thread.run(Thread.java:744)
{code}
Thanks,
Bhavesh

was (Author: bmis13):
HI All, I did NOT try this patch, but when one or two or all brokers are down then I see application will not shutdown due to close() method: Application does not gracefully shutdown whe
[jira] [Commented] (KAFKA-1788) producer record can stay in RecordAccumulator forever if leader is no available
[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257691#comment-14257691 ] Bhavesh Mistry commented on KAFKA-1788: --- Hi All, I did NOT try this patch, but when one, two, or all brokers are down, the application will not shut down because of the close() method: the application does not shut down gracefully when one or more brokers are down (the I/O thread never exits; this is a known issue).
{code}
"SIGTERM handler" daemon prio=5 tid=0x7f8bd79e4000 nid=0x17907 waiting for monitor entry [0x00011e906000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bd5159000 nid=0x1cb0b waiting for monitor entry [0x00011e803000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdd147800 nid=0x15d0b waiting for monitor entry [0x00011e30a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdf82 nid=0x770b waiting for monitor entry [0x00011e207000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdc393800 nid=0x1c30f waiting for monitor entry [0x00011e104000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"Thread-4" prio=5 tid=0x7f8bdb39f000 nid=0xa107 in Object.wait() [0x00011ea89000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.$$YJP$$wait(Native Method)
	at java.lang.Object.wait(Object.java)
	at java.lang.Thread.join(Thread.java:1280)
	- locked <0x000705c2f650> (a org.apache.kafka.common.utils.KafkaThread)
	at java.lang.Thread.join(Thread.java:1354)
	at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:322)
	at

"kafka-producer-network-thread | error" daemon prio=5 tid=0x7f8bd814e000 nid=0x7403 runnable [0x00011e6c]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.KQueueArrayWrapper.$$YJP$$kevent0(Native Method)
	at sun.nio.ch.KQueueArrayWrapper.kevent0(KQueueArrayWrapper.java)
	at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:200)
	at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:103)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000705c109f8> (a sun.nio.ch.Util$2)
	- locked <0x000705c109e8> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000705c105c8> (a sun.nio.ch.KQueueSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.apache.kafka.common.network.Selector.select(Selector.java:322)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:212)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
	at java.lang.Thread.run(Thread.java:744)
{code}
> producer record can stay in RecordAccumulator forever if leader is not
> available
> ---
>
> Key: KAFKA-1788
> URL: https://issues.apache.org/jira/browse/KAFKA-1788
>
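The "Thread-4" stack above shows why the hang occurs: KafkaProducer.close() joins the sender's KafkaThread with no time bound, so if the I/O thread never exits, close() blocks forever and the SIGTERM handlers queue up behind Shutdown.exit(). As a minimal, hypothetical workaround sketch (not part of any patch; the class and method names here are illustrative, and only standard JDK threading APIs are used), an application can run close() on a helper thread and bound the wait with Thread.join(timeout):
{code}
import org.apache.kafka.clients.producer.KafkaProducer;

// Hypothetical helper: bound the shutdown wait so a wedged close()
// cannot block JVM exit.
public final class BoundedClose {

    public static void closeWithTimeout(final KafkaProducer<byte[], byte[]> producer,
                                        long timeoutMs) throws InterruptedException {
        Thread closer = new Thread(new Runnable() {
            @Override
            public void run() {
                producer.close(); // may block forever while joining the I/O thread
            }
        }, "producer-closer");
        closer.setDaemon(true); // a stuck closer thread must not keep the JVM alive
        closer.start();
        closer.join(timeoutMs); // wait at most timeoutMs for a clean close
        if (closer.isAlive()) {
            System.err.println("Producer close() did not complete within "
                    + timeoutMs + " ms; proceeding with shutdown anyway.");
        }
    }
}
{code}
This only bounds the wait; any records still sitting in the RecordAccumulator are abandoned when the JVM exits, which is exactly the data-loss trade-off this ticket is about.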
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257686#comment-14257686 ] Bhavesh Mistry edited comment on KAFKA-1642 at 12/24/14 12:01 AM: -- [~ewencp], The patch indeed solves the high CPU problem reported by this bug. I have tested with all brokers down, one broker down, and two brokers down (except for the last use case, where one of the brokers runs out of socket file descriptors, a rare case). I am sorry for the late response; I got busy with other stuff, so testing got delayed. Here are some interesting observations from YourKit:

0) Overall, the patch has also brought down CPU consumption in the normal, healthy case where everything is up and running. With the old code (without the patch), I used to see the I/O threads (4 in my case) use about 10% of the process's overall CPU; with the patch it has been reduced to 5% or less.

1) When two brokers are down, I occasionally see an I/O thread blocked. (I did not see this when only one broker is down.)
{code}
kafka-producer-network-thread | rawlog [BLOCKED] [DAEMON]
	org.apache.kafka.clients.producer.internals.Metadata.fetch() Metadata.java:70
	java.lang.Thread.run() Thread.java:744
{code}

2) The record-error-rate metric remains zero despite the following firewall rules. In my opinion, the producer should have invoked org.apache.kafka.clients.producer.Callback, but I did not see that happening with either one or two brokers down. Should I file another issue for this? Please confirm. (A usage sketch follows the thread dump below.)
{code}
sudo ipfw add reject tcp from me to b1.ip dst-port 9092
sudo ipfw add reject tcp from me to b2.ip dst-port 9092
00100 reject tcp from me to b1.ip dst-port 9092
00200 reject tcp from me to b2.ip dst-port 9092
{code}
{code}
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

class LoggingCallBaHandler implements Callback {

    /**
     * A callback method the user can implement to provide asynchronous
     * handling of request completion. This method will be called when the
     * record sent to the server has been acknowledged. Exactly one of the
     * arguments will be non-null.
     *
     * @param metadata
     *            The metadata for the record that was sent (i.e. the
     *            partition and offset). Null if an error occurred.
     * @param exception
     *            The exception thrown during processing of this record.
     *            Null if no error occurred.
     */
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            exception.printStackTrace();
        }
    }
}
{code}
I do not see any exception at all on the console; I am not sure why.

3) The application does NOT shut down gracefully when one or more brokers are down
(the I/O thread never exits; this is a known issue):
{code}
"SIGTERM handler" daemon prio=5 tid=0x7f8bd79e4000 nid=0x17907 waiting for monitor entry [0x00011e906000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bd5159000 nid=0x1cb0b waiting for monitor entry [0x00011e803000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdd147800 nid=0x15d0b waiting for monitor entry [0x00011e30a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)

"SIGTERM handler" daemon prio=5 tid=0x7f8bdf82 nid=0x770b waiting for monitor entry [0x00011e207000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00070008f7c0> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Terminator$1.handle(Terminator.java:52)
	at sun.misc.Signal$1.run(Signal.java:212)
	at java.lang.Thread.run(Thread.java:744)
{code}
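Regarding point 2 above, here is a minimal sketch of how the logging callback would be wired into send(). The topic name, payload, and broker addresses are placeholders rather than values from this report; send(ProducerRecord, Callback) is the standard new-producer API, and LoggingCallBaHandler is the class defined earlier in this comment. One would expect onCompletion() to fire with a non-null exception once delivery of the record fails, and record-error-rate to move above zero accordingly:
{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; these hosts match the ones blocked
        // by the ipfw rules above.
        props.put("bootstrap.servers", "b1.ip:9092,b2.ip:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        KafkaProducer<byte[], byte[]> producer =
                new KafkaProducer<byte[], byte[]>(props);
        // The callback should receive a non-null exception when delivery fails.
        producer.send(
                new ProducerRecord<byte[], byte[]>("test-topic", "payload".getBytes()),
                new LoggingCallBaHandler());
        producer.close();
    }
}
{code}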