[jira] [Resolved] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-6134.
------------------------------------
    Resolution: Fixed

> High memory usage on controller during partition reassignment
> -------------------------------------------------------------
>
>                 Key: KAFKA-6134
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6134
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Critical
>              Labels: regression
>             Fix For: 1.0.0, 0.11.0.2
>
>         Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller
> is performing partition reassignment in 0.11. After investigation, we found
> that the controller event queue was using most of the retained memory. In
> particular, we found several thousand {{PartitionReassignment}} objects, each
> one containing one fewer partition than the previous one (see the attached
> image).
> From the code, it seems clear why this is happening. We have a watch on the
> partition reassignment path which adds the {{PartitionReassignment}} object
> to the event queue:
> {code}
> override def handleDataChange(dataPath: String, data: Any): Unit = {
>   val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>   eventManager.put(controller.PartitionReassignment(partitionReassignment))
> }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the
> partitions in the reassignment. After we complete reassignment for each
> partition, we remove that partition and update the node in zookeeper.
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above which adds a new event in the queue. So what
> you get is an n^2 increase in memory where n is the number of partitions.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
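The quadratic growth described in the issue can be reproduced with a small simulation. This is only a sketch under stated assumptions: `Event`, `EagerReassignment`, `LazyReassignment`, and `retainedEntries` are hypothetical names invented for illustration, not the controller's real types, and the model assumes all n watch notifications are queued before any of them is processed. With the payload copied into every event, the queue retains (n-1) + (n-2) + ... + 0 = n(n-1)/2 partition entries, which is 499,500 for n = 1000.

```scala
import scala.collection.mutable

object ReassignmentQueueSketch {
  // Hypothetical stand-ins for controller events; not Kafka's actual classes.
  sealed trait Event
  // Mirrors the behavior in the report: each event holds its own copy of the remaining partitions.
  final case class EagerReassignment(partitions: Set[Int]) extends Event
  // A marker event with no payload; a handler would re-read the current state instead.
  case object LazyReassignment extends Event

  /** Total number of partition entries retained by the queued events after n watch firings. */
  def retainedEntries(n: Int, eager: Boolean): Long = {
    val queue = mutable.Queue.empty[Event]
    var remaining: Set[Int] = (0 until n).toSet
    // Completing one partition rewrites the reassignment node, which fires the watch once.
    for (p <- 0 until n) {
      remaining -= p
      val event: Event = if (eager) EagerReassignment(remaining) else LazyReassignment
      queue.enqueue(event)
    }
    queue.iterator.map {
      case EagerReassignment(ps) => ps.size.toLong
      case LazyReassignment      => 0L
    }.sum
  }
}
```

Calling `retainedEntries(n, eager = false)` returns 0 for any n, since marker events carry no partition data.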
[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221718#comment-16221718 ]

ASF GitHub Bot commented on KAFKA-6134:
---------------------------------------

Github user hachikuji closed the pull request at:

    https://github.com/apache/kafka/pull/4141
[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores
[ https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621 ]

Richard Yu edited comment on KAFKA-4499 at 10/27/17 5:04 AM:
-------------------------------------------------------------

I have realized that this particular way is flawed and should be reconsidered. As I understand it, there is another approach, which uses methods provided by TreeMap:
ThreadCache is used as a general storage area, and would be used to access specific NamedCaches, with each CachingWindowStore being assigned one name. ThreadCache has a method defined as all(String namespace) which could return all keys present in the specified NamedCache. However, that does not necessarily mean the keys would be sorted. Instead, we could use TreeMap's descendingKeySet() method to get all keys in descending order. This would require defining a new method in NamedCache that calls descendingKeySet(), but in effect it would give us a sorted set from which to operate.

was (Author: yohan123):
I have realized that this particular way is flawed and should be reconsidered. As I understand it, there is another approach, which uses methods provided by TreeMap:
ThreadCache is used as a general storage area, and would be used to access specific NamedCaches, with each CachingWindowStore being assigned one name. ThreadCache has a method defined as all(String namespace) which could return all keys present in the specified NamedCache. However, that does not necessarily mean the keys would be sorted. Instead, we could use TreeMap's descendingKeySet() method to get all keys in ascending order. This would require defining a new method in NamedCache that calls descendingKeySet(), but in effect it would give us a sorted set from which to operate.

> Add "getAllKeys" API for querying windowed KTable stores
> --------------------------------------------------------
>
>                 Key: KAFKA-4499
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4499
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>              Labels: needs-kip
>         Attachments: 4499-CachingWindowStore-v1.patch
>
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides
> the method {{all()}} to scan the whole store (ie, it returns an iterator over all
> stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}}
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know
> what keys are stored in order to query it. It would be useful to provide
> possible APIs like this (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (ie, RocksDB
> instances), it would be quite inefficient with the current store design. Thus,
> this feature also requires redesigning the underlying window store itself.
> Because this is a major change, a KIP
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
> is required. The KIP should cover the actual API design as well as the store
> refactoring.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
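The ordering point in the edited comment can be checked directly against `java.util.TreeMap`: `navigableKeySet()` iterates keys in ascending order and `descendingKeySet()` iterates them in descending order. The sketch below is illustrative only. `KeyOrderSketch` and `sampleCache` are hypothetical names, and plain `String` keys stand in for whatever key type the cache actually uses; nothing here is taken from `NamedCache` itself.

```scala
import java.util.{TreeMap => JTreeMap}
import scala.collection.JavaConverters._

object KeyOrderSketch {
  // Hypothetical cache contents, inserted out of order on purpose.
  def sampleCache(): JTreeMap[String, Long] = {
    val cache = new JTreeMap[String, Long]()
    cache.put("b", 2L)
    cache.put("a", 1L)
    cache.put("c", 3L)
    cache
  }

  /** Keys in ascending (natural) order, via TreeMap.navigableKeySet(). */
  def ascendingKeys(cache: JTreeMap[String, Long]): List[String] =
    cache.navigableKeySet().iterator().asScala.toList

  /** Keys in descending order, via TreeMap.descendingKeySet(). */
  def descendingKeys(cache: JTreeMap[String, Long]): List[String] =
    cache.descendingKeySet().iterator().asScala.toList
}
```

Either view is backed by the map, so a wrapper method that exposes one of them would see later insertions without copying the key set.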
[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores
[ https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621 ]

Richard Yu edited comment on KAFKA-4499 at 10/27/17 4:56 AM:
-------------------------------------------------------------

I have realized that this particular way is flawed and should be reconsidered. As I understand it, there is another approach, which uses methods provided by TreeMap:
ThreadCache is used as a general storage area, and would be used to access specific NamedCaches, with each CachingWindowStore being assigned one name. ThreadCache has a method defined as all(String namespace) which could return all keys present in the specified NamedCache. However, that does not necessarily mean the keys would be sorted. Instead, we could use TreeMap's descendingKeySet() method to get all keys in ascending order. This would require defining a new method in NamedCache that calls descendingKeySet(), but in effect it would give us a sorted set from which to operate.

was (Author: yohan123):
Some information is currently lacking when it comes to fetchAll(), as the keys associated with the timeFrom and timeTo parameters are missing, so I decided to use a substitute. If permissible, a TreeMap or HashMap can be added so that the timestamps and their corresponding keys (byte arrays) can be recorded and retrieved.
[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store
[ https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-5668:
---------------------------------
    Labels: bug  (was: )

> queryable state window store range scan only returns results from one store
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-5668
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5668
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>              Labels: bug
>             Fix For: 1.0.0
>
>
> Queryable state range scans return results from all local stores for simple
> key-value stores.
> For windowed stores however it only returns results from a single local store.
> The expected behavior is that windowed store range scans would return results
> from all local stores as it does for simple key-value stores.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store
[ https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-5668:
---------------------------------
    Fix Version/s: 1.0.0
[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store
[ https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-5668:
---------------------------------
    Affects Version/s: 0.11.0.1
[jira] [Commented] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores
[ https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621 ]

Richard Yu commented on KAFKA-4499:
-----------------------------------

Some information is currently lacking when it comes to fetchAll(), as the keys associated with the timeFrom and timeTo parameters are missing, so I decided to use a substitute. If permissible, a TreeMap or HashMap can be added so that the timestamps and their corresponding keys (byte arrays) can be recorded and retrieved.
[jira] [Updated] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores
[ https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Yu updated KAFKA-4499:
------------------------------
    Attachment: 4499-CachingWindowStore-v1.patch

This is my current prototype for the fetchAll() and all() methods. The fetchAll() method still might require some thought.
[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221596#comment-16221596 ]

ASF GitHub Bot commented on KAFKA-6134:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/4141

    KAFKA-6134: Read partition reassignment lazily on event handling

    This patch prevents an O(n^2) increase in memory utilization during partition reassignment. Instead of storing the reassigned partitions in the `PartitionReassignment` object (which is added after every partition reassignment), we read the data fresh from ZK when processing the event.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-6134

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4141

commit 5131bb19f6fe7fc1939035c48ead052a0ac967a4
Author: Jason Gustafson
Date:   2017-10-27T02:01:05Z

    KAFKA-6134: Read partition reassignment lazily on event handling
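The idea in the pull request description, queueing an event that carries no data and reading the reassignment state only when the event is handled, might look roughly like the sketch below. It is not the patch itself: `ReassignmentStore` is a hypothetical in-memory stand-in for the ZooKeeper node, and `handle` is an invented handler; only the name `PartitionReassignment` comes from the message above.

```scala
object LazyReassignmentSketch {
  // Hypothetical in-memory stand-in for the reassignment node in ZooKeeper.
  final class ReassignmentStore(initial: Map[String, Seq[Int]]) {
    private var pending: Map[String, Seq[Int]] = initial
    def read(): Map[String, Seq[Int]] = synchronized(pending)
    def remove(partition: String): Unit = synchronized { pending -= partition }
  }

  // The queued event is only a marker: it holds no copy of the reassignment data.
  case object PartitionReassignment

  /** Handles one marker event by reading the store at processing time. */
  def handle(event: PartitionReassignment.type, store: ReassignmentStore): Seq[String] = {
    val toProcess = store.read().keys.toSeq.sorted
    // In the real system each removal would rewrite the node and fire the watch again;
    // the extra marker events that result are cheap because they carry no payload.
    toProcess.foreach(store.remove)
    toProcess
  }
}
```

A later marker event that arrives after the work is done simply reads an empty store and does nothing, so repeated watch notifications cost a constant amount of queue memory each.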
[jira] [Updated] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently
[ https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-6131:
-------------------------------
    Affects Version/s: 0.11.0.1

> Transaction markers are sometimes discarded if txns complete concurrently
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-6131
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6131
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.1, 1.0.0
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>             Fix For: 1.0.0, 0.11.0.2, 1.1.0
>
>
> Concurrent tests being added under KAFKA-6096 for the transaction coordinator
> fail to complete some transactions when multiple transactions are completed
> concurrently.
> The problem is with the following code snippet. There are two very similar
> uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test
> fails because some transaction markers are discarded. {{getOrElseUpdate}} on
> Scala maps is not atomic. The test passes consistently with one thread.
> {quote}
> val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
> val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new TxnMarkerQueue(broker))
> {quote}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
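When a get-then-update lookup is not atomic, two threads can each create a queue for the same broker, and whatever was enqueued on the losing instance is lost. One common way to avoid that, shown here only as a sketch and not necessarily how the Kafka patch addressed it, is to insert with `ConcurrentHashMap.putIfAbsent` and always use the instance that actually ended up in the map. `AtomicQueueLookup`, `queueFor`, and this one-field `TxnMarkerQueue` are hypothetical stand-ins for illustration.

```scala
import java.util.concurrent.ConcurrentHashMap

object AtomicQueueLookup {
  // Hypothetical stand-in for the real queue type.
  final class TxnMarkerQueue(val brokerId: Int)

  private val markersQueuePerBroker = new ConcurrentHashMap[Int, TxnMarkerQueue]()

  /** Returns the single queue registered for brokerId, creating it at most once per key. */
  def queueFor(brokerId: Int): TxnMarkerQueue = {
    val existing = markersQueuePerBroker.get(brokerId)
    if (existing != null) existing
    else {
      val created = new TxnMarkerQueue(brokerId)
      // putIfAbsent is atomic: it returns null if our value was stored,
      // or the value another thread stored first.
      val winner = markersQueuePerBroker.putIfAbsent(brokerId, created)
      if (winner != null) winner else created
    }
  }
}
```

A losing thread may still construct a queue that is thrown away, but every caller ends up holding the same registered instance, so nothing is enqueued on an orphaned queue.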
[jira] [Commented] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently
[ https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221583#comment-16221583 ]

ASF GitHub Bot commented on KAFKA-6131:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/4140
[jira] [Updated] (KAFKA-6135) TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException
[ https://issues.apache.org/jira/browse/KAFKA-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated KAFKA-6135:
--------------------------
    Attachment: 6135.out

Test output.
[jira] [Created] (KAFKA-6135) TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException
Ted Yu created KAFKA-6135:
-----------------------------

             Summary: TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException
                 Key: KAFKA-6135
                 URL: https://issues.apache.org/jira/browse/KAFKA-6135
             Project: Kafka
          Issue Type: Bug
            Reporter: Ted Yu


From https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/2293/testReport/junit/kafka.api/TransactionsTest/testFencingOnCommit/ :
{code}
org.scalatest.junit.JUnitTestFailedError: Got an unexpected exception from a fenced producer.
	at org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException(AssertionsForJUnit.scala:100)
	at org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException$(AssertionsForJUnit.scala:99)
	at org.scalatest.junit.JUnitSuite.newAssertionFailedException(JUnitSuite.scala:71)
	at org.scalatest.Assertions.fail(Assertions.scala:1105)
	at org.scalatest.Assertions.fail$(Assertions.scala:1101)
	at org.scalatest.junit.JUnitSuite.fail(JUnitSuite.scala:71)
	at kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:319)
...
Caused by: org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state
	at org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:782)
	at org.apache.kafka.clients.producer.internals.TransactionManager.beginCommit(TransactionManager.java:220)
	at org.apache.kafka.clients.producer.KafkaProducer.commitTransaction(KafkaProducer.java:617)
	at kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:313)
	... 48 more
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
{code}
Confirmed with [~apurva] that the above would not be covered by his fix for KAFKA-6119

Temporarily marking this as bug.
[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores
[ https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221561#comment-16221561 ]

Richard Yu edited comment on KAFKA-4499 at 10/27/17 1:38 AM:
-------------------------------------------------------------

I would just like to seek confirmation on something: from my understanding, for the fetch() methods found in ReadOnlyWindowStore, the keys must be sorted when returned. Does that necessarily have to be the case with the new methods as well? If so, is it pre- or post-serialization?

was (Author: yohan123):
I would just like to seek confirmation on something: for the fetch() methods found in ReadOnlyWindowStore, the keys must be sorted when returned. Does that necessarily have to be the case with the new methods as well?
[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6134: --- Labels: regression (was: ) > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Critical > Labels: regression > Fix For: 1.0.0, 0.11.0.2 > > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. 
> {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
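The quadratic growth described in this issue can be illustrated with a small model. This is only a sketch under simplifying assumptions: `ReassignmentEvent` and `QuadraticQueueModel` are hypothetical names rather than Kafka classes, every write to the reassignment node is assumed to fire the data-change handler exactly once, and the model counts map entries held by queued events instead of measuring heap usage.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the controller event; not Kafka's actual class.
final case class ReassignmentEvent(remaining: Map[String, Seq[Int]])

object QuadraticQueueModel {
  // Returns (number of queued events, total map entries held by those events)
  // after the handler has worked through an initial reassignment of n partitions.
  def simulate(n: Int): (Int, Long) = {
    val queue = mutable.Queue.empty[ReassignmentEvent]
    var remaining: Map[String, Seq[Int]] =
      (0 until n).map(p => s"topic-$p" -> Seq(1, 2, 3)).toMap

    for (p <- 0 until n) {
      // Remove the partition whose reassignment has completed ...
      remaining = remaining - s"topic-$p"
      // ... and model the write to ZooKeeper re-triggering the watch, which
      // enqueues a new event carrying the whole remaining mapping.
      queue.enqueue(ReassignmentEvent(remaining))
    }

    val retainedEntries = queue.iterator.map(_.remaining.size.toLong).sum
    (queue.size, retainedEntries)
  }
}
```

Under these assumptions, a reassignment of n partitions leaves n events in the queue holding n(n-1)/2 entries in total, which is the n^2 growth the description refers to.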
[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6134: --- Fix Version/s: 0.11.0.2 1.0.0 > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Critical > Labels: regression > Fix For: 1.0.0, 0.11.0.2 > > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. 
> {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics
[ https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221559#comment-16221559 ] Jeff Klukas commented on KAFKA-3073: Created a KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-215%3A+Add+topic+regex+support+for+Connect+sinks > KafkaConnect should support regular expression for topics > - > > Key: KAFKA-3073 > URL: https://issues.apache.org/jira/browse/KAFKA-3073 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect >Reporter: Gwen Shapira >Assignee: Liquan Pei > Labels: needs-kip > > KafkaConsumer supports both a list of topics or a pattern when subscribing. > KafkaConnect only supports a list of topics, which is not just more of a > hassle to configure - it also requires more maintenance. > I suggest adding topics.pattern as a new configuration and letting users > choose between 'topics' or 'topics.pattern'. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics
[ https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221528#comment-16221528 ] Ewen Cheslack-Postava commented on KAFKA-3073: -- We actually already ended up with some uses of Pattern for configs in transformations. IIRC, I raised this point when we were discussing those, but not promising full Pattern support and instead guaranteeing only common regex functionality is a reasonable tradeoff. 99% of the time people will just be using very basic functionality anyway. > KafkaConnect should support regular expression for topics > - > > Key: KAFKA-3073 > URL: https://issues.apache.org/jira/browse/KAFKA-3073 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect >Reporter: Gwen Shapira >Assignee: Liquan Pei > Labels: needs-kip > > KafkaConsumer supports both a list of topics or a pattern when subscribing. > KafkaConnect only supports a list of topics, which is not just more of a > hassle to configure - it also requires more maintenance. > I suggest adding topics.pattern as a new configuration and letting users > choose between 'topics' or 'topics.pattern'. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics
[ https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221509#comment-16221509 ] Ismael Juma commented on KAFKA-3073: If we want to support regex in the protocol, we need a cross-platform solution. The best option I've seen so far is: https://github.com/google/re2 https://github.com/google/re2j > KafkaConnect should support regular expression for topics > - > > Key: KAFKA-3073 > URL: https://issues.apache.org/jira/browse/KAFKA-3073 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect >Reporter: Gwen Shapira >Assignee: Liquan Pei > Labels: needs-kip > > KafkaConsumer supports both a list of topics or a pattern when subscribing. > KafkaConnect only supports a list of topics, which is not just more of a > hassle to configure - it also requires more maintenance. > I suggest adding topics.pattern as a new configuration and letting users > choose between 'topics' or 'topics.pattern'. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics
[ https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221506#comment-16221506 ] Jeff Klukas commented on KAFKA-3073: Since a significant segment of the Kafka user community come from a Java background, I would expect many users would be most comfortable with Java's regex library over PCREs. It seems reasonable to me to start there (since the Java consumer already includes a subscribe() variant that takes a Java regex Pattern) and leave it to a separate effort to find a suitable PCRE library and add a new subscribe() variant. If we add PCRE support in the future, we could control which type of regex is used via an additional 'topics.pattern.type' configuration option. > KafkaConnect should support regular expression for topics > - > > Key: KAFKA-3073 > URL: https://issues.apache.org/jira/browse/KAFKA-3073 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect >Reporter: Gwen Shapira >Assignee: Liquan Pei > Labels: needs-kip > > KafkaConsumer supports both a list of topics or a pattern when subscribing. > KafkaConnect only supports a list of topics, which is not just more of a > hassle to configure - it also requires more maintenance. > I suggest adding topics.pattern as a new configuration and letting users > choose between 'topics' or 'topics.pattern'. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
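The choice between 'topics' and 'topics.pattern' discussed in this thread could be sketched roughly as follows. This is an illustration only, not Connect's implementation: `TopicSelection.select` is a hypothetical helper, the config is modelled as a plain `Map`, and the pattern is interpreted with `java.util.regex.Pattern`, as the comment above suggests starting with.

```scala
import java.util.regex.Pattern

object TopicSelection {
  // Hypothetical helper: exactly one of 'topics' / 'topics.pattern' may be set.
  // `available` stands in for the topic names currently known to the cluster.
  def select(config: Map[String, String],
             available: Seq[String]): Either[String, Seq[String]] = {
    val list  = config.get("topics").map(_.trim).filter(_.nonEmpty)
    val regex = config.get("topics.pattern").map(_.trim).filter(_.nonEmpty)

    (list, regex) match {
      case (Some(_), Some(_)) =>
        Left("set either 'topics' or 'topics.pattern', not both")
      case (None, None) =>
        Left("one of 'topics' or 'topics.pattern' is required")
      case (Some(l), None) =>
        // Explicit comma-separated list: used as given.
        Right(l.split(",").map(_.trim).filter(_.nonEmpty).toList)
      case (None, Some(r)) =>
        // Java regex semantics; Pattern.compile throws PatternSyntaxException
        // for an invalid expression, which this sketch does not catch.
        val pattern = Pattern.compile(r)
        Right(available.filter(t => pattern.matcher(t).matches()))
    }
  }
}
```

Treating the two options as mutually exclusive keeps the configuration unambiguous. A later 'topics.pattern.type' option, as floated above, would only change how the pattern string is compiled.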
[jira] [Assigned] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson reassigned KAFKA-6134: -- Assignee: Jason Gustafson > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. > {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. 
So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221443#comment-16221443 ] Onur Karaman edited comment on KAFKA-6134 at 10/26/17 11:28 PM: If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think you can just lazily read the reassignment state upon actually processing the PartitionReassignment instead of providing one as part of the PartitionReassignment instance so that you'd only have one partition reassignment mapping allocated at any point in time. was (Author: onurkaraman): If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think you can just lazily read the reassignment state upon actually processing the PartitionReassignment so that you'd only have one partition reassignment mapping allocated at any point in time. > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. 
We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. > {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
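The lazy-read approach suggested in the comment above might look roughly like the following. It is a minimal sketch, not the actual patch: `ReassignmentStore` is a hypothetical in-memory stand-in for the ZooKeeper node, `LazyController` is an invented name, and the event is modelled as a payload-free `case object`, so queued events hold no copy of the mapping.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the ZooKeeper node holding the pending reassignment.
final class ReassignmentStore(initial: Map[String, Seq[Int]]) {
  private var data: Map[String, Seq[Int]] = initial
  def read(): Map[String, Seq[Int]] = data
  def write(updated: Map[String, Seq[Int]]): Unit = { data = updated }
}

// The event carries no payload, so a queued instance retains no mapping.
case object PartitionReassignment

final class LazyController(store: ReassignmentStore) {
  val queue = mutable.Queue.empty[PartitionReassignment.type]

  // Data-change handler: enqueue only a marker instead of the parsed data.
  def handleDataChange(): Unit = queue.enqueue(PartitionReassignment)

  // Process one event: the reassignment state is read here, at processing
  // time, so only one mapping is live at any point.
  def processNext(): Unit = {
    queue.dequeue()
    for (partition <- store.read().keys) {
      // Complete the partition, write the shrunken mapping back, and model
      // the write re-triggering the watch.
      store.write(store.read() - partition)
      handleDataChange()
    }
  }
}
```

In this model the queue can still grow to n markers, but each marker is a shared singleton, and events processed after the mapping is empty find nothing left to do.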
[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221443#comment-16221443 ] Onur Karaman commented on KAFKA-6134: - If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think you can just lazily read the reassignment state upon actually processing the PartitionReassignment so that you'd only have one partition reassignment mapping allocated at any point in time. > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. 
> {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes
[ https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221441#comment-16221441 ] Roger Hoover edited comment on KAFKA-6129 at 10/26/17 11:25 PM: What did you configure for advertised.listeners? My guess is that the endpoints returned from the initial metadata request are not resolvable (just guess though) was (Author: theduderog): What did you configure for advertised.listeners? My guess is that the endpoints returned from the initial metadata request are not resolvable. > kafka issue when exposing through nodeport in kubernetes > > > Key: KAFKA-6129 > URL: https://issues.apache.org/jira/browse/KAFKA-6129 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.10.2.1 > Environment: kubernetes >Reporter: Francesco vigotti >Priority: Critical > > I've started writing in this issue: > https://issues.apache.org/jira/browse/KAFKA-2729 > but then I'm going to open this new issue because I've probably found the > cause in my kubernetes setup, but In my opinion kubernetes did nothing wrong > in his setup ( and all other application works using the same nodeport > redirection , ie: zookeeper ) > kafka brokers fails , silently (randomly in multiple brokers setup) and with > a misleading error from producer so I think that Kafka should be improved, > providing more robust pre-startup flight-checks and identifying/reporting the > current issue > After further investigation from my reply here > https://issues.apache.org/jira/browse/KAFKA-2729 with a minimum size cluster > ( 1 zk + 1 kafka-broker ) I've found the problem, > the problem is with kubernetes, ( I don't know why this issue appeared only > now to me , if something changed in recent kube-proxy versions or in kafka > 0.10+ , or ... 
) > anyway, my old kafka cluster started being under-replicated and returning various > problems. > The problem happens when kubernetes pods are created and traffic is redirected using > a nodeport-service ( over a static ip in my case ) to expose kafka brokers > from the host; when using hostNetwork ( so no redirection ) everything > works. What is strange is that zookeeper works fine with nodeport ( > which creates a redirection rule in iptables->nat->prerouting ); the only > application I've found problems with in this kubernetes configuration is kafka. > What is weird is that kafka starts correctly without errors, but on multiple > broker clusters there are random issues, while on a single broker cluster the > console-producer fails with an infinite loop of: > ``` > [2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation > id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > [2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation > id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > [2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation > id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > ``` > Still no errors are reported from the broker or zookeeper. > Also I want to say that I've come across this discussion: > > https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer > > but the proposed solution for the host pod ( to allow self-resolving of the > advertised hostname ) didn't work: > ``` > hostAliases: > - ip: "127.0.0.1" > hostnames: > - "---myhosthostname---" > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes
[ https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221441#comment-16221441 ] Roger Hoover commented on KAFKA-6129: - What did you configure for advertised.listeners? My guess is that the endpoints returned from the initial metadata request are not resolvable. > kafka issue when exposing through nodeport in kubernetes > > > Key: KAFKA-6129 > URL: https://issues.apache.org/jira/browse/KAFKA-6129 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.10.2.1 > Environment: kubernetes >Reporter: Francesco vigotti >Priority: Critical > > I've started writing in this issue: > https://issues.apache.org/jira/browse/KAFKA-2729 > but then I'm going to open this new issue because I've probably found the > cause in my kubernetes setup, but In my opinion kubernetes did nothing wrong > in his setup ( and all other application works using the same nodeport > redirection , ie: zookeeper ) > kafka brokers fails , silently (randomly in multiple brokers setup) and with > a misleading error from producer so I think that Kafka should be improved, > providing more robust pre-startup flight-checks and identifying/reporting the > current issue > After further investigation from my reply here > https://issues.apache.org/jira/browse/KAFKA-2729 with a minimum size cluster > ( 1 zk + 1 kafka-broker ) I've found the problem, > the problem is with kubernetes, ( I don't know why this issue appeared only > now to me , if something changed in recent kube-proxy versions or in kafka > 0.10+ , or ... 
) > anyway, my old kafka cluster started being under-replicated and returning various > problems. > The problem happens when kubernetes pods are created and traffic is redirected using > a nodeport-service ( over a static ip in my case ) to expose kafka brokers > from the host; when using hostNetwork ( so no redirection ) everything > works. What is strange is that zookeeper works fine with nodeport ( > which creates a redirection rule in iptables->nat->prerouting ); the only > application I've found problems with in this kubernetes configuration is kafka. > What is weird is that kafka starts correctly without errors, but on multiple > broker clusters there are random issues, while on a single broker cluster the > console-producer fails with an infinite loop of: > ``` > [2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation > id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > [2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation > id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > [2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation > id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} > (org.apache.kafka.clients.NetworkClient) > ``` > Still no errors are reported from the broker or zookeeper. > Also I want to say that I've come across this discussion: > > https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer > > but the proposed solution for the host pod ( to allow self-resolving of the > advertised hostname ) didn't work: > ``` > hostAliases: > - ip: "127.0.0.1" > hostnames: > - "---myhosthostname---" > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221415#comment-16221415 ] Jason Gustafson commented on KAFKA-6134: [~onurkaraman] I noticed the change in trunk. Do you think the fix can be ported to 1.0? > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. 
> {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221392#comment-16221392 ] Onur Karaman commented on KAFKA-6134: - Yes I had noticed the O(N^2) behavior a while ago. I believe this should be mitigated after KAFKA-5642 since PartitionReassignment is now a case object instead of a case class containing the remaining reassignment mapping. > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. 
> {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson updated KAFKA-6134: --- Priority: Critical (was: Major) > High memory usage on controller during partition reassignment > - > > Key: KAFKA-6134 > URL: https://issues.apache.org/jira/browse/KAFKA-6134 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Jason Gustafson >Priority: Critical > Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png > > > We've had a couple users reporting spikes in memory usage when the controller > is performing partition reassignment in 0.11. After investigation, we found > that the controller event queue was using most of the retained memory. In > particular, we found several thousand {{PartitionReassignment}} objects, each > one containing one fewer partition than the previous one (see the attached > image). > From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. > {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper > > zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas)) > {code} > This triggers the handler above which adds a new event in the queue. 
So what > you get is an n^2 increase in memory where n is the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson updated KAFKA-6134:
---
Description:
We've had a couple users reporting spikes in memory usage when the controller is performing partition reassignment in 0.11. After investigation, we found that the controller event queue was using most of the retained memory. In particular, we found several thousand {{PartitionReassignment}} objects, each one containing one fewer partition than the previous one (see the attached image).

From the code, it seems clear why this is happening. We have a watch on the partition reassignment path which adds the {{PartitionReassignment}} object to the event queue:
{code}
override def handleDataChange(dataPath: String, data: Any): Unit = {
  val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
  eventManager.put(controller.PartitionReassignment(partitionReassignment))
}
{code}
In the {{PartitionReassignment}} event handler, we iterate through all of the partitions in the reassignment. After we complete reassignment for each partition, we remove that partition and update the node in zookeeper.
{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
// write the new list to zookeeper
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}
This triggers the handler above which adds a new event in the queue. So what you get is an n^2 increase in memory where n is the number of partitions.

was:
We've had a couple users reporting spikes in memory usage when the controller is performing partition reassignment in 0.11. After investigation, we found that the controller event queue was using most of the retained memory.
In particular, we found several thousand {{PartitionReassignment}} objects, each one containing one fewer partition than the previous one: !Screen Shot 2017-10-26 at 3.05.40 PM.png|thumbnail!.

From the code, it seems clear why this is happening. We have a watch on the partition reassignment path which adds the {{PartitionReassignment}} object to the event queue:
{code}
override def handleDataChange(dataPath: String, data: Any): Unit = {
  val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
  eventManager.put(controller.PartitionReassignment(partitionReassignment))
}
{code}
In the {{PartitionReassignment}} event handler, we iterate through all of the partitions in the reassignment. After we complete reassignment for each partition, we remove that partition and update the node in zookeeper.
{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
// write the new list to zookeeper
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}
This triggers the handler above which adds a new event in the queue. So what you get is an n^2 increase in memory where n is the number of partitions.

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.11.0.0, 0.11.0.1
> Reporter: Jason Gustafson
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller
> is performing partition reassignment in 0.11. After investigation, we found
> that the controller event queue was using most of the retained memory. In
> particular, we found several thousand {{PartitionReassignment}} objects, each
> one containing one fewer partition than the previous one (see the attached
> image).
> From the code, it seems clear why this is happening. We have a watch on the > partition reassignment path which adds the {{PartitionReassignment}} object > to the event queue: > {code} > override def handleDataChange(dataPath: String, data: Any): Unit = { > val partitionReassignment = > ZkUtils.parsePartitionReassignmentData(data.toString) > eventManager.put(controller.PartitionReassignment(partitionReassignment)) > } > {code} > In the {{PartitionReassignment}} event handler, we iterate through all of the > partitions in the reassignment. After we complete reassignment for each > partition, we remove that partition and update the node in zookeeper. > {code} > // remove this partition from that list > val updatedPartitionsBeingReassigned = partitionsBeingReassigned - > topicAndPartition > // write the new list to zookeeper >
[jira] [Created] (KAFKA-6134) High memory usage on controller during partition reassignment
Jason Gustafson created KAFKA-6134:
--------------------------------------

             Summary: High memory usage on controller during partition reassignment
                 Key: KAFKA-6134
                 URL: https://issues.apache.org/jira/browse/KAFKA-6134
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 0.11.0.0, 0.11.0.1
            Reporter: Jason Gustafson
         Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png

We've had a couple users reporting spikes in memory usage when the controller is performing partition reassignment in 0.11. After investigation, we found that the controller event queue was using most of the retained memory. In particular, we found several thousand {{PartitionReassignment}} objects, each one containing one fewer partition than the previous one: !Screen Shot 2017-10-26 at 3.05.40 PM.png|thumbnail!.

From the code, it seems clear why this is happening. We have a watch on the partition reassignment path which adds the {{PartitionReassignment}} object to the event queue:
{code}
override def handleDataChange(dataPath: String, data: Any): Unit = {
  val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
  eventManager.put(controller.PartitionReassignment(partitionReassignment))
}
{code}
In the {{PartitionReassignment}} event handler, we iterate through all of the partitions in the reassignment. After we complete reassignment for each partition, we remove that partition and update the node in zookeeper.
{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
// write the new list to zookeeper
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}
This triggers the handler above which adds a new event in the queue. So what you get is an n^2 increase in memory where n is the number of partitions.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
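
To make the n^2 claim in the report above concrete, here is a toy model in Java. It is not controller code: the class and method names are hypothetical, and it assumes that every completed partition enqueues one snapshot of the still-pending partitions and that nothing is dequeued in the meantime. Under those assumptions the queue retains (n-1) + (n-2) + ... + 0 = n(n-1)/2 partition entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model (hypothetical, not Kafka code) of the event-queue growth described in KAFKA-6134. */
final class ReassignmentQueueModel {

    /**
     * Simulates one pass of the reassignment handler over n partitions.
     * Each completed partition rewrites the reassignment node, and the watch
     * enqueues a snapshot of the partitions that are still pending.
     * Returns the total number of partition entries retained by the queue.
     */
    static long retainedEntries(int n) {
        Deque<List<Integer>> eventQueue = new ArrayDeque<>();
        List<Integer> pending = new ArrayList<>();
        for (int p = 0; p < n; p++) {
            pending.add(p);
        }
        while (!pending.isEmpty()) {
            pending.remove(pending.size() - 1);        // one partition finishes (removal by index)
            eventQueue.add(new ArrayList<>(pending));  // watch fires: snapshot of the remainder
        }
        long total = 0;
        for (List<Integer> snapshot : eventQueue) {
            total += snapshot.size();
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " partitions -> " + retainedEntries(n) + " retained entries");
        }
    }
}
```

In this model, multiplying the partition count by ten multiplies the retained entries by roughly a hundred, which fits the report of several thousand queued objects, each one partition smaller than the last.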
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221330#comment-16221330 ]

Jason Gustafson commented on KAFKA-6132:
----------------------------------------

[~pnowojski] Thanks for the update. If you're confident that it is fixed, perhaps you can close this issue?

> KafkaProducer.initTransactions dead locks
> -----------------------------------------
>
>                 Key: KAFKA-6132
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6132
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer
>    Affects Versions: 0.11.0.0, 0.11.0.1
>         Environment: Travis
>            Reporter: Piotr Nowojski
>            Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 transactions for writing. One of them is an apparent deadlock with the following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 nid=0x1260 waiting on condition [0x7f4b10fa4000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x947048a8> (a java.util.concurrent.CountDownLatch$Sync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     at org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>     at org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally, however I think I can semi-reliably reproduce it on Travis. The scenario includes a simultaneous sequence of instantiating new producers, calling {{KafkaProducer.initTransactions}}, and closing them, interleaved with writing. I have created a stripped down version of this scenario as a github project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test, and when a dead lock is detected (a 5 minute period of inactivity), the stack traces of all threads are printed out. Example travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds. It seems like in this scenario all of them are failing/dead locking in exactly the same way.
> I have observed this issue both on 0.11.0.0 and 0.11.0.1
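
The stack trace in the report shows the caller parked in {{CountDownLatch.await()}} with no timeout, so a lost completion shows up as a hang rather than an error. As a minimal sketch of how a test harness might bound such a wait, the following uses only {{java.util.concurrent}}. The class and the helper {{awaitOrReport}} are hypothetical names for illustration; they are not part of Kafka, and this is not the fix that was applied to the producer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical test helper: waits on a latch with a deadline instead of parking forever. */
final class BoundedAwait {

    /**
     * Waits for the latch for at most timeoutMs milliseconds.
     * Returns true if the latch reached zero, false if the wait timed out
     * or the waiting thread was interrupted.
     */
    static boolean awaitOrReport(CountDownLatch latch, long timeoutMs) {
        try {
            boolean completed = latch.await(timeoutMs, TimeUnit.MILLISECONDS);
            if (!completed) {
                System.err.println("latch not released after " + timeoutMs + " ms; possible hang");
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        System.out.println(awaitOrReport(neverReleased, 100)); // times out

        CountDownLatch released = new CountDownLatch(1);
        released.countDown();
        System.out.println(awaitOrReport(released, 100));      // latch is already at zero
    }
}
```

A bounded wait does not remove the underlying bug, but it turns the 5 minute inactivity detection described in the report into an explicit failure at the call site.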
[jira] [Commented] (KAFKA-6123) MetricsReporter does not get auto-generated client.id
[ https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221248#comment-16221248 ]

Kevin Lu commented on KAFKA-6123:
---------------------------------

[~guozhang] May I work on this? Does this require a KIP?

> MetricsReporter does not get auto-generated client.id
> -----------------------------------------------------
>
>                 Key: KAFKA-6123
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6123
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients, metrics
>    Affects Versions: 0.11.0.0
>            Reporter: Kevin Lu
>            Priority: Minor
>              Labels: clients, metrics, newbie++
>
> When a {{MetricsReporter}} is configured for a client, it will receive the user-specified configurations via {{Configurable.configure(Map<String, ?> configs)}}. Likewise, {{ProducerInterceptor}} and {{ConsumerInterceptor}} receive user-specified configurations in their configure methods.
> The difference is that when a user does not specify the {{client.id}} field, Kafka will auto-generate client ids (producer-1, producer-2, consumer-1, consumer-2, etc). This auto-generated {{client.id}} will be passed into the interceptors' configure method, but it is not passed to the {{MetricsReporter}} configure method.
> This makes it harder to directly map {{MetricsReporter}} with the interceptors for the client when users do not specify the {{client.id}} field. The {{client.id}} can be determined by identifying a metric with the {{client.id}} tag, but this is hacky and requires traversal.
> It would be useful to have the auto-generated {{client.id}} field also passed to the {{MetricsReporter}}.
[jira] [Issue Comment Deleted] (KAFKA-6123) MetricsReporter does not get auto-generated client.id
[ https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Lu updated KAFKA-6123:
----------------------------
    Comment: was deleted

(was: [~guozhang] May I work on this? Does this require a KIP?)
[jira] [Commented] (KAFKA-6123) MetricsReporter does not get auto-generated client.id
[ https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221244#comment-16221244 ]

Kevin Lu commented on KAFKA-6123:
---------------------------------

[~guozhang] May I work on this? Does this require a KIP?
[jira] [Commented] (KAFKA-6100) Streams quick start crashes Java on Windows
[ https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221171#comment-16221171 ]

ASF GitHub Bot commented on KAFKA-6100:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/4136

> Streams quick start crashes Java on Windows
> -------------------------------------------
>
>                 Key: KAFKA-6100
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6100
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>         Environment: Windows 10 VM
>            Reporter: Vahid Hashemian
>            Assignee: Guozhang Wang
>             Fix For: 1.0.0
>
>         Attachments: Screen Shot 2017-10-20 at 11.53.14 AM.png, java.exe_171023_115335.dmp.zip
>
>
> *This issue was detected in 1.0.0 RC2.*
> The following step in streams quick start crashes Java on Windows 10:
> {{bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo}}
> I tracked this down to [this change|https://github.com/apache/kafka/commit/196bcfca0c56420793f85514d1602bde564b0651#diff-6512f838e273b79676cac5f72456127fR67], and it seems the new version of RocksDB is to blame. I tried the quick start with the previous version of RocksDB (5.7.3) and did not run into this issue.
[jira] [Resolved] (KAFKA-6100) Streams quick start crashes Java on Windows
[ https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang resolved KAFKA-6100.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.0

Issue resolved by pull request 4136
[https://github.com/apache/kafka/pull/4136]
[jira] [Assigned] (KAFKA-6100) Streams quick start crashes Java on Windows
[ https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang reassigned KAFKA-6100:
------------------------------------

    Assignee: Guozhang Wang
[jira] [Updated] (KAFKA-6133) NullPointerException in S3 Connector when using rotate.interval.ms
[ https://issues.apache.org/jira/browse/KAFKA-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elizabeth Bennett updated KAFKA-6133:
-------------------------------------
    Description:
I just tried out the new rotate.interval.ms feature in the S3 connector to do time based flushing. I am getting this NPE on every event:
{code}
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
    at io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
    at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
    at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[2017-10-20 23:21:35,233] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
{code}
I dug into the S3 connect code a bit and it looks like the {{rotate.interval.ms}} feature only works if you are using the TimeBasedPartitioner. It will get the TimestampExtractor class from the TimeBasedPartitioner to determine the timestamp of the event, and will use this for the time based flushing.

I'm using a custom partitioner, but I'd still really like to use the {{rotate.interval.ms}} feature, using wall clock time to determine the flushing behavior.

I'd be willing to work on fixing this issue, but I want to confirm it is actually a bug, and not that it was specifically designed to only work with the TimeBasedPartitioner. Even if it is the latter, it should probably not crash with an NPE.
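
The behaviour requested in KAFKA-6133 (rotate on wall-clock time when the partitioner supplies no timestamp extractor, and never dereference a missing one) can be sketched as follows. This is an illustration only, not the S3 connector's code: {{RotationCheck}}, its fields, and its constructor are hypothetical names, and the real {{TopicPartitionWriter}} may be structured quite differently.

```java
import java.util.function.LongSupplier;
import java.util.function.ToLongFunction;

/** Hypothetical rotation check; not the S3 connector's actual implementation. */
final class RotationCheck<R> {
    private final ToLongFunction<R> timestampExtractor; // null when the partitioner provides none
    private final LongSupplier wallClock;               // source of wall-clock milliseconds
    private final long rotateIntervalMs;
    private boolean windowOpen = false;
    private long windowStartMs;

    RotationCheck(ToLongFunction<R> timestampExtractor, LongSupplier wallClock, long rotateIntervalMs) {
        this.timestampExtractor = timestampExtractor;
        this.wallClock = wallClock;
        this.rotateIntervalMs = rotateIntervalMs;
    }

    /** Returns true when at least rotateIntervalMs has elapsed since the current window opened. */
    boolean shouldRotate(R record) {
        // Fall back to wall-clock time instead of dereferencing a missing extractor.
        long nowMs = (timestampExtractor != null)
                ? timestampExtractor.applyAsLong(record)
                : wallClock.getAsLong();
        if (!windowOpen) {
            windowOpen = true;
            windowStartMs = nowMs;
            return false;
        }
        if (nowMs - windowStartMs >= rotateIntervalMs) {
            windowStartMs = nowMs; // start the next window
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RotationCheck<String> check = new RotationCheck<>(null, System::currentTimeMillis, 60_000L);
        System.out.println(check.shouldRotate("first record")); // opens the window, no rotation yet
    }
}
```

Passing the clock in as a {{LongSupplier}} is a design choice made for this sketch so that the fallback path can be exercised with a fake clock in a unit test.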
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221118#comment-16221118 ]

Piotr Nowojski commented on KAFKA-6132:
---------------------------------------

Indeed problem seems to be solved with 1.0.0-rc3. Thanks!
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221035#comment-16221035 ]

Piotr Nowojski commented on KAFKA-6132:
---------------------------------------

Thanks, it worked. Build is under way, we will see an answer in 20-30 minutes.
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293308789
[jira] [Commented] (KAFKA-6048) Support negative record timestamps
[ https://issues.apache.org/jira/browse/KAFKA-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221014#comment-16221014 ]

Børge Svingen commented on KAFKA-6048:
--------------------------------------

@james.c: We are trying to resolve the same problem at The New York Times. Maybe we can work together on this?

> Support negative record timestamps
> ----------------------------------
>
>                 Key: KAFKA-6048
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6048
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients, core, streams
>    Affects Versions: 1.0.0
>            Reporter: Matthias J. Sax
>            Assignee: james chien
>              Labels: needs-kip
>
> Kafka does not support negative record timestamps, and this prevents the storage of historical data in Kafka. In general, negative timestamps are supported by UNIX system time stamps:
> From https://en.wikipedia.org/wiki/Unix_time
> {quote}
> The Unix time number is zero at the Unix epoch, and increases by exactly 86,400 per day since the epoch. Thus 2004-09-16T00:00:00Z, 12,677 days after the epoch, is represented by the Unix time number 12,677 × 86,400 = 1095292800. This can be extended backwards from the epoch too, using negative numbers; thus 1957-10-04T00:00:00Z, 4,472 days before the epoch, is represented by the Unix time number −4,472 × 86,400 = −386380800.
> {quote}
> Allowing for negative timestamps would require multiple changes:
>  - while brokers in general do support negative timestamps, brokers use {{-1}} as the default value if a producer uses an old message format (this would not be compatible with supporting negative timestamps "end-to-end" as {{-1}} cannot be used as "unknown" anymore): we could introduce a message flag indicating a missing timestamp (and let producer throw an exception if {{ConsumerRecord#timestamp()}} is called). Another possible solution might be to require topics that are used by old producers to be configured with {{LogAppendTime}} semantics and to reject writes to topics with {{CreateTime}} semantics for older message formats
>  - {{KafkaProducer}} does not allow sending records with a negative timestamp, and thus this would need to be fixed
>  - Streams API does drop records with negative timestamps (or fails by default) -- also, some internal store implementations for windowed stores assume that there are no negative timestamps when doing range queries
> There might be other gaps we need to address. This is just a summary of the issues coming to my mind atm.
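
The arithmetic in the quoted Wikipedia passage, and the reason {{-1}} stops working as an "unknown" marker once negative timestamps are legal, can be checked with {{java.time}}. This is a small sketch added for illustration; the class and method names are hypothetical and nothing here is Kafka API.

```java
import java.time.Instant;
import java.time.LocalDate;

/** Illustration of Unix time before and after the epoch (hypothetical class, not Kafka code). */
final class NegativeTimestamps {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Unix time in seconds at 00:00:00 UTC on the given calendar date. */
    static long unixSeconds(int year, int month, int day) {
        // toEpochDay() counts days relative to 1970-01-01 and is negative before it.
        return LocalDate.of(year, month, day).toEpochDay() * SECONDS_PER_DAY;
    }

    /** The instant that a timestamp of -1 milliseconds denotes when read as a real time. */
    static String minusOneMillisAsInstant() {
        return Instant.ofEpochMilli(-1L).toString();
    }

    public static void main(String[] args) {
        System.out.println(unixSeconds(2004, 9, 16));   // 1095292800
        System.out.println(unixSeconds(1957, 10, 4));   // -386380800
        System.out.println(minusOneMillisAsInstant());  // 1969-12-31T23:59:59.999Z
    }
}
```

The last line is the crux of the first bullet in the issue: if negative values are valid, {{-1}} is a legitimate instant one millisecond before the epoch, so it can no longer double as the "no timestamp" sentinel.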
[jira] [Comment Edited] (KAFKA-6048) Support negative record timestamps
[ https://issues.apache.org/jira/browse/KAFKA-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221014#comment-16221014 ]

Børge Svingen edited comment on KAFKA-6048 at 10/26/17 6:55 PM:

[~james.c]: We are trying to resolve the same problem at The New York Times. Maybe we can work together on this?

was (Author: bsvingen):
@james.c: We are trying to resolve the same problem at The New York Times. Maybe we can work together on this?
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221009#comment-16221009 ]

Ted Yu commented on KAFKA-6132:
---

Add this to your pom.xml:
{code}
<repository>
  <id>staging</id>
  <url>https://repository.apache.org/content/groups/staging/</url>
</repository>
{code}
and specify 1.0.0 as the Kafka version.

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
> Issue Type: Bug
> Components: producer
> Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
> Reporter: Piotr Nowojski
> Priority: Critical
>
> I have found some intermittent failures on Travis when using Kafka 0.11 transactions for writing. One of them is an apparent deadlock with the following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 nid=0x1260 waiting on condition [0x7f4b10fa4000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x947048a8> (a java.util.concurrent.CountDownLatch$Sync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     at org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>     at org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably reproduce it on Travis. The scenario includes simultaneous sequences of instantiating new producers, calling {{KafkaProducer.initTransactions}}, and closing them, interleaved with writing. I have created a stripped-down version of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; if a deadlock is detected (a 5-minute period of inactivity), the stack traces of all threads are printed. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds. It seems that in this scenario all of them fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221003#comment-16221003 ]

Piotr Nowojski commented on KAFKA-6132:
---

Could you give me a hint on how to access this RC from the staging repository?
[jira] [Created] (KAFKA-6133) NullPointerException in S3 Connector when using rotate.interval.ms
Elizabeth Bennett created KAFKA-6133:

Summary: NullPointerException in S3 Connector when using rotate.interval.ms
Key: KAFKA-6133
URL: https://issues.apache.org/jira/browse/KAFKA-6133
Project: Kafka
Issue Type: Bug
Components: KafkaConnect
Affects Versions: 0.11.0.0
Reporter: Elizabeth Bennett

I just tried out the new rotate.interval.ms feature in the S3 connector to do time-based flushing. I am getting this NPE on every event:

{{[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
    at io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
    at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
    at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[2017-10-20 23:21:35,233] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)}}

I dug into the S3 connector code a bit, and it looks like the {{rotate.interval.ms}} feature only works if you are using the TimeBasedPartitioner. It gets the TimestampExtractor class from the TimeBasedPartitioner to determine the timestamp of the event, and uses this for the time-based flushing. I'm using a custom partitioner, but I'd still really like to use the {{rotate.interval.ms}} feature, using wall-clock time to determine the flushing behavior. I'd be willing to work on fixing this issue, but I want to confirm that it is actually a bug, and not that it was specifically designed to work only with the TimeBasedPartitioner. Even if it is the latter, it should probably not crash with an NPE.
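The behavior requested above (fall back to wall-clock time when no timestamp extractor is available, rather than dereferencing null) can be sketched as follows. This is not the S3 connector's code: `RotationTimer`, `TimestampSource` and every method name here are hypothetical stand-ins chosen for illustration, and the real `TopicPartitionWriter` may be structured quite differently.

```java
import java.util.function.LongSupplier;

public class RotationTimer {
    /** Hypothetical stand-in for a record-timestamp extractor; may be absent. */
    public interface TimestampSource {
        long extract(Object record);
    }

    private final TimestampSource source;   // null when the partitioner supplies none
    private final LongSupplier wallClock;   // e.g. System::currentTimeMillis
    private final long rotateIntervalMs;
    private boolean started = false;
    private long windowStartMs;

    public RotationTimer(TimestampSource source, LongSupplier wallClock, long rotateIntervalMs) {
        this.source = source;
        this.wallClock = wallClock;
        this.rotateIntervalMs = rotateIntervalMs;
    }

    // Use the record timestamp when an extractor exists, otherwise wall-clock time.
    public long timestampFor(Object record) {
        return source != null ? source.extract(record) : wallClock.getAsLong();
    }

    // Returns true when at least rotateIntervalMs has elapsed since the window began.
    public boolean shouldRotate(Object record) {
        long now = timestampFor(record);
        if (!started) {
            started = true;
            windowStartMs = now;
            return false;
        }
        if (now - windowStartMs >= rotateIntervalMs) {
            windowStartMs = now;   // start the next window
            return true;
        }
        return false;
    }
}
```

Injecting the clock as a `LongSupplier` keeps the sketch testable; production code would presumably pass `System::currentTimeMillis`.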
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220986#comment-16220986 ]

Ted Yu commented on KAFKA-6132:
---

The RC is in the Maven staging repo.
[jira] [Comment Edited] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220976#comment-16220976 ]

Piotr Nowojski edited comment on KAFKA-6132 at 10/26/17 6:39 PM:
-

I don't know. This fix is not in any released version so I could easily to it in pom?

edit: I would have to include 1.0.0-rc3 somehow into my travis build. I will try to think something out.

was (Author: pnowojski):
I don't know. This fix is not in any released version so I could easily to it in pom?
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220976#comment-16220976 ]

Piotr Nowojski commented on KAFKA-6132:
---

I don't know. This fix is not in any released version so I could easily to it in pom?
[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220951#comment-16220951 ]

Ted Yu commented on KAFKA-6132:
---

Does KAFKA-6042: 'Avoid deadlock between two groups with delayed operations' solve the deadlock ?
[jira] [Commented] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently
[ https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220946#comment-16220946 ]

ASF GitHub Bot commented on KAFKA-6131:
---

GitHub user rajinisivaram opened a pull request:

    https://github.com/apache/kafka/pull/4140

    KAFKA-6131: Use atomic putIfAbsent to create txn marker queues

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajinisivaram/kafka KAFKA-6131-txn-concurrentmap

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4140.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4140

commit b80fcb5bfc8cb70f270eb558ed6267dca99d1e08
Author: Rajini Sivaram
Date: 2017-10-26T17:45:47Z

    KAFKA-6131: Use atomic putIfAbsent to create txn marker queues

> Transaction markers are sometimes discarded if txns complete concurrently
> -
>
> Key: KAFKA-6131
> URL: https://issues.apache.org/jira/browse/KAFKA-6131
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.0.0
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
>
> Concurrent tests being added under KAFKA-6096 for the transaction coordinator fail to complete some transactions when multiple transactions are completed concurrently.
> The problem is in the following code snippet. There are two very similar uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test fails because some transaction markers are discarded. {{getOrElseUpdate}} on Scala maps is not atomic. The test passes consistently with one thread.
> {quote}
> val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
> val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new TxnMarkerQueue(broker))
> {quote}
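The race described in the issue, and the `putIfAbsent` approach named in the pull request title, can be illustrated with plain `java.util.concurrent` types. This is a sketch and not the actual patch: `MarkerQueues`, the `Queue<String>` element type and both method names are invented for illustration, and the racy method only models the get-then-put behavior that the issue attributes to `getOrElseUpdate`.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

public class MarkerQueues {
    private final ConcurrentMap<Integer, Queue<String>> queuesPerBroker =
            new ConcurrentHashMap<>();

    // Racy check-then-act: two threads can both observe null, both create a
    // queue, and the later put() replaces the earlier queue. Anything already
    // added to the replaced queue is then unreachable through the map.
    public Queue<String> racyQueueFor(int brokerId) {
        Queue<String> queue = queuesPerBroker.get(brokerId);
        if (queue == null) {
            queue = new ConcurrentLinkedQueue<>();
            queuesPerBroker.put(brokerId, queue);
        }
        return queue;
    }

    // Atomic alternative: putIfAbsent installs the new queue only if no mapping
    // exists, and otherwise returns the queue that another thread installed.
    public Queue<String> queueFor(int brokerId) {
        Queue<String> fresh = new ConcurrentLinkedQueue<>();
        Queue<String> existing = queuesPerBroker.putIfAbsent(brokerId, fresh);
        return existing != null ? existing : fresh;
    }
}
```

With `queueFor`, every caller for a given broker id ends up holding the same queue instance, so no markers are stranded in a queue that lost the race. The cost is an occasional discarded empty queue when a mapping already exists.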
[jira] [Updated] (KAFKA-6132) KafkaProducer.initTransactions dead locks
[ https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Nowojski updated KAFKA-6132: -- Description: I have found some intermittent failures on travis when using Kafka 0.11 transactions for writing. One of them is a apparent deadlock with the following stack trace: {code:java} "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 nid=0x1260 waiting on condition [0x7f4b10fa4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x947048a8> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) at org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50) at org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537) {code} I was unsuccessful to reproduce it locally, however I think I can semi reliably reproduce it on Travis. Scenario includes simultaneous sequence of instantiating new producers, calling {{KafkaProducer.initTransactions}}, closing them interleaved with writing. 
I have created a stripped down version of this scenario as a github project: https://github.com/pnowojski/kafka-init-deadlock The code for the test scenario is here: https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java I have defined 30 build profiles that run this test and in case of detecting a dead lock (5 minutes period of inactivity), stack trace of all threads is being printed out. Example travis run: https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284 as you can see deadlock occurred in 7 out of 30 builds. It seems like in this scenario all of them are failing/dead locking in exactly same way. I have observed this issue both on 0.11.0.0 and 0.11.0.1 was: I have found some intermittent failures on travis when using Kafka 0.11 transactions for writing. One of them is a apparent deadlock with the following stack trace: {code:java} "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 nid=0x1260 waiting on condition [0x7f4b10fa4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x947048a8> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) at org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50) at org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537) {code} I was unsuccessful to reproduce it locally, however I think I can semi reliably 
reproduce it on Travis. Scenario includes simultaneous sequence of instantiating new producers, calling {{KafkaProducer.initTransactions}}, closing them interleaved with writing. I have created a stripped down version of this scenario as a github project: https://github.com/pnowojski/kafka-init-deadlock The code for test scenario is here: https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java I have defined 30 build profiles that run this test and in case of detecting a dead lock (5 minutes period of inactivity) stack trace of all threads is being printed out. Example travis run: https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284 as you can see deadlock occurred in 7 out of 30 builds. It seems like in this scenario all of them are failing/dead locking in exactly same way. I have observed this issue both on 0.11.0.0 and 0.11.0.1 > KafkaProducer.initTransactions dead locks > - > > Key: KAFKA-6132 > URL: https://issues.apache.org/jira/browse/KAFKA-6132 >
[jira] [Created] (KAFKA-6132) KafkaProducer.initTransactions dead locks
Piotr Nowojski created KAFKA-6132:
-------------------------------------

             Summary: KafkaProducer.initTransactions dead locks
                 Key: KAFKA-6132
                 URL: https://issues.apache.org/jira/browse/KAFKA-6132
             Project: Kafka
          Issue Type: Bug
          Components: producer
    Affects Versions: 0.11.0.0, 0.11.0.1
         Environment: Travis
            Reporter: Piotr Nowojski
            Priority: Critical

I have found some intermittent failures on Travis when using Kafka 0.11 transactions for writing. One of them is an apparent deadlock with the following stack trace:

{code:java}
"KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 nid=0x1260 waiting on condition [0x7f4b10fa4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x947048a8> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
	at org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
	at org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
{code}

I was unable to reproduce it locally; however, I think I can semi-reliably reproduce it on Travis. The scenario is a simultaneous sequence of instantiating new producers, calling {{KafkaProducer.initTransactions}}, and closing them, interleaved with writing.

I have created a stripped-down version of this scenario as a GitHub project:
https://github.com/pnowojski/kafka-init-deadlock

The code for the test scenario is here:
https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java

I have defined 30 build profiles that run this test. When a deadlock is detected (a 5-minute period of inactivity), the stack traces of all threads are printed out. Example Travis run:
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284

As you can see, the deadlock occurred in 7 out of 30 builds. In this scenario all of them seem to fail (deadlock) in exactly the same way.

I have observed this issue on both 0.11.0.0 and 0.11.0.1.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
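The stack trace in KAFKA-6132 shows the caller parked in {{CountDownLatch.await()}} with no timeout, so a hung call blocks its thread indefinitely. A test harness can bound that wait by running the call on a helper thread, roughly as below. This is only a minimal sketch and not the code of the linked test project: {{BoundedWaitSketch}}, {{completesWithin}} and {{hangingCallCompletes}} are hypothetical names, and a latch that is never released stands in for the blocked producer call. Whether a real producer is still usable after being interrupted is outside the scope of this sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWaitSketch {
    // Runs a potentially blocking call on a helper thread and reports
    // whether it finished within the given time (hypothetical helper).
    static boolean completesWithin(Runnable blockingCall, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(blockingCall);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // still blocked: treat as a suspected hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow(); // interrupts the helper thread if it is still waiting
        }
    }

    // Stand-in for the blocked call: waits on a latch that nobody counts down.
    static boolean hangingCallCompletes() {
        CountDownLatch neverReleased = new CountDownLatch(1);
        return completesWithin(() -> {
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 200, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        System.out.println("hanging call completed: " + hangingCallCompletes());
        System.out.println("trivial call completed: "
                + completesWithin(() -> { }, 1, TimeUnit.SECONDS));
    }
}
```

The trace goes through {{acquireSharedInterruptibly}}, so the wait itself responds to interruption; that is why the sketch relies on {{shutdownNow()}} to release the helper thread.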
[jira] [Created] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently
Rajini Sivaram created KAFKA-6131:
-------------------------------------

             Summary: Transaction markers are sometimes discarded if txns complete concurrently
                 Key: KAFKA-6131
                 URL: https://issues.apache.org/jira/browse/KAFKA-6131
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 1.0.0
            Reporter: Rajini Sivaram
            Assignee: Rajini Sivaram

The concurrent tests being added under KAFKA-6096 for the transaction coordinator fail to complete some transactions when multiple transactions are completed concurrently.

The problem is with the following code snippet. There are two very similar uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test fails because some transaction markers are discarded. {{getOrElseUpdate}} on Scala maps is not atomic. The test passes consistently with one thread.

{quote}
val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new TxnMarkerQueue(broker))
{quote}
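The race described in KAFKA-6131 comes from a check-then-insert that is not atomic: two threads can each create a queue for the same broker, and markers added to the losing queue are lost. One way to avoid this, sketched here in plain Java, is {{ConcurrentHashMap.computeIfAbsent}}, which the JDK documents as atomic. This is an illustration under stated assumptions and not necessarily the fix that was committed: {{TxnMarkerQueue}} below is a hypothetical stand-in class, and {{queueFor}} and {{distinctQueuesSeen}} are invented helpers.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MarkerQueueSketch {
    // Hypothetical stand-in for Kafka's TxnMarkerQueue; only object identity matters here.
    static final class TxnMarkerQueue {
        final int brokerId;
        TxnMarkerQueue(int brokerId) { this.brokerId = brokerId; }
    }

    private final ConcurrentHashMap<Integer, TxnMarkerQueue> markersQueuePerBroker =
            new ConcurrentHashMap<>();

    // computeIfAbsent creates the queue at most once per key, atomically.
    TxnMarkerQueue queueFor(int brokerId) {
        return markersQueuePerBroker.computeIfAbsent(brokerId, TxnMarkerQueue::new);
    }

    // Requests the same broker's queue from many threads at once and
    // counts how many distinct queue instances were handed out.
    static int distinctQueuesSeen(int threads) {
        MarkerQueueSketch sketch = new MarkerQueueSketch();
        Set<TxnMarkerQueue> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await(); // line all threads up before racing
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                seen.add(sketch.queueFor(1));
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct queues: " + distinctQueuesSeen(8));
    }
}
```

With an atomic insert every caller ends up with the same queue instance, so no marker is written to a queue that is later thrown away.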
[jira] [Updated] (KAFKA-6111) Tests for KafkaZkClient
[ https://issues.apache.org/jira/browse/KAFKA-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6111: --- Description: Some methods in KafkaZkClient have no tests at the moment and we need to fix that. (was: It has no tests at the moment and we need to fix that.) > Tests for KafkaZkClient > --- > > Key: KAFKA-6111 > URL: https://issues.apache.org/jira/browse/KAFKA-6111 > Project: Kafka > Issue Type: Sub-task >Reporter: Ismael Juma > Fix For: 1.1.0 > > > Some methods in KafkaZkClient have no tests at the moment and we need to fix > that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-5029) cleanup javadocs and logging
[ https://issues.apache.org/jira/browse/KAFKA-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-5029: --- Fix Version/s: 1.1.0 > cleanup javadocs and logging > > > Key: KAFKA-5029 > URL: https://issues.apache.org/jira/browse/KAFKA-5029 > Project: Kafka > Issue Type: Sub-task >Reporter: Onur Karaman >Assignee: Onur Karaman > Fix For: 1.1.0 > > > Remove state change logger, splitting it up into the controller logs or > broker logs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (KAFKA-6112) SSL + ACL does not seem to work
[ https://issues.apache.org/jira/browse/KAFKA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sönke Liebau reassigned KAFKA-6112:
-----------------------------------
    Assignee: Sönke Liebau

> SSL + ACL does not seem to work
> -------------------------------
>
>                 Key: KAFKA-6112
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6112
>             Project: Kafka
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Jagadish Prasath Ramu
>            Assignee: Sönke Liebau
>
> I'm trying to enable ACLs for a cluster that has SSL-based authentication
> set up.
> A similar issue (or similar exceptions) has been reported in the following JIRA:
> https://issues.apache.org/jira/browse/KAFKA-3687 (refer to the last 2 exceptions
> that were posted after the issue was closed).
> Error messages seen in the producer:
> {noformat}
> [2017-10-24 18:32:25,254] WARN Error while fetching metadata with correlation id 349 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,362] WARN Error while fetching metadata with correlation id 350 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,470] WARN Error while fetching metadata with correlation id 351 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,575] WARN Error while fetching metadata with correlation id 352 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> {noformat}
> Security-related Kafka config.properties:
> {noformat}
> ssl.keystore.location=kafka.server.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.truststore.location=kafka.server.truststore.jks
> ssl.truststore.password=abc123
> ssl.client.auth=required
> ssl.enabled.protocols = TLSv1.2,TLSv1.1,TLSv1
> ssl.keystore.type = JKS
> ssl.truststore.type = JKS
> security.inter.broker.protocol = SSL
> authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
> allow.everyone.if.no.acl.found=false
> super.users=User:Bob;User:"CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX"
> listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
> {noformat}
> Client configuration file:
> {noformat}
> security.protocol=SSL
> ssl.truststore.location=kafka.client.truststore.jks
> ssl.truststore.password=abc123
> ssl.keystore.location=kafka.client.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> ssl.truststore.type=JKS
> ssl.keystore.type=JKS
> group.id=group-1
> {noformat}
> The debug messages in the authorizer log do not show any "DENY" messages.
> {noformat}
> [2017-10-24 18:32:26,319] DEBUG operation = Create on resource = Cluster:kafka-cluster from host = 127.0.0.1 is Allow based on acl = User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX has Allow permission for operations: Create from hosts: 127.0.0.1 (kafka.authorizer.logger)
> [2017-10-24 18:32:26,319] DEBUG Principal = User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX is Allowed Operation = Create from host = 127.0.0.1 on resource = Cluster:kafka-cluster (kafka.authorizer.logger)
> {noformat}
> I have followed the scripts stated in the thread:
> http://comments.gmane.org/gmane.comp.apache.kafka.user/12619
[jira] [Commented] (KAFKA-6112) SSL + ACL does not seem to work
[ https://issues.apache.org/jira/browse/KAFKA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220485#comment-16220485 ]

Sönke Liebau commented on KAFKA-6112:
-------------------------------------

Hi [~jagadish.prasath],

I've just tested this, and ACLs work fine with SSL as the authentication method. I suspect that the issue is somewhere in your SSL config or certificate setup.
When I intentionally break my configuration so that the brokers are not recognized as superusers, I see the error that you have shown above.

The issue is probably the quotes ("") around your broker principal. Please try again with the line looking like this:

{code}
super.users=User:Bob;User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX
{code}

Let me know if you need further assistance, but I am inclined to close this as "not a bug" with the current information.
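The suggestion in the KAFKA-6112 comment, removing the quotes around the broker principal in {{super.users}}, can be illustrated with a small string-matching sketch. It assumes that the entries of {{super.users}} are split on {{;}} and compared literally with the authenticated principal; that is an assumption about the authorizer's behaviour made for illustration, not a copy of its code, and {{SuperUsersSketch}}, {{parseSuperUsers}} and {{isSuperUser}} are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

class SuperUsersSketch {
    // Principal as it appears in the authorizer debug log of the report.
    static final String BROKER_PRINCIPAL = "User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX";

    // The reported setting (with quotes) and the suggested one (without).
    static final String QUOTED = "User:Bob;User:\"CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX\"";
    static final String UNQUOTED = "User:Bob;User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX";

    // Assumed parsing rule: split on ';' and keep each trimmed entry verbatim.
    static Set<String> parseSuperUsers(String value) {
        Set<String> principals = new HashSet<>();
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                principals.add(trimmed);
            }
        }
        return principals;
    }

    // A principal is a super user only if it matches an entry character for character.
    static boolean isSuperUser(String superUsers, String principal) {
        return parseSuperUsers(superUsers).contains(principal);
    }

    public static void main(String[] args) {
        System.out.println("quoted entry matches:   " + isSuperUser(QUOTED, BROKER_PRINCIPAL));
        System.out.println("unquoted entry matches: " + isSuperUser(UNQUOTED, BROKER_PRINCIPAL));
    }
}
```

Under that assumption the quote characters become part of the configured name, so the broker's own principal no longer matches and the broker is not treated as a super user.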
[jira] [Resolved] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma resolved KAFKA-6128.
--------------------------------
    Resolution: Not A Problem

> Shutdown script does not do a clean shutdown
> --------------------------------------------
>
>                 Key: KAFKA-6128
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6128
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.1
>            Reporter: Alastair Munro
>            Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
>     exec:
>       command:
>       - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the
> same behaviour if we send a TERM signal to the kafka process (same as the
> shutdown script).
[jira] [Reopened] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma reopened KAFKA-6128:
--------------------------------
[jira] [Updated] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-6128:
-------------------------------
    Fix Version/s:     (was: 0.11.0.1)
                       (was: 0.11.0.0)
[jira] [Updated] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes
[ https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco vigotti updated KAFKA-6129:
-------------------------------------
    Description:

I started writing about this in https://issues.apache.org/jira/browse/KAFKA-2729, but I am opening this new issue because I have probably found the cause in my Kubernetes setup. In my opinion Kubernetes does nothing wrong here (all other applications, e.g. ZooKeeper, work with the same NodePort redirection). Kafka brokers fail silently (randomly in multi-broker setups) and with a misleading error from the producer, so I think Kafka should be improved by providing more robust pre-startup flight checks and by identifying and reporting this problem.

After further investigation following my reply in https://issues.apache.org/jira/browse/KAFKA-2729, using a minimum-size cluster (1 ZooKeeper + 1 Kafka broker), I found that the problem is with Kubernetes. I don't know why this issue appeared only now for me (whether something changed in recent kube-proxy versions, in Kafka 0.10+, or elsewhere). In any case, my old Kafka cluster became under-replicated and started returning various errors.

The problem happens when pods in Kubernetes are created and redirected through a NodePort service (over a static IP in my case) to expose the Kafka brokers from the host. With hostNetwork (so no redirection) everything works. Strangely, ZooKeeper works fine with NodePort (which creates a redirection rule in iptables->nat->prerouting); Kafka is the only application I have found that has problems with this Kubernetes configuration.

What is odd is that Kafka starts correctly without errors. On multi-broker clusters there are random issues; on a single-broker cluster the console producer fails with an infinite loop of:

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
```

Still, no errors are reported by the broker or by ZooKeeper.

I also came across this discussion:
https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
but the proposed solution for the host pod (allowing the advertised hostname to resolve to itself) did not work:

```
hostAliases:
- ip: "127.0.0.1"
  hostnames:
  - "---myhosthostname---"
```
[jira] [Resolved] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alastair Munro resolved KAFKA-6128.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.11.0.0
                   0.11.0.1

Broken zookeeper replication for /brokers/ids/.
[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220385#comment-16220385 ]

Alastair Munro commented on KAFKA-6128:
---------------------------------------

We seem to have a broken zookeeper. If I test on another setup, we are good.

So, in summary, kafka is connecting to a load-balanced zookeeper:2181 cluster, and zookeeper.connect uses this address. When a node is stopped, its id under /brokers/ids/ is not removed from some of the zookeeper nodes. On restart, the broker connects to one of the zookeeper nodes where /brokers/ids/ has not been updated and reports that it was not shut down properly.
[jira] [Comment Edited] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220293#comment-16220293 ]

Alastair Munro edited comment on KAFKA-6128 at 10/26/17 10:45 AM:
------------------------------------------------------------------

It also happens with 0.11.0.0. It would seem the broker ids are not being replicated in zookeeper. Kafka connects to zookeeper:2181, which is a load balancer. When it terminates, it removes its broker id on the zookeeper node it connected to, but the change is not replicated to the other two nodes. As zookeeper may scale up and down, we use zookeeper:2181 rather than a list of hosts.

Testing replication in zookeeper (e.g. creating /test using zkCli.sh) shows that replication works fine. So why don't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but the change is only made on the node the departing kafka connected to:

{code}
oc rsh zoo-0 bin/zkCli.sh ls /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls /brokers/ids
[0, 1, 2]
{code}

was (Author: amunro):
It also happens with 0.11.0.0. It would seem the broker ids are not being replicated in zookeeper. Kafka connects to zookeeper:2181, which is a load balancer. When it terminates, it removes its broker id on the zookeeper node it connected to, but the change is not replicated to the other two nodes. As zookeeper may scale up and down, we use zookeeper:2181 rather than a list of hosts.

Testing replication in zookeeper (e.g. creating /test using zkCli.sh) shows that replication works fine. So why don't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but some of the zoo nodes don't see the change:

{code}
oc rsh zoo-0 bin/zkCli.sh ls /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls /brokers/ids
[0, 1, 2]
{code}
[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220293#comment-16220293 ]

Alastair Munro commented on KAFKA-6128:
---------------------------------------

It also happens with 0.11.0.0. It would seem the broker ids are not being replicated in zookeeper. Kafka connects to zookeeper:2181, which is a load balancer. When it terminates, it removes its broker id on the zookeeper node it connected to, but the change is not replicated to the other two nodes. As zookeeper may scale up and down, we use zookeeper:2181 rather than a list of hosts.

Testing replication in zookeeper (e.g. creating /test using zkCli.sh) shows that replication works fine. So why don't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but some of the zoo nodes don't see the change:

{code}
oc rsh zoo-0 bin/zkCli.sh ls /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls /brokers/ids
[0, 1, 2]
{code}
[jira] [Updated] (KAFKA-6130) VerifiableConsumer with --max-messages doesn't exit
[ https://issues.apache.org/jira/browse/KAFKA-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Bentley updated KAFKA-6130: --- Summary: VerifiableConsumer with --max-messages doesn't exit (was: VerifiableConsume with --max-messages ) > VerifiableConsumer with --max-messages doesn't exit > --- > > Key: KAFKA-6130 > URL: https://issues.apache.org/jira/browse/KAFKA-6130 > Project: Kafka > Issue Type: Bug >Reporter: Tom Bentley >Assignee: Tom Bentley >Priority: Minor > > If I run {{kafka-verifiable-consumer.sh --max-messages=N}} I expect the tool > to consume N messages and then exit. It will actually consume as many > messages as are in the topic and then block. > The problem is that although the max messages will cause the loop in > onRecordsReceived() to break, the loop in run() will just call > onRecordsReceived() again. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6130) VerifiableConsume with --max-messages
Tom Bentley created KAFKA-6130: -- Summary: VerifiableConsume with --max-messages Key: KAFKA-6130 URL: https://issues.apache.org/jira/browse/KAFKA-6130 Project: Kafka Issue Type: Bug Reporter: Tom Bentley Assignee: Tom Bentley Priority: Minor If I run {{kafka-verifiable-consumer.sh --max-messages=N}} I expect the tool to consume N messages and then exit. It will actually consume as many messages as are in the topic and then block. The problem is that although the max messages will cause the loop in onRecordsReceived() to break, the loop in run() will just call onRecordsReceived() again. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
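The control-flow problem described in KAFKA-6130 can be sketched in isolation: breaking out of the inner per-record loop is not enough if the outer poll loop keeps going. The version below checks the limit in both loops. It is a simplified model under stated assumptions, not the tool's actual code: {{MaxMessagesSketch}}, {{isFinished}} and {{demo}} are hypothetical names, and an iterator of string batches stands in for polling a consumer.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MaxMessagesSketch {
    private final int maxMessages; // a negative value means "no limit"
    private int consumed = 0;

    MaxMessagesSketch(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    private boolean isFinished() {
        return maxMessages >= 0 && consumed >= maxMessages;
    }

    // Inner loop: stops counting records once the limit is reached.
    void onRecordsReceived(List<String> batch) {
        for (String record : batch) {
            if (isFinished()) {
                break;
            }
            consumed++;
        }
    }

    // Outer loop: also checks the limit, so it does not poll again after the break.
    int run(Iterator<List<String>> polls) {
        while (!isFinished() && polls.hasNext()) {
            onRecordsReceived(polls.next());
        }
        return consumed;
    }

    // Feeds three batches (five records in total) through a consumer with the given limit.
    static int demo(int maxMessages) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"),
                Arrays.asList("e"));
        return new MaxMessagesSketch(maxMessages).run(batches.iterator());
    }

    public static void main(String[] args) {
        System.out.println("limit 3  -> consumed " + demo(3));
        System.out.println("no limit -> consumed " + demo(-1));
    }
}
```

Without the {{!isFinished()}} test in {{run}}, the outer loop would keep fetching batches after the limit was hit, which matches the reported behaviour of never exiting.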
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220284#comment-16220284 ]

Francesco vigotti commented on KAFKA-2729:
------------------------------------------

I may have found the cause of my problem, which may not be related to this issue, because in my case a simple broker restart did not work. I have therefore created a dedicated issue:
https://issues.apache.org/jira/browse/KAFKA-6129

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The zookeeper
> cluster recovered; however, we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This appeared for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue, and possibly on whether it is related to
> https://issues.apache.org/jira/browse/KAFKA-1382 (however, we're using
> 0.8.2.1).
[jira] [Created] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes
Francesco vigotti created KAFKA-6129:
----------------------------------------

             Summary: kafka issue when exposing through nodeport in kubernetes
                 Key: KAFKA-6129
                 URL: https://issues.apache.org/jira/browse/KAFKA-6129
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.2.1
         Environment: kubernetes
            Reporter: Francesco vigotti
            Priority: Critical

I started writing about this in https://issues.apache.org/jira/browse/KAFKA-2729, but I am opening this new issue because I have probably found the cause in my Kubernetes setup. In my opinion Kubernetes does nothing wrong here (all other applications, e.g. ZooKeeper, work with the same NodePort redirection). Kafka brokers fail silently (randomly in multi-broker setups) and with a misleading error from the producer, so I think Kafka should be improved by providing more robust pre-startup flight checks and by identifying and reporting this problem.

After further investigation following my reply in https://issues.apache.org/jira/browse/KAFKA-2729, using a minimum-size cluster (1 ZooKeeper + 1 Kafka broker), I found that the problem is with Kubernetes. I don't know why this issue appeared only now for me (whether something changed in recent kube-proxy versions, in Kafka 0.10+, or elsewhere). In any case, my old Kafka cluster became under-replicated and started returning various errors.

The problem happens when pods in Kubernetes are created and redirected through a NodePort service (over a static IP in my case) to expose the Kafka brokers from the host. With hostNetwork (so no redirection) everything works. Strangely, ZooKeeper works fine with NodePort (which creates a redirection rule in iptables->nat->prerouting); Kafka is the only application I have found that has problems with this Kubernetes configuration.

What is odd is that Kafka starts correctly without errors. On multi-broker clusters there are random issues; on a single-broker cluster the console producer fails with an infinite loop of:

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
```

Still, no errors are reported by the broker or by ZooKeeper.

I also came across this discussion:
https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
but the proposed solution for the host pod (allowing the advertised hostname to resolve to itself) did not work:

```
hostAliases:
- ip: "127.0.0.1"
  hostnames:
  - "---myhosthostname---"
```
[jira] [Comment Edited] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220248#comment-16220248 ]

Alastair Munro edited comment on KAFKA-6128 at 10/26/17 9:53 AM:

Kafka says it was not shutdown cleanly and shuts down again. I will retag image back to kafka 0.11.0.0 to confirm and restart pods.

was (Author: amunro):
Kafka says it was not shutdown cleanly and shuts down again.

> Shutdown script does not do a clean shutdown
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.11.0.1
> Reporter: Alastair Munro
> Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
>     exec:
>       command:
>       - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the
> same behaviour if we send a TERM signal to the kafka process (same as the
> shutdown script).

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220248#comment-16220248 ]

Alastair Munro commented on KAFKA-6128:

Kafka says it was not shutdown cleanly and shuts down again.

> Shutdown script does not do a clean shutdown
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.11.0.1
> Reporter: Alastair Munro
> Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
>     exec:
>       command:
>       - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the
> same behaviour if we send a TERM signal to the kafka process (same as the
> shutdown script).

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6128) Shutdown script does not do a clean shutdown
Alastair Munro created KAFKA-6128:

Summary: Shutdown script does not do a clean shutdown
Key: KAFKA-6128
URL: https://issues.apache.org/jira/browse/KAFKA-6128
Project: Kafka
Issue Type: Bug
Affects Versions: 0.11.0.1
Reporter: Alastair Munro
Priority: Minor

Shutdown script (sending term signal) does not do a clean shutdown.

We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka runs the shutdown script prior to stopping the pod kafka is running on:

{code}
lifecycle:
  preStop:
    exec:
      command:
      - ./bin/kafka-server-stop.sh
{code}

This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the same behaviour if we send a TERM signal to the kafka process (same as the shutdown script).

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5609) Connect log4j should log to file by default
[ https://issues.apache.org/jira/browse/KAFKA-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220170#comment-16220170 ]

Miroslav Slivka commented on KAFKA-5609:

Can I do this? I suppose it should be DailyRollingFileAppender, right?

> Connect log4j should log to file by default
>
> Key: KAFKA-5609
> URL: https://issues.apache.org/jira/browse/KAFKA-5609
> Project: Kafka
> Issue Type: Improvement
> Components: config, KafkaConnect
> Affects Versions: 0.11.0.0
> Reporter: Yeva Byzek
> Priority: Minor
> Labels: easyfix
>
> {{https://github.com/apache/kafka/blob/trunk/config/connect-log4j.properties}}
> Currently logs to stdout. It should also log to a file by default, otherwise
> it just writes to console and messages can be lost

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
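For illustration, a file appender of the kind asked about in the comment might be configured roughly as follows in connect-log4j.properties. This is a sketch, not the change that was committed for this issue: the appender name `connectAppender`, the file name `connect.log`, the hourly `DatePattern` and the layout pattern are all assumptions, and `${kafka.logs.dir}` is assumed to be set by the start scripts in the same way as for the broker's log4j.properties.

```properties
# Keep the existing console output and add a file appender next to it.
log4j.rootLogger=INFO, stdout, connectAppender

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c:%L)%n

# Hypothetical appender name and file name; rolls the file once per hour.
log4j.appender.connectAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.connectAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.connectAppender.File=${kafka.logs.dir}/connect.log
log4j.appender.connectAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.connectAppender.layout.ConversionPattern=[%d] %p %m (%c:%L)%n
```

Listing both appenders on the root logger keeps the current stdout behaviour, so deployments that collect console output would not lose anything.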