[jira] [Resolved] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-6134.

Resolution: Fixed

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: regression
> Fix For: 1.0.0, 0.11.0.2
>
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.
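> To make the growth concrete, here is a small, self-contained sketch (purely 
> illustrative, not the controller code): each ZooKeeper write re-triggers the 
> watch, so the queue ends up holding maps of size n, n-1, ..., 1, i.e. roughly 
> n(n+1)/2 entries in total.
> {code}
> import scala.collection.mutable
> 
> // Illustrative simulation of the event queue growth described above.
> object ReassignmentQueueGrowth extends App {
>   val n = 1000
>   var remaining = (0 until n).map(i => s"topic-$i" -> Seq(1, 2, 3)).toMap
>   val eventQueue = mutable.Queue.empty[Map[String, Seq[Int]]]
> 
>   while (remaining.nonEmpty) {
>     eventQueue.enqueue(remaining)               // watch fires with the current map
>     remaining = remaining - remaining.keys.head // one partition completes, ZK node is rewritten
>   }
> 
>   val retainedEntries = eventQueue.map(_.size).sum
>   // For n = 1000: 1000 events and 500500 retained entries, ~ n(n+1)/2
>   println(s"events queued: ${eventQueue.size}, map entries retained: $retainedEntries")
> }
> {code}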



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221718#comment-16221718
 ] 

ASF GitHub Bot commented on KAFKA-6134:
---

Github user hachikuji closed the pull request at:

https://github.com/apache/kafka/pull/4141


> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: regression
> Fix For: 1.0.0, 0.11.0.2
>
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores

2017-10-26 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621
 ] 

Richard Yu edited comment on KAFKA-4499 at 10/27/17 5:04 AM:
-

I have realized that this particular way is flawed and should be reconsidered. 
There is another approach, from my understanding, which utilizes methods 
provided by TreeMap:

ThreadCache is used as a general storage area, and would be used to access 
specific NamedCaches, with each CachingWindowStore being assigned one name. 
ThreadCache has a method defined as all(String namespace) which could return 
all keys present in the specified NamedCache. However, that does not 
necessarily mean they would be sorted. 

However, we could utilize TreeMap's descendingKeySet() method to get all keys 
in descending order. This would require defining a new method in NamedCache 
which calls descendingKeySet(), but in effect it would give us a sorted set 
from which to operate. 
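
A minimal sketch of this idea using java.util.TreeMap directly; the class and 
method names below are illustrative stand-ins rather than the actual NamedCache 
internals.

{code}
import java.util.TreeMap

// Illustrative stand-in for a NamedCache whose entries live in a TreeMap.
class SortedNamedCache {
  private val cache = new TreeMap[String, Array[Byte]]()

  def put(key: String, value: Array[Byte]): Unit = cache.put(key, value)

  // The proposed addition: expose the cached keys in descending order
  // by delegating to TreeMap.descendingKeySet().
  def descendingKeys(): java.util.NavigableSet[String] = cache.descendingKeySet()
}

object SortedNamedCacheDemo extends App {
  val c = new SortedNamedCache
  Seq("a", "c", "b").foreach(k => c.put(k, Array[Byte]()))
  val it = c.descendingKeys().iterator()
  while (it.hasNext) println(it.next()) // prints c, b, a
}
{code}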


was (Author: yohan123):
I have realized that this particular way is flawed and should be reconsidered. 
There is another approach, from my understanding, which utilizes methods 
provided by TreeMap:

ThreadCache is used as a general storage area, and would be used to access 
specific NamedCaches, with each CachingWindowStore being assigned one name. 
ThreadCache has a method defined as all(String namespace) which could return 
all keys present in the specified NamedCache. However, that does not 
necessarily mean they would be sorted. 

However, we could utilize TreeMap's descendingKeySet() method to get all keys 
in ascending order. This would require defining a new method in NamedCache 
which calls descendingKeySet(), but in effect it would give us a sorted set 
from which to operate. 

> Add "getAllKeys" API for querying windowed KTable stores
> 
>
> Key: KAFKA-4499
> URL: https://issues.apache.org/jira/browse/KAFKA-4499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>  Labels: needs-kip
> Attachments: 4499-CachingWindowStore-v1.patch
>
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via 
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides 
> the method {{all()}} to scan the whole store (i.e., it returns an iterator over 
> all stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}} 
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know 
> what keys are stored in order to query it. It would be useful to provide 
> APIs like the following (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with 
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific 
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (i.e., RocksDB 
> instances), it would be quite inefficient with the current store design. Thus, 
> this feature would also require redesigning the underlying window store itself.
> Because this is a major change, a KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
>  is required. The KIP should cover the actual API design as well as the store 
> refactoring.
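> Rendered as code, the rough sketch above might look like the following; the 
> trait and the {{Windowed}} placeholder are simplified illustrations, not an 
> actual Kafka Streams API.
> {code}
> // Simplified placeholder for the Kafka Streams Windowed type.
> case class Windowed[K](key: K, windowStart: Long, windowEnd: Long)
> 
> // Rough rendering of the proposed additions for windowed stores.
> trait ProposedReadOnlyWindowStore[K, V] {
>   // all keys available in the store (maybe together with available time ranges)
>   def keys(): Iterator[K]
> 
>   // all windows for a specific time range
>   def all(timeFrom: Long, timeTo: Long): Iterator[(Windowed[K], V)]
> 
>   // the latest window for each key
>   def allLatest(): Iterator[(Windowed[K], V)]
> }
> {code}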



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores

2017-10-26 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621
 ] 

Richard Yu edited comment on KAFKA-4499 at 10/27/17 4:56 AM:
-

I have realized that this particular way is flawed and should be reconsidered. 
There is another approach, from my understanding, which utilizes methods 
provided by TreeMap:

ThreadCache is used as a general storage area, and would be used to access 
specific NamedCaches, with each CachingWindowStore being assigned one name. 
ThreadCache has a method defined as all(String namespace) which could return 
all keys present in the specified NamedCache. However, that does not 
necessarily mean they would be sorted. 

However, we could utilize TreeMap's descendingKeySet() method to get all keys 
in ascending order. This would require defining a new method in NamedCache 
which calls descendingKeySet(), but in effect it would give us a sorted set 
from which to operate. 


was (Author: yohan123):
Some information is currently lacking when it comes to fetchAll(), as the keys 
associated with the timeFrom and timeTo parameters are missing, so I decided to 
use a substitute. If permissible, a TreeMap or HashMap could be added so that 
the timestamps and their corresponding keys (byte arrays) could be recorded and 
retrieved. 

> Add "getAllKeys" API for querying windowed KTable stores
> 
>
> Key: KAFKA-4499
> URL: https://issues.apache.org/jira/browse/KAFKA-4499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>  Labels: needs-kip
> Attachments: 4499-CachingWindowStore-v1.patch
>
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via 
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides 
> the method {{all()}} to scan the whole store (i.e., it returns an iterator over 
> all stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}} 
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know 
> what keys are stored in order to query it. It would be useful to provide 
> APIs like the following (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with 
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific 
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (i.e., RocksDB 
> instances), it would be quite inefficient with the current store design. Thus, 
> this feature would also require redesigning the underlying window store itself.
> Because this is a major change, a KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
>  is required. The KIP should cover the actual API design as well as the store 
> refactoring.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store

2017-10-26 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-5668:
-
Labels: bug  (was: )

> queryable state window store range scan only returns results from one store
> ---
>
> Key: KAFKA-5668
> URL: https://issues.apache.org/jira/browse/KAFKA-5668
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>  Labels: bug
> Fix For: 1.0.0
>
>
> Queryable state range scans return results from all local stores for simple 
> key-value stores.
> For windowed stores, however, they only return results from a single local store.
> The expected behavior is that windowed store range scans would return results 
> from all local stores, as they do for simple key-value stores.
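> For context, the kind of query involved looks roughly like this. The store 
> name and types are placeholders, and the sketch assumes the KIP-155 range 
> variant {{fetch(from, to, timeFrom, timeTo)}} on {{ReadOnlyWindowStore}} plus 
> an already-started KafkaStreams instance.
> {code}
> import org.apache.kafka.streams.KafkaStreams
> import org.apache.kafka.streams.state.QueryableStoreTypes
> 
> object WindowRangeScanSketch {
>   // "my-window-store" is a placeholder store name.
>   def rangeScan(streams: KafkaStreams, timeFrom: Long, timeTo: Long): Unit = {
>     val store = streams.store("my-window-store",
>       QueryableStoreTypes.windowStore[String, java.lang.Long]())
> 
>     // Expected: results from every local store hosting partitions of the state
>     // store; the bug reported here is that only a single local store answers.
>     val iter = store.fetch("keyA", "keyZ", timeFrom, timeTo)
>     try {
>       while (iter.hasNext) {
>         val kv = iter.next()
>         println(s"${kv.key} -> ${kv.value}")
>       }
>     } finally iter.close()
>   }
> }
> {code}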



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store

2017-10-26 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-5668:
-
Fix Version/s: 1.0.0

> queryable state window store range scan only returns results from one store
> ---
>
> Key: KAFKA-5668
> URL: https://issues.apache.org/jira/browse/KAFKA-5668
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>  Labels: bug
> Fix For: 1.0.0
>
>
> Queryable state range scans return results from all local stores for simple 
> key-value stores.
> For windowed stores, however, they only return results from a single local store.
> The expected behavior is that windowed store range scans would return results 
> from all local stores, as they do for simple key-value stores.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5668) queryable state window store range scan only returns results from one store

2017-10-26 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-5668:
-
Affects Version/s: 0.11.0.1

> queryable state window store range scan only returns results from one store
> ---
>
> Key: KAFKA-5668
> URL: https://issues.apache.org/jira/browse/KAFKA-5668
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>
> Queryable state range scans return results from all local stores for simple 
> key-value stores.
> For windowed stores, however, they only return results from a single local store.
> The expected behavior is that windowed store range scans would return results 
> from all local stores, as they do for simple key-value stores.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores

2017-10-26 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221621#comment-16221621
 ] 

Richard Yu commented on KAFKA-4499:
---

Some information is currently lacking when it comes to fetchAll(), as the keys 
associated with the timeFrom and timeTo parameters are missing, so I decided to 
use a substitute. If permissible, a TreeMap or HashMap could be added so that 
the timestamps and their corresponding keys (byte arrays) could be recorded and 
retrieved. 
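
A minimal illustration of the bookkeeping being proposed; the structure and 
names are hypothetical, not existing Streams code.

{code}
import java.util.TreeMap
import scala.collection.JavaConverters._

// Hypothetical index: timestamp -> serialized keys seen at that timestamp.
object TimestampKeyIndex {
  private val keysByTimestamp = new TreeMap[java.lang.Long, List[Array[Byte]]]()

  def record(timestamp: Long, key: Array[Byte]): Unit = {
    val existing = Option(keysByTimestamp.get(timestamp)).getOrElse(Nil)
    keysByTimestamp.put(timestamp, key :: existing)
  }

  // fetchAll(timeFrom, timeTo) could then be answered from a sub-map of timestamps.
  def keysBetween(timeFrom: Long, timeTo: Long): Iterable[Array[Byte]] =
    keysByTimestamp.subMap(timeFrom, true, timeTo, true).values.asScala.flatten
}
{code}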

> Add "getAllKeys" API for querying windowed KTable stores
> 
>
> Key: KAFKA-4499
> URL: https://issues.apache.org/jira/browse/KAFKA-4499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>  Labels: needs-kip
> Attachments: 4499-CachingWindowStore-v1.patch
>
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via 
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides 
> the method {{all()}} to scan the whole store (i.e., it returns an iterator over 
> all stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}} 
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know 
> what keys are stored in order to query it. It would be useful to provide 
> APIs like the following (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with 
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific 
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (i.e., RocksDB 
> instances), it would be quite inefficient with the current store design. Thus, 
> this feature would also require redesigning the underlying window store itself.
> Because this is a major change, a KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
>  is required. The KIP should cover the actual API design as well as the store 
> refactoring.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores

2017-10-26 Thread Richard Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated KAFKA-4499:
--
Attachment: 4499-CachingWindowStore-v1.patch

This is my current prototype for the fetchAll() and all() methods. The 
fetchAll() method still might require some thought. 

> Add "getAllKeys" API for querying windowed KTable stores
> 
>
> Key: KAFKA-4499
> URL: https://issues.apache.org/jira/browse/KAFKA-4499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>  Labels: needs-kip
> Attachments: 4499-CachingWindowStore-v1.patch
>
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via 
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides 
> the method {{all()}} to scan the whole store (i.e., it returns an iterator over 
> all stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}} 
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know 
> what keys are stored in order to query it. It would be useful to provide 
> APIs like the following (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with 
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific 
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (i.e., RocksDB 
> instances), it would be quite inefficient with the current store design. Thus, 
> this feature would also require redesigning the underlying window store itself.
> Because this is a major change, a KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
>  is required. The KIP should cover the actual API design as well as the store 
> refactoring.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221596#comment-16221596
 ] 

ASF GitHub Bot commented on KAFKA-6134:
---

GitHub user hachikuji opened a pull request:

https://github.com/apache/kafka/pull/4141

KAFKA-6134: Read partition reassignment lazily on event handling

This patch prevents an O(n^2) increase in memory utilization during 
partition reassignment. Instead of storing the reassigned partitions in the 
`PartitionReassignment` object (which is added after every partition 
reassignment), we read the data fresh from ZK when processing the event.
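
Roughly, the change has the following shape (a sketch of the approach described 
above, not the actual patch): the event itself carries no reassignment data, 
and the handler re-reads it from ZooKeeper when the event is processed.

{code}
// Before: the watch parses the znode and captures the full (shrinking) map in every event.
case class EagerPartitionReassignment(reassignment: Map[String, Seq[Int]])

// After: the event is just a marker; the handler fetches the current state lazily.
case object PartitionReassignment

trait ZkReassignmentReader {
  def readPartitionsBeingReassigned(): Map[String, Seq[Int]]
}

class ReassignmentEventHandler(zk: ZkReassignmentReader) {
  def process(event: PartitionReassignment.type): Unit = {
    // Only one copy of the reassignment mapping is alive at a time.
    val reassignment = zk.readPartitionsBeingReassigned()
    reassignment.foreach { case (partition, newReplicas) =>
      // ... continue reassignment for this partition (elided) ...
      ()
    }
  }
}
{code}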

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hachikuji/kafka KAFKA-6134

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4141


commit 5131bb19f6fe7fc1939035c48ead052a0ac967a4
Author: Jason Gustafson 
Date:   2017-10-27T02:01:05Z

KAFKA-6134: Read partition reassignment lazily on event handling




> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: regression
> Fix For: 1.0.0, 0.11.0.2
>
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6131:
---
Affects Version/s: 0.11.0.1

> Transaction markers are sometimes discarded if txns complete concurrently
> -
>
> Key: KAFKA-6131
> URL: https://issues.apache.org/jira/browse/KAFKA-6131
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.1, 1.0.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
> Fix For: 1.0.0, 0.11.0.2, 1.1.0
>
>
> Concurrent tests being added under KAFKA-6096 for transaction coordinator 
> fail to complete some transactions when multiple transactions are completed 
> concurrently.
> The problem is with the following code snippet - there are two very similar 
> uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test 
> fails because some transaction markers are discarded. {{getOrElseUpdate}} on 
> Scala maps is not atomic. The test passes consistently with one thread.
> {quote}
> val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
> val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new TxnMarkerQueue(broker))
> {quote}
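> A sketch of the race and of one way to make the lookup atomic via the 
> underlying ConcurrentHashMap's computeIfAbsent; this is illustrative and not 
> necessarily the committed fix.
> {code}
> import java.util.concurrent.ConcurrentHashMap
> import java.util.function.{Function => JFunction}
> 
> object MarkerQueueSketch {
>   class TxnMarkerQueue(val brokerId: Int) // illustrative stand-in
> 
>   private val markersQueuePerBroker = new ConcurrentHashMap[Int, TxnMarkerQueue]()
> 
>   // getOrElseUpdate on the .asScala view is check-then-act: two threads can both
>   // miss the key, create separate queues, and markers added to the losing queue
>   // are silently dropped.
> 
>   // Atomic alternative: computeIfAbsent guarantees a single queue per broker.
>   def queueFor(brokerId: Int): TxnMarkerQueue =
>     markersQueuePerBroker.computeIfAbsent(brokerId, new JFunction[Int, TxnMarkerQueue] {
>       override def apply(id: Int): TxnMarkerQueue = new TxnMarkerQueue(id)
>     })
> }
> {code}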



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221583#comment-16221583
 ] 

ASF GitHub Bot commented on KAFKA-6131:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4140


> Transaction markers are sometimes discarded if txns complete concurrently
> -
>
> Key: KAFKA-6131
> URL: https://issues.apache.org/jira/browse/KAFKA-6131
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>
> Concurrent tests being added under KAFKA-6096 for transaction coordinator 
> fail to complete some transactions when multiple transactions are completed 
> concurrently.
> The problem is with the following code snippet - there are two very similar 
> uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test 
> fails because some transaction markers are discarded. {{getOrElseUpdate}} on 
> Scala maps is not atomic. The test passes consistently with one thread.
> {quote}
> val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
> val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new TxnMarkerQueue(broker))
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6135) TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException

2017-10-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated KAFKA-6135:
--
Attachment: 6135.out

Test output.

> TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException
> --
>
> Key: KAFKA-6135
> URL: https://issues.apache.org/jira/browse/KAFKA-6135
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ted Yu
> Attachments: 6135.out
>
>
> From 
> https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/2293/testReport/junit/kafka.api/TransactionsTest/testFencingOnCommit/
>  :
> {code}
> org.scalatest.junit.JUnitTestFailedError: Got an unexpected exception from a 
> fenced producer.
>   at 
> org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException(AssertionsForJUnit.scala:100)
>   at 
> org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException$(AssertionsForJUnit.scala:99)
>   at 
> org.scalatest.junit.JUnitSuite.newAssertionFailedException(JUnitSuite.scala:71)
>   at org.scalatest.Assertions.fail(Assertions.scala:1105)
>   at org.scalatest.Assertions.fail$(Assertions.scala:1101)
>   at org.scalatest.junit.JUnitSuite.fail(JUnitSuite.scala:71)
>   at 
> kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:319)
> ...
> Caused by: org.apache.kafka.common.KafkaException: Cannot execute 
> transactional method because we are in an error state
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:782)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager.beginCommit(TransactionManager.java:220)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.commitTransaction(KafkaProducer.java:617)
>   at 
> kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:313)
>   ... 48 more
> Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer 
> attempted an operation with an old epoch. Either there is a newer producer 
> with the same transactionalId, or the producer's transaction has been expired 
> by the broker.
> {code}
> Confirmed with [~apurva] that the above would not be covered by his fix for 
> KAFKA-6119.
> Temporarily marking this as a bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6135) TransactionsTest#testFencingOnCommit may fail due to unexpected KafkaException

2017-10-26 Thread Ted Yu (JIRA)
Ted Yu created KAFKA-6135:
-

 Summary: TransactionsTest#testFencingOnCommit may fail due to 
unexpected KafkaException
 Key: KAFKA-6135
 URL: https://issues.apache.org/jira/browse/KAFKA-6135
 Project: Kafka
  Issue Type: Bug
Reporter: Ted Yu


From 
https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/2293/testReport/junit/kafka.api/TransactionsTest/testFencingOnCommit/ :
{code}
org.scalatest.junit.JUnitTestFailedError: Got an unexpected exception from a 
fenced producer.
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException(AssertionsForJUnit.scala:100)
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException$(AssertionsForJUnit.scala:99)
at 
org.scalatest.junit.JUnitSuite.newAssertionFailedException(JUnitSuite.scala:71)
at org.scalatest.Assertions.fail(Assertions.scala:1105)
at org.scalatest.Assertions.fail$(Assertions.scala:1101)
at org.scalatest.junit.JUnitSuite.fail(JUnitSuite.scala:71)
at 
kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:319)
...
Caused by: org.apache.kafka.common.KafkaException: Cannot execute transactional 
method because we are in an error state
at 
org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:782)
at 
org.apache.kafka.clients.producer.internals.TransactionManager.beginCommit(TransactionManager.java:220)
at 
org.apache.kafka.clients.producer.KafkaProducer.commitTransaction(KafkaProducer.java:617)
at 
kafka.api.TransactionsTest.testFencingOnCommit(TransactionsTest.scala:313)
... 48 more
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer 
attempted an operation with an old epoch. Either there is a newer producer with 
the same transactionalId, or the producer's transaction has been expired by the 
broker.
{code}
Confirmed with [~apurva] that the above would not be covered by his fix for 
KAFKA-6119.

Temporarily marking this as a bug.
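
For reference, the commit path being exercised looks roughly like this. It is a 
sketch only (the real test is kafka.api.TransactionsTest), showing both the 
exception the test expects and the wrapped form reported above.

{code}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.ProducerFencedException

object FencedCommitSketch {
  // Sketch: committing on a possibly-fenced transactional producer.
  def commitOrClose(producer: KafkaProducer[String, String]): Unit = {
    try {
      producer.commitTransaction()
    } catch {
      case _: ProducerFencedException =>
        // Expected once a newer producer with the same transactional.id has started.
        producer.close()
      case e: KafkaException if e.getCause.isInstanceOf[ProducerFencedException] =>
        // The failure above: the fenced error surfaces wrapped in a generic KafkaException.
        producer.close()
    }
  }
}
{code}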



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-4499) Add "getAllKeys" API for querying windowed KTable stores

2017-10-26 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221561#comment-16221561
 ] 

Richard Yu edited comment on KAFKA-4499 at 10/27/17 1:38 AM:
-

I would just like to seek confirmation on something: from my understanding, for 
the fetch() methods found in ReadOnlyWindowStore, the keys must be sorted when 
returned. Does that necessarily have to be the case with the new methods as 
well? If so, is it pre- or post-serialization?



was (Author: yohan123):
I would just like to seek confirmation on something: for the fetch() methods 
found in ReadOnlyWindowStore, the keys must be sorted when returned. Does that 
necessarily have to be the case with the new methods as well?


> Add "getAllKeys" API for querying windowed KTable stores
> 
>
> Key: KAFKA-4499
> URL: https://issues.apache.org/jira/browse/KAFKA-4499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>  Labels: needs-kip
>
> Currently, both {{KTable}} and windowed-{{KTable}} stores can be queried via 
> the IQ feature. While {{ReadOnlyKeyValueStore}} (for {{KTable}} stores) provides 
> the method {{all()}} to scan the whole store (i.e., it returns an iterator over 
> all stored key-value pairs), there is no similar API for {{ReadOnlyWindowStore}} 
> (for windowed-{{KTable}} stores).
> This limits the usage of a windowed store, because the user needs to know 
> what keys are stored in order to query it. It would be useful to provide 
> APIs like the following (only a rough sketch):
>  - {{keys()}} returns all keys available in the store (maybe together with 
> available time ranges)
>  - {{all(long timeFrom, long timeTo)}} that returns all windows for a specific 
> time range
>  - {{allLatest()}} that returns the latest window for each key
> Because this feature would require scanning multiple segments (i.e., RocksDB 
> instances), it would be quite inefficient with the current store design. Thus, 
> this feature would also require redesigning the underlying window store itself.
> Because this is a major change, a KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals)
>  is required. The KIP should cover the actual API design as well as the store 
> refactoring.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6134:
---
Labels: regression  (was: )

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: regression
> Fix For: 1.0.0, 0.11.0.2
>
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6134:
---
Fix Version/s: 0.11.0.2
   1.0.0

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: regression
> Fix For: 1.0.0, 0.11.0.2
>
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics

2017-10-26 Thread Jeff Klukas (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221559#comment-16221559
 ] 

Jeff Klukas commented on KAFKA-3073:


Created a KIP: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-215%3A+Add+topic+regex+support+for+Connect+sinks

> KafkaConnect should support regular expression for topics
> -
>
> Key: KAFKA-3073
> URL: https://issues.apache.org/jira/browse/KAFKA-3073
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Gwen Shapira
>Assignee: Liquan Pei
>  Labels: needs-kip
>
> KafkaConsumer supports both a list of topics or a pattern when subscribing. 
> KafkaConnect only supports a list of topics, which is not just more of a 
> hassle to configure - it also requires more maintenance.
> I suggest adding topics.pattern as a new configuration and letting users 
> choose between 'topics' and 'topics.pattern'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics

2017-10-26 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221528#comment-16221528
 ] 

Ewen Cheslack-Postava commented on KAFKA-3073:
--

We actually already ended up with some uses of Pattern for configs in 
transformations. IIRC, I raised this point when we were discussing those, but 
not promising full Pattern support and instead just guaranteeing common regex 
functionality is a reasonable tradeoff. 99% of the time people will just be 
using very basic functionality anyway.

> KafkaConnect should support regular expression for topics
> -
>
> Key: KAFKA-3073
> URL: https://issues.apache.org/jira/browse/KAFKA-3073
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Gwen Shapira
>Assignee: Liquan Pei
>  Labels: needs-kip
>
> KafkaConsumer supports both a list of topics or a pattern when subscribing. 
> KafkaConnect only supports a list of topics, which is not just more of a 
> hassle to configure - it also requires more maintenance.
> I suggest adding topics.pattern as a new configuration and letting users 
> choose between 'topics' and 'topics.pattern'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics

2017-10-26 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221509#comment-16221509
 ] 

Ismael Juma commented on KAFKA-3073:


If we want to support regex in the protocol, we need a cross-platform solution. 
The best option I've seen so far is:

https://github.com/google/re2
https://github.com/google/re2j

> KafkaConnect should support regular expression for topics
> -
>
> Key: KAFKA-3073
> URL: https://issues.apache.org/jira/browse/KAFKA-3073
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Gwen Shapira
>Assignee: Liquan Pei
>  Labels: needs-kip
>
> KafkaConsumer supports both a list of topics or a pattern when subscribing. 
> KafkaConnect only supports a list of topics, which is not just more of a 
> hassle to configure - it also requires more maintenance.
> I suggest adding topics.pattern as a new configuration and letting users 
> choose between 'topics' and 'topics.pattern'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3073) KafkaConnect should support regular expression for topics

2017-10-26 Thread Jeff Klukas (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221506#comment-16221506
 ] 

Jeff Klukas commented on KAFKA-3073:


Since a significant segment of the Kafka user community comes from a Java 
background, I would expect many users would be most comfortable with Java's 
regex library over PCREs. It seems reasonable to me to start there (since the 
Java consumer already includes a subscribe() variant that takes a Java regex 
Pattern) and leave it to a separate effort to find a suitable PCRE library and 
add a new subscribe() variant.

If we add PCRE support in the future, we could control which type of regex is 
used via an additional 'topics.pattern.type' configuration option.
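
A small illustration of what a hypothetical topics.pattern setting would mean 
with a plain Java regex; the config key and topic names are made up for the 
example.

{code}
import java.util.regex.Pattern

object TopicPatternSketch extends App {
  // Hypothetical value of a 'topics.pattern' sink connector config.
  val topicsPattern = Pattern.compile("metrics\\..*")

  val availableTopics = Seq("metrics.cpu", "metrics.disk", "logs.app")
  val matched = availableTopics.filter(t => topicsPattern.matcher(t).matches())

  println(matched) // List(metrics.cpu, metrics.disk)
}
{code}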


> KafkaConnect should support regular expression for topics
> -
>
> Key: KAFKA-3073
> URL: https://issues.apache.org/jira/browse/KAFKA-3073
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Gwen Shapira
>Assignee: Liquan Pei
>  Labels: needs-kip
>
> KafkaConsumer supports both a list of topics or a pattern when subscribing. 
> KafkaConnect only supports a list of topics, which is not just more of a 
> hassle to configure - it also requires more maintenance.
> I suggest adding topics.pattern as a new configuration and letting users 
> choose between 'topics' and 'topics.pattern'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-6134:
--

Assignee: Jason Gustafson

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221443#comment-16221443
 ] 

Onur Karaman edited comment on KAFKA-6134 at 10/26/17 11:28 PM:


If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think 
you can just lazily read the reassignment state upon actually processing the 
PartitionReassignment instead of providing one as part of the 
PartitionReassignment instance so that you'd only have one partition 
reassignment mapping allocated at any point in time.


was (Author: onurkaraman):
If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think 
you can just lazily read the reassignment state upon actually processing the 
PartitionReassignment so that you'd only have one partition reassignment 
mapping allocated at any point in time.

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221443#comment-16221443
 ] 

Onur Karaman commented on KAFKA-6134:
-

If you want to port a fix to 1.0 without pulling in all of KAFKA-5642, I think 
you can just lazily read the reassignment state upon actually processing the 
PartitionReassignment so that you'd only have one partition reassignment 
mapping allocated at any point in time.

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
>     val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>     eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above, which adds a new event to the queue. So what 
> you get is an n^2 increase in memory, where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes

2017-10-26 Thread Roger Hoover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221441#comment-16221441
 ] 

Roger Hoover edited comment on KAFKA-6129 at 10/26/17 11:25 PM:


What did you configure for advertised.listeners?

My guess is that the endpoints returned from the initial metadata request are 
not resolvable (just a guess, though).


was (Author: theduderog):
What did you configure for advertised.listeners?

My guess is that the endpoints returned from the initial metadata request are 
not resolvable.

> kafka issue when exposing through nodeport in kubernetes
> 
>
> Key: KAFKA-6129
> URL: https://issues.apache.org/jira/browse/KAFKA-6129
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: kubernetes
>Reporter: Francesco vigotti
>Priority: Critical
>
> I started writing in this issue: 
> https://issues.apache.org/jira/browse/KAFKA-2729
> but I am opening this new issue because I have probably found the cause in my 
> Kubernetes setup. In my opinion Kubernetes did nothing wrong in its setup (all 
> other applications, e.g. ZooKeeper, work using the same NodePort redirection).
> Kafka brokers fail silently (randomly, in multi-broker setups) and with a 
> misleading error from the producer, so I think Kafka should be improved to 
> provide more robust pre-startup checks and to identify/report the actual issue.
> After further investigation from my reply in 
> https://issues.apache.org/jira/browse/KAFKA-2729 with a minimum-size cluster 
> (1 ZooKeeper + 1 Kafka broker), I found the problem.
> The problem is with Kubernetes (I don't know why this issue only appeared to 
> me now, whether something changed in recent kube-proxy versions or in Kafka 
> 0.10+, or ...). In any case, my old Kafka cluster started being 
> under-replicated and returned various problems.
> The problem happens when Kubernetes pods are created and exposed using a 
> NodePort service (over a static IP in my case) to reach Kafka brokers from the 
> host. When using hostNetwork (so no redirection) everything works. What is 
> strange is that ZooKeeper works fine with NodePort (which creates a 
> redirection rule in iptables -> nat -> PREROUTING); the only application I've 
> found problems with in this Kubernetes configuration is Kafka.
> What is weird is that Kafka starts correctly without errors, but on 
> multi-broker clusters there are random issues, while on a single-broker 
> cluster the console producer fails with an infinite loop of:
> ```
> [2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation 
> id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> [2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation 
> id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> [2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation 
> id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> ```
> Still no errors are reported from the broker or ZooKeeper.
> I also want to say that I've come across this discussion: 
> https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
> but the proposed solution for the host pod (to allow self-resolution of the 
> advertised hostname) didn't work:
> ```
> hostAliases:
>   - ip: "127.0.0.1"
>     hostnames:
>     - "---myhosthostname---"
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes

2017-10-26 Thread Roger Hoover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221441#comment-16221441
 ] 

Roger Hoover commented on KAFKA-6129:
-

What did you configure for advertised.listeners?

My guess is that the endpoints returned from the initial metadata request are 
not resolvable.
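
For reference, the two broker settings involved look like this; the values are 
placeholders for a NodePort setup, and the advertised address must be whatever 
external clients can actually reach.

{code}
# server.properties (placeholder values)
listeners=PLAINTEXT://0.0.0.0:9092
# Advertised to clients in metadata responses; for NodePort this should be the
# Kubernetes node address and the NodePort, reachable from outside the cluster.
advertised.listeners=PLAINTEXT://<node-ip>:<node-port>
{code}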

> kafka issue when exposing through nodeport in kubernetes
> 
>
> Key: KAFKA-6129
> URL: https://issues.apache.org/jira/browse/KAFKA-6129
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: kubernetes
>Reporter: Francesco vigotti
>Priority: Critical
>
> I started writing in this issue: 
> https://issues.apache.org/jira/browse/KAFKA-2729
> but I am opening this new issue because I have probably found the cause in my 
> Kubernetes setup. In my opinion Kubernetes did nothing wrong in its setup (all 
> other applications, e.g. ZooKeeper, work using the same NodePort redirection).
> Kafka brokers fail silently (randomly, in multi-broker setups) and with a 
> misleading error from the producer, so I think Kafka should be improved to 
> provide more robust pre-startup checks and to identify/report the actual issue.
> After further investigation from my reply in 
> https://issues.apache.org/jira/browse/KAFKA-2729 with a minimum-size cluster 
> (1 ZooKeeper + 1 Kafka broker), I found the problem.
> The problem is with Kubernetes (I don't know why this issue only appeared to 
> me now, whether something changed in recent kube-proxy versions or in Kafka 
> 0.10+, or ...). In any case, my old Kafka cluster started being 
> under-replicated and returned various problems.
> The problem happens when Kubernetes pods are created and exposed using a 
> NodePort service (over a static IP in my case) to reach Kafka brokers from the 
> host. When using hostNetwork (so no redirection) everything works. What is 
> strange is that ZooKeeper works fine with NodePort (which creates a 
> redirection rule in iptables -> nat -> PREROUTING); the only application I've 
> found problems with in this Kubernetes configuration is Kafka.
> What is weird is that Kafka starts correctly without errors, but on 
> multi-broker clusters there are random issues, while on a single-broker 
> cluster the console producer fails with an infinite loop of:
> ```
> [2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation 
> id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> [2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation 
> id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> [2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation 
> id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
> (org.apache.kafka.clients.NetworkClient)
> ```
> Still no errors are reported from the broker or ZooKeeper.
> I also want to say that I've come across this discussion: 
> https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
> but the proposed solution for the host pod (to allow self-resolution of the 
> advertised hostname) didn't work:
> ```
> hostAliases:
>   - ip: "127.0.0.1"
>     hostnames:
>     - "---myhosthostname---"
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221415#comment-16221415
 ] 

Jason Gustafson commented on KAFKA-6134:


[~onurkaraman] I noticed the change in trunk. Do you think the fix can be 
ported to 1.0? 

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
> val partitionReassignment = 
> ZkUtils.parsePartitionReassignmentData(data.toString)
> eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
> topicAndPartition
> // write the new list to zookeeper
>   
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above which adds a new event in the queue. So what 
> you get is an n^2 increase in memory where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221392#comment-16221392
 ] 

Onur Karaman commented on KAFKA-6134:
-

Yes, I had noticed the O(N^2) behavior a while ago. I believe this should be 
mitigated after KAFKA-5642 since PartitionReassignment is now a case object 
instead of a case class containing the remaining reassignment mapping.
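
To make the difference concrete, here is an illustrative Scala sketch (not the actual controller code from KAFKA-5642):

{code}
// Illustrative sketch only -- not the actual controller code.
object ReassignmentEventSketch {
  type Partition = (String, Int) // stand-in for the controller's topic-partition type

  // Before: each queued event carried its own snapshot of the remaining
  // reassignment, so N queued events retained N shrinking copies of the map.
  case class StatefulReassignment(remaining: Map[Partition, Seq[Int]])

  // After (the KAFKA-5642 approach described above): the event carries no state;
  // the handler re-reads the current reassignment from ZooKeeper when it runs,
  // so queued events no longer pin the old mappings in memory.
  case object StatelessReassignment
}
{code}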

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
> val partitionReassignment = 
> ZkUtils.parsePartitionReassignmentData(data.toString)
> eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
> topicAndPartition
> // write the new list to zookeeper
>   
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above which adds a new event in the queue. So what 
> you get is an n^2 increase in memory where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-6134:
---
Priority: Critical  (was: Major)

> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
>Priority: Critical
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
> val partitionReassignment = 
> ZkUtils.parsePartitionReassignmentData(data.toString)
> eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
> topicAndPartition
> // write the new list to zookeeper
>   
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This triggers the handler above which adds a new event in the queue. So what 
> you get is an n^2 increase in memory where n is the number of partitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-6134:
---
Description: 
We've had a couple users reporting spikes in memory usage when the controller 
is performing partition reassignment in 0.11. After investigation, we found 
that the controller event queue was using most of the retained memory. In 
particular, we found several thousand {{PartitionReassignment}} objects, each 
one containing one fewer partition than the previous one (see the attached 
image).

>From the code, it seems clear why this is happening. We have a watch on the 
>partition reassignment path which adds the {{PartitionReassignment}} object to 
>the event queue:

{code}
  override def handleDataChange(dataPath: String, data: Any): Unit = {
val partitionReassignment = 
ZkUtils.parsePartitionReassignmentData(data.toString)
eventManager.put(controller.PartitionReassignment(partitionReassignment))
  }
{code}

In the {{PartitionReassignment}} event handler, we iterate through all of the 
partitions in the reassignment. After we complete reassignment for each 
partition, we remove that partition and update the node in zookeeper. 

{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
topicAndPartition
// write the new list to zookeeper
  
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}

This triggers the handler above which adds a new event in the queue. So what 
you get is an n^2 increase in memory where n is the number of partitions.

  was:
We've had a couple users reporting spikes in memory usage when the controller 
is performing partition reassignment in 0.11. After investigation, we found 
that the controller event queue was using most of the retained memory. In 
particular, we found several thousand {{PartitionReassignment}} objects, each 
one containing one fewer partition than the previous one:

!Screen Shot 2017-10-26 at 3.05.40 PM.png|thumbnail!.

>From the code, it seems clear why this is happening. We have a watch on the 
>partition reassignment path which adds the {{PartitionReassignment}} object to 
>the event queue:

{code}
  override def handleDataChange(dataPath: String, data: Any): Unit = {
val partitionReassignment = 
ZkUtils.parsePartitionReassignmentData(data.toString)
eventManager.put(controller.PartitionReassignment(partitionReassignment))
  }
{code}

In the {{PartitionReassignment}} event handler, we iterate through all of the 
partitions in the reassignment. After we complete reassignment for each 
partition, we remove that partition and update the node in zookeeper. 

{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
topicAndPartition
// write the new list to zookeeper
  
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}

This triggers the handler above which adds a new event in the queue. So what 
you get is an n^2 increase in memory where n is the number of partitions.


> High memory usage on controller during partition reassignment
> -
>
> Key: KAFKA-6134
> URL: https://issues.apache.org/jira/browse/KAFKA-6134
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jason Gustafson
> Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple users reporting spikes in memory usage when the controller 
> is performing partition reassignment in 0.11. After investigation, we found 
> that the controller event queue was using most of the retained memory. In 
> particular, we found several thousand {{PartitionReassignment}} objects, each 
> one containing one fewer partition than the previous one (see the attached 
> image).
> From the code, it seems clear why this is happening. We have a watch on the 
> partition reassignment path which adds the {{PartitionReassignment}} object 
> to the event queue:
> {code}
>   override def handleDataChange(dataPath: String, data: Any): Unit = {
> val partitionReassignment = 
> ZkUtils.parsePartitionReassignmentData(data.toString)
> eventManager.put(controller.PartitionReassignment(partitionReassignment))
>   }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the 
> partitions in the reassignment. After we complete reassignment for each 
> partition, we remove that partition and update the node in zookeeper. 
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
> topicAndPartition
> // write the new list to zookeeper
>   

[jira] [Created] (KAFKA-6134) High memory usage on controller during partition reassignment

2017-10-26 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-6134:
--

 Summary: High memory usage on controller during partition 
reassignment
 Key: KAFKA-6134
 URL: https://issues.apache.org/jira/browse/KAFKA-6134
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 0.11.0.0, 0.11.0.1
Reporter: Jason Gustafson
 Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png

We've had a couple users reporting spikes in memory usage when the controller 
is performing partition reassignment in 0.11. After investigation, we found 
that the controller event queue was using most of the retained memory. In 
particular, we found several thousand {{PartitionReassignment}} objects, each 
one containing one fewer partition than the previous one:

!Screen Shot 2017-10-26 at 3.05.40 PM.png|thumbnail!.

>From the code, it seems clear why this is happening. We have a watch on the 
>partition reassignment path which adds the {{PartitionReassignment}} object to 
>the event queue:

{code}
  override def handleDataChange(dataPath: String, data: Any): Unit = {
val partitionReassignment = 
ZkUtils.parsePartitionReassignmentData(data.toString)
eventManager.put(controller.PartitionReassignment(partitionReassignment))
  }
{code}

In the {{PartitionReassignment}} event handler, we iterate through all of the 
partitions in the reassignment. After we complete reassignment for each 
partition, we remove that partition and update the node in zookeeper. 

{code}
// remove this partition from that list
val updatedPartitionsBeingReassigned = partitionsBeingReassigned - 
topicAndPartition
// write the new list to zookeeper
  
zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
{code}

This triggers the handler above which adds a new event in the queue. So what 
you get is an n^2 increase in memory where n is the number of partitions.
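
As a back-of-the-envelope check of the quadratic growth described above (a sketch, not controller code): with n partitions being reassigned, the queue ends up holding events of size n, n-1, ..., 1, i.e. n(n+1)/2 partition entries in total.

{code}
// Sketch: total partition entries retained across all queued events when
// reassigning n partitions, assuming one new event per completed partition.
def retainedEntries(n: Long): Long = (1L to n).sum // = n * (n + 1) / 2

retainedEntries(5000) // 12502500 entries -- quadratic in the partition count
{code}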



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221330#comment-16221330
 ] 

Jason Gustafson commented on KAFKA-6132:


[~pnowojski] Thanks for the update. If you're confident that it is fixed, 
perhaps you can close this issue?

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 
> transactions for writing. One of them is an apparent deadlock with the 
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably 
> reproduce it on Travis. The scenario involves a simultaneous sequence of 
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, and 
> closing them, interleaved with writing. I have created a stripped-down version 
> of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; when a deadlock is 
> detected (a 5-minute period of inactivity), the stack traces of all threads 
> are printed out. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds, and in this 
> scenario all of them appear to fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.
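
A rough sketch of the kind of stress loop described (hypothetical code, not the test from the linked repository; broker address, topic, and counts are placeholders):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object InitTransactionsStressSketch {
  def main(args: Array[String]): Unit = {
    val threads = (0 until 4).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          for (round <- 0 until 50) {
            val props = new Properties()
            props.put("bootstrap.servers", "localhost:9092") // placeholder
            props.put("transactional.id", s"stress-$i-$round") // unique per producer
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            val producer = new KafkaProducer[String, String](props)
            try {
              producer.initTransactions() // the call that reportedly hangs
              producer.beginTransaction()
              producer.send(new ProducerRecord[String, String]("stress-topic", s"msg-$round"))
              producer.commitTransaction()
            } finally producer.close()
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{code}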



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6123) MetricsReporter does not get auto-generated client.id

2017-10-26 Thread Kevin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221248#comment-16221248
 ] 

Kevin Lu commented on KAFKA-6123:
-

[~guozhang]

May I work on this?

Does this require a KIP?

> MetricsReporter does not get auto-generated client.id
> -
>
> Key: KAFKA-6123
> URL: https://issues.apache.org/jira/browse/KAFKA-6123
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, metrics
>Affects Versions: 0.11.0.0
>Reporter: Kevin Lu
>Priority: Minor
>  Labels: clients, metrics, newbie++
>
> When a {{MetricsReporter}} is configured for a client, it will receive the 
> user-specified configurations via {{Configurable.configure(Map 
> configs)}}. Likewise, {{ProducerInterceptor}} and {{ConsumerInterceptor}} 
> receive user-specified configurations in their configure methods. 
> The difference is when a user does not specify the {{client.id}} field, Kafka 
> will auto-generate client ids (producer-1, producer-2, consumer-1, 
> consumer-2, etc). This auto-generated {{client.id}} will be passed into the 
> interceptors' configure method, but it is not passed to the 
> {{MetricsReporter}} configure method.
> This makes it harder to directly map {{MetricsReporter}} with the 
> interceptors for the client when users do not specify the {{client.id}} 
> field. The {{client.id}} can be determined from identifying a metric with the 
> {{client.id}} tag, but this is hacky and requires traversal. 
> It would be useful to have the auto-generated {{client.id}} field also passed to 
> the {{MetricsReporter}}.
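
As an aside, a hypothetical sketch of the workaround hinted at above (recovering the client id from a metric tag, since configure() does not receive the auto-generated value); names are illustrative only:

{code}
import java.util
import scala.collection.JavaConverters._
import org.apache.kafka.common.metrics.{KafkaMetric, MetricsReporter}

class ClientIdSniffingReporter extends MetricsReporter {
  @volatile private var clientId: Option[String] = None

  override def configure(configs: util.Map[String, _]): Unit =
    // Only a user-specified client.id shows up here; an auto-generated one does not.
    clientId = Option(configs.get("client.id")).map(_.toString).filter(_.nonEmpty)

  override def init(metrics: util.List[KafkaMetric]): Unit =
    metrics.asScala.foreach(metricChange)

  override def metricChange(metric: KafkaMetric): Unit =
    if (clientId.isEmpty)
      clientId = Option(metric.metricName.tags.get("client-id"))

  override def metricRemoval(metric: KafkaMetric): Unit = ()

  override def close(): Unit = ()
}
{code}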



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (KAFKA-6123) MetricsReporter does not get auto-generated client.id

2017-10-26 Thread Kevin Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Lu updated KAFKA-6123:

Comment: was deleted

(was: [~guozhang]

May I work on this?

Does this require a KIP?)

> MetricsReporter does not get auto-generated client.id
> -
>
> Key: KAFKA-6123
> URL: https://issues.apache.org/jira/browse/KAFKA-6123
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, metrics
>Affects Versions: 0.11.0.0
>Reporter: Kevin Lu
>Priority: Minor
>  Labels: clients, metrics, newbie++
>
> When a {{MetricsReporter}} is configured for a client, it will receive the 
> user-specified configurations via {{Configurable.configure(Map 
> configs)}}. Likewise, {{ProducerInterceptor}} and {{ConsumerInterceptor}} 
> receive user-specified configurations in their configure methods. 
> The difference is when a user does not specify the {{client.id}} field, Kafka 
> will auto-generate client ids (producer-1, producer-2, consumer-1, 
> consumer-2, etc). This auto-generated {{client.id}} will be passed into the 
> interceptors' configure method, but it is not passed to the 
> {{MetricsReporter}} configure method.
> This makes it harder to directly map {{MetricsReporter}} with the 
> interceptors for the client when users do not specify the {{client.id}} 
> field. The {{client.id}} can be determined from identifying a metric with the 
> {{client.id}} tag, but this is hacky and requires traversal. 
> It would be useful to have the auto-generated {{client.id}} field also passed to 
> the {{MetricsReporter}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6123) MetricsReporter does not get auto-generated client.id

2017-10-26 Thread Kevin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221244#comment-16221244
 ] 

Kevin Lu commented on KAFKA-6123:
-

[~guozhang]

May I work on this?

Does this require a KIP?

> MetricsReporter does not get auto-generated client.id
> -
>
> Key: KAFKA-6123
> URL: https://issues.apache.org/jira/browse/KAFKA-6123
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, metrics
>Affects Versions: 0.11.0.0
>Reporter: Kevin Lu
>Priority: Minor
>  Labels: clients, metrics, newbie++
>
> When a {{MetricsReporter}} is configured for a client, it will receive the 
> user-specified configurations via {{Configurable.configure(Map 
> configs)}}. Likewise, {{ProducerInterceptor}} and {{ConsumerInterceptor}} 
> receive user-specified configurations in their configure methods. 
> The difference is when a user does not specify the {{client.id}} field, Kafka 
> will auto-generate client ids (producer-1, producer-2, consumer-1, 
> consumer-2, etc). This auto-generated {{client.id}} will be passed into the 
> interceptors' configure method, but it is not passed to the 
> {{MetricsReporter}} configure method.
> This makes it harder to directly map {{MetricsReporter}} with the 
> interceptors for the client when users do not specify the {{client.id}} 
> field. The {{client.id}} can be determined from identifying a metric with the 
> {{client.id}} tag, but this is hacky and requires traversal. 
> It would be useful to have the auto-generated {{client.id}} field also passed to 
> the {{MetricsReporter}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6100) Streams quick start crashes Java on Windows

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221171#comment-16221171
 ] 

ASF GitHub Bot commented on KAFKA-6100:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4136


> Streams quick start crashes Java on Windows 
> 
>
> Key: KAFKA-6100
> URL: https://issues.apache.org/jira/browse/KAFKA-6100
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
> Environment: Windows 10 VM
>Reporter: Vahid Hashemian
>Assignee: Guozhang Wang
> Fix For: 1.0.0
>
> Attachments: Screen Shot 2017-10-20 at 11.53.14 AM.png, 
> java.exe_171023_115335.dmp.zip
>
>
> *This issue was detected in 1.0.0 RC2.*
> The following step in streams quick start crashes Java on Windows 10:
> {{bin/kafka-run-class.sh 
> org.apache.kafka.streams.examples.wordcount.WordCountDemo}}
> I tracked this down to [this 
> change|https://github.com/apache/kafka/commit/196bcfca0c56420793f85514d1602bde564b0651#diff-6512f838e273b79676cac5f72456127fR67],
>  and it seems the new version of RocksDB is to blame.  I tried the quick start 
> with the previous version of RocksDB (5.7.3) and did not run into this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-6100) Streams quick start crashes Java on Windows

2017-10-26 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-6100.
--
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4136
[https://github.com/apache/kafka/pull/4136]

> Streams quick start crashes Java on Windows 
> 
>
> Key: KAFKA-6100
> URL: https://issues.apache.org/jira/browse/KAFKA-6100
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
> Environment: Windows 10 VM
>Reporter: Vahid Hashemian
> Fix For: 1.0.0
>
> Attachments: Screen Shot 2017-10-20 at 11.53.14 AM.png, 
> java.exe_171023_115335.dmp.zip
>
>
> *This issue was detected in 1.0.0 RC2.*
> The following step in streams quick start crashes Java on Windows 10:
> {{bin/kafka-run-class.sh 
> org.apache.kafka.streams.examples.wordcount.WordCountDemo}}
> I tracked this down to [this 
> change|https://github.com/apache/kafka/commit/196bcfca0c56420793f85514d1602bde564b0651#diff-6512f838e273b79676cac5f72456127fR67],
>  and it seems the new version of RocksDB is to blame.  I tried the quick start 
> with the previous version of RocksDB (5.7.3) and did not run into this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-6100) Streams quick start crashes Java on Windows

2017-10-26 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang reassigned KAFKA-6100:


Assignee: Guozhang Wang

> Streams quick start crashes Java on Windows 
> 
>
> Key: KAFKA-6100
> URL: https://issues.apache.org/jira/browse/KAFKA-6100
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
> Environment: Windows 10 VM
>Reporter: Vahid Hashemian
>Assignee: Guozhang Wang
> Fix For: 1.0.0
>
> Attachments: Screen Shot 2017-10-20 at 11.53.14 AM.png, 
> java.exe_171023_115335.dmp.zip
>
>
> *This issue was detected in 1.0.0 RC2.*
> The following step in streams quick start crashes Java on Windows 10:
> {{bin/kafka-run-class.sh 
> org.apache.kafka.streams.examples.wordcount.WordCountDemo}}
> I tracked this down to [this 
> change|https://github.com/apache/kafka/commit/196bcfca0c56420793f85514d1602bde564b0651#diff-6512f838e273b79676cac5f72456127fR67],
>  and it seems the new version of RocksDB is to blame.  I tried the quick start 
> with the previous version of RocksDB (5.7.3) and did not run into this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6133) NullPointerException in S3 Connector when using rotate.interval.ms

2017-10-26 Thread Elizabeth Bennett (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elizabeth Bennett updated KAFKA-6133:
-
Description: 
I just tried out the new rotate.interval.ms feature in the S3 connector to do 
time based flushing. I am getting this NPE on every event:

{code}[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
at 
io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
at 
io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[2017-10-20 23:21:35,233] ERROR Task is being killed and will not recover until 
manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to 
unrecoverable exception.
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748){code}

I dug into the S3 connect code a bit and it looks like the 
{{rotate.interval.ms}} feature only works if you are using the 
TimeBasedPartitioner. It will get the TimestampExtractor class from the 
TimeBasedPartitioner to determine the timestamp of the event, and will use this 
for the time based flushing.

I'm using a custom partitioner, but I'd still really like to use the 
{{rotate.interval.ms}} feature, using wall clock time to determine the flushing 
behavior.

I'd be willing to work on fixing this issue, but I want to confirm it is 
> actually a bug, and not that it was specifically designed to only work with the 
TimeBasedPartitioner. Even if it is the latter, it should probably not crash 
with an NPE.

  was:
I just tried out the new rotate.interval.ms feature in the S3 connector to do 
time based flushing. I am getting this NPE on every event:

{code}[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
at 
io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
at 
io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 

[jira] [Updated] (KAFKA-6133) NullPointerException in S3 Connector when using rotate.interval.ms

2017-10-26 Thread Elizabeth Bennett (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elizabeth Bennett updated KAFKA-6133:
-
Description: 
I just tried out the new rotate.interval.ms feature in the S3 connector to do 
time based flushing. I am getting this NPE on every event:

{code}[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
at 
io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
at 
io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[2017-10-20 23:21:35,233] ERROR Task is being killed and will not recover until 
manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to 
unrecoverable exception.
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748){code}

I dug into the S3 connect code a bit and it looks like the 
{{rotate.interval.ms}} feature only works if you are using the 
TimeBasedPartitioner. It will get the TimestampExtractor class from the 
TimeBasedPartitioner to determine the timestamp of the event, and will use this 
for the time based flushing.

I'm using a custom partitioner, but I'd still really like to use the 
{{rotate.interval.ms}} feature, using wall clock time to determine the flushing 
behavior.

I'd be willing to work on fixing this issue, but I want to confirm it is 
> actually a bug, and not that it was specifically designed to only work with the 
> TimeBasedPartitioner. Even if it is the latter, it should probably not crash 
with an NPE.

  was:
I just tried out the new rotate.interval.ms feature in the S3 connector to do 
time based flushing. I am getting this NPE on every event:

{{[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
at 
io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
at 
io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 

[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221118#comment-16221118
 ] 

Piotr Nowojski commented on KAFKA-6132:
---

Indeed, the problem seems to be solved with 1.0.0-rc3. Thanks!

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 
> transactions for writing. One of them is an apparent deadlock with the 
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably 
> reproduce it on Travis. The scenario involves a simultaneous sequence of 
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, 
> closing them interleaved with writing. I have created a stripped down version 
> of this scenario as a github project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test and in case of detecting 
> a dead lock (5 minutes period of inactivity), stack trace of all threads is 
> being printed out. Example travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> as you can see deadlock occurred in 7 out of 30 builds. It seems like in this 
> scenario all of them are failing/dead locking in exactly same way. 
> I have observed this issue both on 0.11.0.0 and 0.11.0.1 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221035#comment-16221035
 ] 

Piotr Nowojski commented on KAFKA-6132:
---

Thanks, it worked. The build is under way; we will see the answer in 20-30 minutes: 
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293308789


> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 
> transactions for writing. One of them is an apparent deadlock with the 
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably 
> reproduce it on Travis. The scenario involves a simultaneous sequence of 
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, 
> closing them interleaved with writing. I have created a stripped down version 
> of this scenario as a github project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test and in case of detecting 
> a dead lock (5 minutes period of inactivity), stack trace of all threads is 
> being printed out. Example travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> as you can see deadlock occurred in 7 out of 30 builds. It seems like in this 
> scenario all of them are failing/dead locking in exactly same way. 
> I have observed this issue both on 0.11.0.0 and 0.11.0.1 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6048) Support negative record timestamps

2017-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221014#comment-16221014
 ] 

Børge Svingen commented on KAFKA-6048:
--

@james.c:

We are trying to resolve the same problem at The New York Times. 

Maybe we can work together on this?

> Support negative record timestamps
> --
>
> Key: KAFKA-6048
> URL: https://issues.apache.org/jira/browse/KAFKA-6048
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, core, streams
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: james chien
>  Labels: needs-kip
>
> Kafka does not support negative record timestamps, and this prevents the 
> storage of historical data in Kafka. In general, negative timestamps are 
> supported by UNIX system time stamps: 
> From https://en.wikipedia.org/wiki/Unix_time
> {quote}
> The Unix time number is zero at the Unix epoch, and increases by exactly 
> 86,400 per day since the epoch. Thus 2004-09-16T00:00:00Z, 12,677 days after 
> the epoch, is represented by the Unix time number 12,677 × 86,400 = 
> 1095292800. This can be extended backwards from the epoch too, using negative 
> numbers; thus 1957-10-04T00:00:00Z, 4,472 days before the epoch, is 
> represented by the Unix time number −4,472 × 86,400 = −386380800.
> {quote}
> Allowing for negative timestamps would require multiple changes:
>  - while brokers in general do support negative timestamps, brokers use {{-1}} 
> as the default value if a producer uses an old message format (this would not 
> be compatible with supporting negative timestamps "end-to-end", as {{-1}} could 
> not be used as "unknown" anymore): we could introduce a message flag indicating 
> a missing timestamp (and let the producer throw an exception if 
> {{ConsumerRecord#timestamp()}} is called). Another possible solution might be 
> to require topics that are used by old producers to be configured with 
> {{LogAppendTime}} semantics and to reject writes to topics with 
> {{CreateTime}} semantics for older message formats
>  - {{KafkaProducer}} does not allow sending records with a negative timestamp, 
> so this would need to be fixed
>  - the Streams API drops records with negative timestamps (or fails by 
> default) -- also, some internal store implementations for windowed stores 
> assume that there are no negative timestamps when doing range queries
> There might be other gaps we need to address. This is just a summary of the 
> issues coming to my mind atm.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6048) Support negative record timestamps

2017-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221014#comment-16221014
 ] 

Børge Svingen edited comment on KAFKA-6048 at 10/26/17 6:55 PM:


[~james.c]:

We are trying to resolve the same problem at The New York Times. 

Maybe we can work together on this?


was (Author: bsvingen):
@james.c:

We are trying to resolve the same problem at The New York Times. 

Maybe we can work together on this?

> Support negative record timestamps
> --
>
> Key: KAFKA-6048
> URL: https://issues.apache.org/jira/browse/KAFKA-6048
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, core, streams
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: james chien
>  Labels: needs-kip
>
> Kafka does not support negative record timestamps, and this prevents the 
> storage of historical data in Kafka. In general, negative timestamps are 
> supported by UNIX system time stamps: 
> From https://en.wikipedia.org/wiki/Unix_time
> {quote}
> The Unix time number is zero at the Unix epoch, and increases by exactly 
> 86,400 per day since the epoch. Thus 2004-09-16T00:00:00Z, 12,677 days after 
> the epoch, is represented by the Unix time number 12,677 × 86,400 = 
> 1095292800. This can be extended backwards from the epoch too, using negative 
> numbers; thus 1957-10-04T00:00:00Z, 4,472 days before the epoch, is 
> represented by the Unix time number −4,472 × 86,400 = −386380800.
> {quote}
> Allowing for negative timestamps would require multiple changes:
>  - while brokers in general do support negative timestamps, brokers use {{-1}} 
> as the default value if a producer uses an old message format (this would not 
> be compatible with supporting negative timestamps "end-to-end", as {{-1}} could 
> not be used as "unknown" anymore): we could introduce a message flag indicating 
> a missing timestamp (and let the producer throw an exception if 
> {{ConsumerRecord#timestamp()}} is called). Another possible solution might be 
> to require topics that are used by old producers to be configured with 
> {{LogAppendTime}} semantics and to reject writes to topics with 
> {{CreateTime}} semantics for older message formats
>  - {{KafkaProducer}} does not allow sending records with a negative timestamp, 
> so this would need to be fixed
>  - the Streams API drops records with negative timestamps (or fails by 
> default) -- also, some internal store implementations for windowed stores 
> assume that there are no negative timestamps when doing range queries
> There might be other gaps we need to address. This is just a summary of the 
> issues coming to my mind atm.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221009#comment-16221009
 ] 

Ted Yu commented on KAFKA-6132:
---

Add this to your pom.xml :
{code}
<repositories>
  <repository>
    <id>staging</id>
    <url>https://repository.apache.org/content/groups/staging/</url>
  </repository>
</repositories>
{code}
and specify 1.0.0 as Kafka version.
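
For completeness, a minimal sketch of the corresponding dependency entry (assuming the test project pulls in the {{kafka-clients}} artifact; adjust to whatever artifact the project actually uses):

{code}
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>1.0.0</version>
</dependency>
{code}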

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 
> transactions for writing. One of them is an apparent deadlock with the 
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably 
> reproduce it on Travis. The scenario involves a simultaneous sequence of 
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, 
> closing them interleaved with writing. I have created a stripped down version 
> of this scenario as a github project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test and in case of detecting 
> a dead lock (5 minutes period of inactivity), stack trace of all threads is 
> being printed out. Example travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> as you can see deadlock occurred in 7 out of 30 builds. It seems like in this 
> scenario all of them are failing/dead locking in exactly same way. 
> I have observed this issue both on 0.11.0.0 and 0.11.0.1 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221003#comment-16221003
 ] 

Piotr Nowojski commented on KAFKA-6132:
---

Could you give me a hint on how to access this RC from the staging repository?

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on travis when using Kafka 0.11 
> transactions for writing. One of them is an apparent deadlock with the 
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably 
> reproduce it on Travis. The scenario involves a simultaneous sequence of 
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, 
> closing them interleaved with writing. I have created a stripped down version 
> of this scenario as a github project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test and in case of detecting 
> a dead lock (5 minutes period of inactivity), stack trace of all threads is 
> being printed out. Example travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> as you can see deadlock occurred in 7 out of 30 builds. It seems like in this 
> scenario all of them are failing/dead locking in exactly same way. 
> I have observed this issue both on 0.11.0.0 and 0.11.0.1 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6133) NullPointerException in S3 Connector when using rotate.interval.ms

2017-10-26 Thread Elizabeth Bennett (JIRA)
Elizabeth Bennett created KAFKA-6133:


 Summary: NullPointerException in S3 Connector when using 
rotate.interval.ms
 Key: KAFKA-6133
 URL: https://issues.apache.org/jira/browse/KAFKA-6133
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Affects Versions: 0.11.0.0
Reporter: Elizabeth Bennett


I just tried out the new rotate.interval.ms feature in the S3 connector to do 
time based flushing. I am getting this NPE on every event:

{{[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask)
java.lang.NullPointerException
at 
io.confluent.connect.s3.TopicPartitionWriter.rotateOnTime(TopicPartitionWriter.java:288)
at 
io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:234)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:180)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[2017-10-20 23:21:35,233] ERROR Task is being killed and will not recover until 
manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2017-10-20 23:21:35,233] ERROR Task foo-to-S3-0 threw an uncaught and 
unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to 
unrecoverable exception.
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at 
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)}}

I dug into the S3 connect code a bit and it looks like the
{{rotate.interval.ms}} feature only works if you are using the
TimeBasedPartitioner. It gets the TimestampExtractor class from the
TimeBasedPartitioner to determine the timestamp of each event, and uses this
for the time-based flushing.

I'm using a custom partitioner, but I'd still really like to use the
{{rotate.interval.ms}} feature, using wall-clock time to determine the flushing
behavior.

I'd be willing to work on fixing this issue, but I want to confirm it is
actually a bug, and not that it was specifically designed to only work with the
TimeBasedPartitioner. Even if it is the latter, it should probably not crash
with an NPE.
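
For illustration, a minimal sketch of the kind of defensive check being suggested: fall back to wall-clock time when the configured partitioner supplies no timestamp extractor, instead of dereferencing null. Class and method names below are hypothetical and are not taken from the connector source.

{code:java}
// Hypothetical sketch of a rotateOnTime-style check; names are illustrative,
// not the actual io.confluent.connect.s3 code.
class TimeBasedRotation {
    private final long rotateIntervalMs;
    private long nextRotationDeadlineMs = -1L;

    TimeBasedRotation(long rotateIntervalMs) {
        this.rotateIntervalMs = rotateIntervalMs;
    }

    /**
     * @param recordTimestampMs timestamp from the partitioner's extractor, or
     *                          null when the partitioner has none (the NPE case)
     * @param wallClockMs       current wall-clock time, e.g. System.currentTimeMillis()
     */
    boolean shouldRotate(Long recordTimestampMs, long wallClockMs) {
        // Fall back to wall-clock time rather than assuming an extractor exists.
        long now = (recordTimestampMs != null) ? recordTimestampMs : wallClockMs;
        if (nextRotationDeadlineMs < 0) {
            nextRotationDeadlineMs = now + rotateIntervalMs;
            return false;
        }
        if (now >= nextRotationDeadlineMs) {
            nextRotationDeadlineMs = now + rotateIntervalMs;
            return true;
        }
        return false;
    }
}
{code}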



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220986#comment-16220986
 ] 

Ted Yu commented on KAFKA-6132:
---

The RC is in the Maven staging repo.

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on Travis when using Kafka 0.11
> transactions for writing. One of them is an apparent deadlock with the
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably
> reproduce it on Travis. The scenario involves a simultaneous sequence of
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
> closing them, interleaved with writing. I have created a stripped-down version
> of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; if a deadlock is detected
> (a 5-minute period of inactivity), the stack traces of all threads are printed
> out. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds, and in this
> scenario all of them appear to fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220976#comment-16220976
 ] 

Piotr Nowojski edited comment on KAFKA-6132 at 10/26/17 6:39 PM:
-

I don't know. This fix is not in any released version, so how could I easily
pull it into my pom?

edit: I would have to somehow include 1.0.0-rc3 in my Travis build. I will
try to figure something out.


was (Author: pnowojski):
I don't know. This fix is not in any released version so I could easily to it 
in pom?

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on Travis when using Kafka 0.11
> transactions for writing. One of them is an apparent deadlock with the
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably
> reproduce it on Travis. The scenario involves a simultaneous sequence of
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
> closing them, interleaved with writing. I have created a stripped-down version
> of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; if a deadlock is detected
> (a 5-minute period of inactivity), the stack traces of all threads are printed
> out. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds, and in this
> scenario all of them appear to fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220976#comment-16220976
 ] 

Piotr Nowojski commented on KAFKA-6132:
---

I don't know. This fix is not in any released version, so how could I easily
pull it into my pom?

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on Travis when using Kafka 0.11
> transactions for writing. One of them is an apparent deadlock with the
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably
> reproduce it on Travis. The scenario involves a simultaneous sequence of
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
> closing them, interleaved with writing. I have created a stripped-down version
> of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; if a deadlock is detected
> (a 5-minute period of inactivity), the stack traces of all threads are printed
> out. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds, and in this
> scenario all of them appear to fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220951#comment-16220951
 ] 

Ted Yu commented on KAFKA-6132:
---

Does KAFKA-6042: 'Avoid deadlock between two groups with delayed operations'
solve the deadlock?

> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.11.0.0, 0.11.0.1
> Environment: Travis
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I have found some intermittent failures on Travis when using Kafka 0.11
> transactions for writing. One of them is an apparent deadlock with the
> following stack trace:
> {code:java}
> "KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
> nid=0x1260 waiting on condition [0x7f4b10fa4000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x947048a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
> {code}
> I was unable to reproduce it locally; however, I think I can semi-reliably
> reproduce it on Travis. The scenario involves a simultaneous sequence of
> instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
> closing them, interleaved with writing. I have created a stripped-down version
> of this scenario as a GitHub project:
> https://github.com/pnowojski/kafka-init-deadlock
> The code for the test scenario is here:
> https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java
> I have defined 30 build profiles that run this test; if a deadlock is detected
> (a 5-minute period of inactivity), the stack traces of all threads are printed
> out. Example Travis run:
> https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
> As you can see, the deadlock occurred in 7 out of 30 builds, and in this
> scenario all of them appear to fail/deadlock in exactly the same way.
> I have observed this issue on both 0.11.0.0 and 0.11.0.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220946#comment-16220946
 ] 

ASF GitHub Bot commented on KAFKA-6131:
---

GitHub user rajinisivaram opened a pull request:

https://github.com/apache/kafka/pull/4140

KAFKA-6131: Use atomic putIfAbsent to create txn marker queues



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rajinisivaram/kafka 
KAFKA-6131-txn-concurrentmap

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4140


commit b80fcb5bfc8cb70f270eb558ed6267dca99d1e08
Author: Rajini Sivaram 
Date:   2017-10-26T17:45:47Z

KAFKA-6131: Use atomic putIfAbsent to create txn marker queues




> Transaction markers are sometimes discarded if txns complete concurrently
> -
>
> Key: KAFKA-6131
> URL: https://issues.apache.org/jira/browse/KAFKA-6131
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>
> Concurrent tests being added under KAFKA-6096 for transaction coordinator 
> fail to complete some transactions when multiple transactions are completed 
> concurrently.
> The problem is with the following code snippet - there are two very similar
> uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test
> fails because some transaction markers are discarded. {{getOrElseUpdate}} on
> Scala maps is not atomic. The test passes consistently with one thread.
> {quote}
> val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new 
> ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
> val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new 
> TxnMarkerQueue(broker))
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated KAFKA-6132:
--
Description: 
I have found some intermittent failures on Travis when using Kafka 0.11
transactions for writing. One of them is an apparent deadlock with the following
stack trace:

{code:java}
"KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
nid=0x1260 waiting on condition [0x7f4b10fa4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x947048a8> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
at 
org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
{code}

I was unable to reproduce it locally; however, I think I can semi-reliably
reproduce it on Travis. The scenario involves a simultaneous sequence of
instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
closing them, interleaved with writing. I have created a stripped-down version
of this scenario as a GitHub project:
https://github.com/pnowojski/kafka-init-deadlock
The code for the test scenario is here:
https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java

I have defined 30 build profiles that run this test; if a deadlock is detected
(a 5-minute period of inactivity), the stack traces of all threads are printed
out. Example Travis run:
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
As you can see, the deadlock occurred in 7 out of 30 builds, and in this
scenario all of them appear to fail/deadlock in exactly the same way.

I have observed this issue on both 0.11.0.0 and 0.11.0.1.



  was:
I have found some intermittent failures on travis when using Kafka 0.11 
transactions for writing. One of them is a apparent deadlock with the following 
stack trace:

{code:java}
"KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
nid=0x1260 waiting on condition [0x7f4b10fa4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x947048a8> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
at 
org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
{code}

I was unsuccessful to reproduce it locally, however I think I can semi reliably 
reproduce it on Travis. Scenario includes simultaneous sequence of 
instantiating new producers, calling {{KafkaProducer.initTransactions}}, 
closing them interleaved with writing. I have created a stripped down version 
of this scenario as a github project:
https://github.com/pnowojski/kafka-init-deadlock
The code for test scenario is here:
https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java

I have defined 30 build profiles that run this test and in case of detecting a 
dead lock (5 minutes period of inactivity) stack trace of all threads is being 
printed out. Example travis run:
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
as you can see deadlock occurred in 7 out of 30 builds. It seems like in this 
scenario all of them are failing/dead locking in exactly same way. 

I have observed this issue both on 0.11.0.0 and 0.11.0.1 




> KafkaProducer.initTransactions dead locks
> -
>
> Key: KAFKA-6132
> URL: https://issues.apache.org/jira/browse/KAFKA-6132
> 

[jira] [Created] (KAFKA-6132) KafkaProducer.initTransactions dead locks

2017-10-26 Thread Piotr Nowojski (JIRA)
Piotr Nowojski created KAFKA-6132:
-

 Summary: KafkaProducer.initTransactions dead locks
 Key: KAFKA-6132
 URL: https://issues.apache.org/jira/browse/KAFKA-6132
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.11.0.0, 0.11.0.1
 Environment: Travis
Reporter: Piotr Nowojski
Priority: Critical


I have found some intermittent failures on Travis when using Kafka 0.11
transactions for writing. One of them is an apparent deadlock with the following
stack trace:

{code:java}
"KafkaTestThread[19, false]" #166 prio=5 os_prio=0 tid=0x7f4b64c29800 
nid=0x1260 waiting on condition [0x7f4b10fa4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x947048a8> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
at 
org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
{code}

I was unable to reproduce it locally; however, I think I can semi-reliably
reproduce it on Travis. The scenario involves a simultaneous sequence of
instantiating new producers, calling {{KafkaProducer.initTransactions}}, and
closing them, interleaved with writing. I have created a stripped-down version
of this scenario as a GitHub project:
https://github.com/pnowojski/kafka-init-deadlock
The code for the test scenario is here:
https://github.com/pnowojski/kafka-init-deadlock/blob/master/src/test/java/pl/nowojski/KafkaInitDeadLockTest.java

I have defined 30 build profiles that run this test; if a deadlock is detected
(a 5-minute period of inactivity), the stack traces of all threads are printed
out. Example Travis run:
https://travis-ci.org/pnowojski/kafka-init-deadlock/builds/293256284
As you can see, the deadlock occurred in 7 out of 30 builds, and in this
scenario all of them appear to fail/deadlock in exactly the same way.

I have observed this issue on both 0.11.0.0 and 0.11.0.1.
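
For readers who do not want to open the linked project, the scenario has roughly the following shape (a condensed sketch, not the actual KafkaInitDeadLockTest code; the topic name, thread counts, and serializer settings are illustrative). The report is that one of the concurrently created producers can block forever in {{initTransactions()}}.

{code:java}
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class InitTransactionsRepro {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (int t = 0; t < 20; t++) {
            final int thread = t;
            pool.submit(() -> {
                for (int i = 0; i < 100; i++) {
                    Properties props = new Properties();
                    props.put("bootstrap.servers", "localhost:9092");
                    props.put("transactional.id", "repro-" + thread + "-" + i);
                    props.put("key.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                    props.put("value.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                        producer.initTransactions();   // reported to occasionally block forever
                        producer.beginTransaction();
                        producer.send(new ProducerRecord<>("repro-topic", "key", "value"));
                        producer.commitTransaction();
                    }                                   // close() interleaved with other threads' writes
                }
            });
        }
        pool.shutdown();
    }
}
{code}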





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6131) Transaction markers are sometimes discarded if txns complete concurrently

2017-10-26 Thread Rajini Sivaram (JIRA)
Rajini Sivaram created KAFKA-6131:
-

 Summary: Transaction markers are sometimes discarded if txns 
complete concurrently
 Key: KAFKA-6131
 URL: https://issues.apache.org/jira/browse/KAFKA-6131
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


Concurrent tests being added under KAFKA-6096 for transaction coordinator fail 
to complete some transactions when multiple transactions are completed 
concurrently.

The problem is with the following code snippet - there are two very similar
uses of a concurrent map in {{TransactionMarkerChannelManager}}, and the test
fails because some transaction markers are discarded. {{getOrElseUpdate}} on
Scala maps is not atomic. The test passes consistently with one thread.
{quote}
val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = new 
ConcurrentHashMap[Int, TxnMarkerQueue]().asScala
val brokerRequestQueue = markersQueuePerBroker.getOrElseUpdate(brokerId, new 
TxnMarkerQueue(broker))
{quote}
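
To make the race concrete, here is a Java analogue of the check-then-act that {{getOrElseUpdate}} performs on the wrapped {{ConcurrentHashMap}}, next to an atomic alternative in the spirit of the linked pull request ("use atomic putIfAbsent"). {{TxnMarkerQueue}} is stood in for by a plain queue, and the names are illustrative rather than the actual broker code.

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

class MarkerQueues {
    private final ConcurrentMap<Integer, Queue<String>> markersQueuePerBroker =
        new ConcurrentHashMap<>();

    // Non-atomic check-then-act, analogous to getOrElseUpdate on a wrapped map:
    // two threads can both see null, both create a queue, and the loser's queue
    // (with any markers already enqueued) is silently replaced and lost.
    Queue<String> racyGetOrCreate(int brokerId) {
        Queue<String> queue = markersQueuePerBroker.get(brokerId);
        if (queue == null) {
            queue = new ConcurrentLinkedQueue<>();
            markersQueuePerBroker.put(brokerId, queue); // may overwrite another thread's queue
        }
        return queue;
    }

    // Atomic alternative: only one queue per broker can ever win the race,
    // so no markers are dropped.
    Queue<String> atomicGetOrCreate(int brokerId) {
        return markersQueuePerBroker.computeIfAbsent(brokerId, id -> new ConcurrentLinkedQueue<>());
    }
}
{code}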





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6111) Tests for KafkaZkClient

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6111:
---
Description: Some methods in KafkaZkClient have no tests at the moment and 
we need to fix that.  (was: It has no tests at the moment and we need to fix 
that.)

> Tests for KafkaZkClient
> ---
>
> Key: KAFKA-6111
> URL: https://issues.apache.org/jira/browse/KAFKA-6111
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Ismael Juma
> Fix For: 1.1.0
>
>
> Some methods in KafkaZkClient have no tests at the moment and we need to fix 
> that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5029) cleanup javadocs and logging

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-5029:
---
Fix Version/s: 1.1.0

> cleanup javadocs and logging
> 
>
> Key: KAFKA-5029
> URL: https://issues.apache.org/jira/browse/KAFKA-5029
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Onur Karaman
>Assignee: Onur Karaman
> Fix For: 1.1.0
>
>
> Remove state change logger, splitting it up into the controller logs or 
> broker logs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-6112) SSL + ACL does not seem to work

2017-10-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/KAFKA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau reassigned KAFKA-6112:
---

Assignee: Sönke Liebau

> SSL + ACL does not seem to work
> ---
>
> Key: KAFKA-6112
> URL: https://issues.apache.org/jira/browse/KAFKA-6112
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jagadish Prasath Ramu
>Assignee: Sönke Liebau
>
> I'm trying to enable ACL for a cluster that has SSL based authentication 
> setup.
> Similar issue (or exceptions) has been reported in the following JIRA:
> https://issues.apache.org/jira/browse/KAFKA-3687 (refer the last 2 exceptions 
> that were posted after the issue was closed).
> error messages seen in Producer:
> {noformat}
> [2017-10-24 18:32:25,254] WARN Error while fetching metadata with correlation 
> id 349 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,362] WARN Error while fetching metadata with correlation 
> id 350 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,470] WARN Error while fetching metadata with correlation 
> id 351 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,575] WARN Error while fetching metadata with correlation 
> id 352 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> {noformat}
> security related kafka config.properties:
> {noformat}
> ssl.keystore.location=kafka.server.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.truststore.location=kafka.server.truststore.jks
> ssl.truststore.password=abc123
> ssl.client.auth=required
> ssl.enabled.protocols = TLSv1.2,TLSv1.1,TLSv1
> ssl.keystore.type = JKS
> ssl.truststore.type = JKS
> security.inter.broker.protocol = SSL
> authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
> allow.everyone.if.no.acl.found=false
> super.users=User:Bob;User:"CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX"
> listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
> {noformat}
> client configuration file:
> {noformat}
> security.protocol=SSL
> ssl.truststore.location=kafka.client.truststore.jks
> ssl.truststore.password=abc123
> ssl.keystore.location=kafka.client.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> ssl.truststore.type=JKS
> ssl.keystore.type=JKS
> group.id=group-1
> {noformat}
> The debug messages of the authorizer log do not show any "DENY" messages.
> {noformat}
> [2017-10-24 18:32:26,319] DEBUG operation = Create on resource = 
> Cluster:kafka-cluster from host = 127.0.0.1 is Allow based on acl = 
> User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX has Allow permission for 
> operations: Create from hosts: 127.0.0.1 (kafka.authorizer.logger)
> [2017-10-24 18:32:26,319] DEBUG Principal = 
> User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX is Allowed Operation = 
> Create from host = 127.0.0.1 on resource = Cluster:kafka-cluster 
> (kafka.authorizer.logger)
> {noformat}
> I have followed the scripts stated in the thread:
> http://comments.gmane.org/gmane.comp.apache.kafka.user/12619



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6112) SSL + ACL does not seem to work

2017-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220485#comment-16220485
 ] 

Sönke Liebau commented on KAFKA-6112:
-

Hi [~jagadish.prasath],

I've just tested this and ACLs work fine with SSL as the authentication method.

I suspect that the issue is somewhere in your SSL config or certificate setup.
When I intentionally break my configuration so that the brokers are not
recognized as superusers, I see the error that you have shown above.

The issue is probably the quotes ("") around your broker principal. Please try
again with the line looking like this:


{code}
super.users=User:Bob;User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX
{code}


Let me know if you need further assistance, but I am inclined to close this as 
"not a bug" with the current information.

> SSL + ACL does not seem to work
> ---
>
> Key: KAFKA-6112
> URL: https://issues.apache.org/jira/browse/KAFKA-6112
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Jagadish Prasath Ramu
>
> I'm trying to enable ACL for a cluster that has SSL based authentication 
> setup.
> Similar issue (or exceptions) has been reported in the following JIRA:
> https://issues.apache.org/jira/browse/KAFKA-3687 (refer the last 2 exceptions 
> that were posted after the issue was closed).
> error messages seen in Producer:
> {noformat}
> [2017-10-24 18:32:25,254] WARN Error while fetching metadata with correlation 
> id 349 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,362] WARN Error while fetching metadata with correlation 
> id 350 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,470] WARN Error while fetching metadata with correlation 
> id 351 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2017-10-24 18:32:25,575] WARN Error while fetching metadata with correlation 
> id 352 : {t1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> {noformat}
> security related kafka config.properties:
> {noformat}
> ssl.keystore.location=kafka.server.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.truststore.location=kafka.server.truststore.jks
> ssl.truststore.password=abc123
> ssl.client.auth=required
> ssl.enabled.protocols = TLSv1.2,TLSv1.1,TLSv1
> ssl.keystore.type = JKS
> ssl.truststore.type = JKS
> security.inter.broker.protocol = SSL
> authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
> allow.everyone.if.no.acl.found=false
> super.users=User:Bob;User:"CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX"
> listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
> {noformat}
> client configuration file:
> {noformat}
> security.protocol=SSL
> ssl.truststore.location=kafka.client.truststore.jks
> ssl.truststore.password=abc123
> ssl.keystore.location=kafka.client.keystore.jks
> ssl.keystore.password=abc123
> ssl.key.password=abc123
> ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> ssl.truststore.type=JKS
> ssl.keystore.type=JKS
> group.id=group-1
> {noformat}
> The debug messages of the authorizer log do not show any "DENY" messages.
> {noformat}
> [2017-10-24 18:32:26,319] DEBUG operation = Create on resource = 
> Cluster:kafka-cluster from host = 127.0.0.1 is Allow based on acl = 
> User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX has Allow permission for 
> operations: Create from hosts: 127.0.0.1 (kafka.authorizer.logger)
> [2017-10-24 18:32:26,319] DEBUG Principal = 
> User:CN=localhost,OU=XXX,O=,L=XXX,ST=XX,C=XX is Allowed Operation = 
> Create from host = 127.0.0.1 on resource = Cluster:kafka-cluster 
> (kafka.authorizer.logger)
> {noformat}
> I have followed the scripts stated in the thread:
> http://comments.gmane.org/gmane.comp.apache.kafka.user/12619



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma resolved KAFKA-6128.

Resolution: Not A Problem

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma reopened KAFKA-6128:


> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6128:
---
Fix Version/s: (was: 0.11.0.1)
   (was: 0.11.0.0)

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes

2017-10-26 Thread Francesco vigotti (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco vigotti updated KAFKA-6129:
-
Description: 
I started writing in this issue:
https://issues.apache.org/jira/browse/KAFKA-2729
but I'm opening this new issue because I've probably found the cause in my
Kubernetes setup. In my opinion Kubernetes did nothing wrong in its setup (all
other applications work using the same NodePort redirection, e.g. ZooKeeper).
Kafka brokers fail silently (randomly, in multi-broker setups) and with a
misleading error from the producer, so I think Kafka should be improved to
provide more robust pre-startup flight checks that identify and report the
actual issue.

After further investigation following my reply at
https://issues.apache.org/jira/browse/KAFKA-2729, using a minimum-size cluster
(1 zk + 1 kafka-broker), I've found the problem: it is with Kubernetes. (I
don't know why this issue appeared only now, whether something changed in
recent kube-proxy versions or in Kafka 0.10+, or ...) In any case, my old Kafka
cluster started being under-replicated and returned various problems.

The problem happens when Kubernetes pods are created and redirected using a
NodePort service (over a static IP in my case) to expose Kafka brokers from the
host. When using hostNetwork (so no redirection) everything works. What is
strange is that ZooKeeper works fine with NodePort (which creates a redirection
rule in iptables->nat->prerouting); the only application I've found problems
with in this Kubernetes configuration is Kafka. What is weird is that Kafka
starts correctly without errors, but on multi-broker clusters there are random
issues, while on a single-broker cluster the console producer fails with an
infinite loop of:

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation 
id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation 
id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation 
id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
```
Still no errors are reported from the broker or ZooKeeper.
I also want to mention that I've come across this discussion:
https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
but the proposed solution for the host pod (to allow self-resolving of the
advertised hostname) didn't work:

```
hostAliases:
- ip: "127.0.0.1"
  hostnames:
  - "---myhosthostname---"
```




  was:
I've started writing in this issue: 
https://issues.apache.org/jira/browse/KAFKA-2729
but then I'm going to open this new issue because I've probably found the cause 
in my kubernetes setup, but In my opinion kubernetes did nothing wrong in his 
setup ( and all other application works using the same nodeport redirection , 
ie: zookeeper )
kafka brokers fails , silently (randomly in multiple brokers setup)  and with a 
misleading error from producer so I think that Kafka should be improved, 
providing more robust pre-startup flight-checks and identifying/reporting the 
current issue 

After further investigation from my reply here 
https://issues.apache.org/jira/browse/KAFKA-2729  with a minimum size cluster ( 
1 zk + 1 kafka-broker ) I've found the problem, 
the problem is with kubernetes, ( I don't know why this issue appeared only now 
to me , if something changed in recent kube-proxy versions or in kafka 0.10+ , 
or ... ) 
anyway my old kafka cluster started being underreplicated and return various 
problem , 

the problem happens when in kubernetes pods are created and redirected using a 
nodeport-service ( over a static ip in my case ) to expose kafka brokers from 
the host, when using hostNetwork  ( so no redirection ) everything works, what 
is strange is that zookeeper instead works fine with nodeport ( which create a 
redirection rule in iptables->nat->prerouting ) the only application I've found 
problems with this kubernetes configuration is kafka,
what is weird is that kafka starts correctly without errors, but on multiple 
broker clusters there are random issues, on single broker cluster instead the 
console-producer fails with infinite looop of :

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation 
id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation 
id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation 
id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 

[jira] [Resolved] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alastair Munro resolved KAFKA-6128.
---
   Resolution: Fixed
Fix Version/s: 0.11.0.0
   0.11.0.1

Broken zookeeper replication for /brokers/ids/.

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
> Fix For: 0.11.0.1, 0.11.0.0
>
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220385#comment-16220385
 ] 

Alastair Munro commented on KAFKA-6128:
---

We seem to have a broken zookeeper. If I test on another setup, we are good. In
summary, kafka is connecting to a load-balanced zookeeper:2181 cluster, and
zookeeper.connect uses this. When a node is stopped, the id
/brokers/ids/ is not removed from some of the zookeeper nodes. On
restart, the broker connects to one of the zookeeper nodes where
/brokers/ids/ has not been updated and reports that it was not
shut down properly.

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220293#comment-16220293
 ] 

Alastair Munro edited comment on KAFKA-6128 at 10/26/17 10:45 AM:
--

It does it with 0.11.0.0. It would seem the broker ids are not being
replicated in zookeeper. Kafka connects to zookeeper:2181, which is a load
balancer. Then when it terminates, it removes the broker id on the zookeeper it
connected to, but the change is not replicated to the other two nodes. As
zookeeper may scale up and down, we use zookeeper:2181 rather than a list of
hosts. Testing replication in zookeeper (e.g. creating /test using zkCli.sh),
replication works fine. So why don't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but the change is only made to
the node the departing kafka connected to:

{code}
oc rsh zoo-0 bin/zkCli.sh ls  /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls  /brokers/ids
[0, 1, 2]
{code}


was (Author: amunro):
It does it with 0.11.0.0. It would seem the broker id's are not being 
replicated in zookeeper. So kafka connects to zookeeper:2181 which is a load 
balancer. Then when it terminates, it removes the broker-id on the zookeeper it 
connected to, but the change is not replicated to the other two nodes. As 
zookeeper may scale up and down, we use zookeeper:2181 rather than a list of 
hosts. Testing replication in zookeeper (eg creating /test using zkCli.sh), 
replication works fine. So why doesn't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but some of the zoo nodes 
don't see the change:

{code}
oc rsh zoo-0 bin/zkCli.sh ls  /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls  /brokers/ids
[0, 1, 2]
{code}

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220293#comment-16220293
 ] 

Alastair Munro commented on KAFKA-6128:
---

It does it with 0.11.0.0. It would seem the broker ids are not being
replicated in zookeeper. Kafka connects to zookeeper:2181, which is a load
balancer. Then when it terminates, it removes the broker id on the zookeeper it
connected to, but the change is not replicated to the other two nodes. As
zookeeper may scale up and down, we use zookeeper:2181 rather than a list of
hosts. Testing replication in zookeeper (e.g. creating /test using zkCli.sh),
replication works fine. So why don't the broker changes replicate?

I scaled this down to have only brokers 0 and 1, but some of the zoo nodes
don't see the change:

{code}
oc rsh zoo-0 bin/zkCli.sh ls  /brokers/ids
[0, 1]

oc rsh zoo-1 bin/zkCli.sh ls  /brokers/ids
[0, 1, 2]
{code}

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6130) VerifiableConsumer with --max-messages doesn't exit

2017-10-26 Thread Tom Bentley (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Bentley updated KAFKA-6130:
---
Summary: VerifiableConsumer with --max-messages doesn't exit  (was: 
VerifiableConsume with --max-messages )

> VerifiableConsumer with --max-messages doesn't exit
> ---
>
> Key: KAFKA-6130
> URL: https://issues.apache.org/jira/browse/KAFKA-6130
> Project: Kafka
>  Issue Type: Bug
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>
> If I run {{kafka-verifiable-consumer.sh --max-messages=N}} I expect the tool
> to consume N messages and then exit. It will actually consume as many
> messages as are in the topic and then block.
> The problem is that although reaching the max messages will cause the loop in
> onRecordsReceived() to break, the loop in run() will just call
> onRecordsReceived() again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6130) VerifiableConsume with --max-messages

2017-10-26 Thread Tom Bentley (JIRA)
Tom Bentley created KAFKA-6130:
--

 Summary: VerifiableConsume with --max-messages 
 Key: KAFKA-6130
 URL: https://issues.apache.org/jira/browse/KAFKA-6130
 Project: Kafka
  Issue Type: Bug
Reporter: Tom Bentley
Assignee: Tom Bentley
Priority: Minor


If I run {{kafka-verifiable-consumer.sh --max-messages=N}} I expect the tool to
consume N messages and then exit. It will actually consume as many messages as
are in the topic and then block.

The problem is that although reaching the max messages will cause the loop in
onRecordsReceived() to break, the loop in run() will just call
onRecordsReceived() again.
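
The following is a hypothetical sketch of the control flow described above and of the kind of fix it suggests: the outer loop also has to observe the message budget. Field and method names are illustrative, not the actual VerifiableConsumer source.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

class MaxMessagesLoop {
    private final Consumer<String, String> consumer;
    private final long maxMessages;                     // negative means "no limit"
    private final AtomicBoolean shutdown = new AtomicBoolean(false);
    private long consumedMessages = 0;

    MaxMessagesLoop(Consumer<String, String> consumer, long maxMessages) {
        this.consumer = consumer;
        this.maxMessages = maxMessages;
    }

    private boolean maxMessagesReached() {
        return maxMessages >= 0 && consumedMessages >= maxMessages;
    }

    void onRecordsReceived(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            consumedMessages++;                         // a real tool would verify/print 'record' here
            if (maxMessagesReached())
                break;                                  // this break alone is what the report calls insufficient
        }
    }

    public void run() {
        // The fix is to re-check the budget in the outer loop as well; otherwise
        // run() keeps calling poll()/onRecordsReceived() after the limit is hit.
        while (!shutdown.get() && !maxMessagesReached()) {
            onRecordsReceived(consumer.poll(1000));
        }
        consumer.close();
    }
}
{code}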



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-26 Thread Francesco vigotti (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220284#comment-16220284
 ] 

Francesco vigotti commented on KAFKA-2729:
--

I may have found the cause of my issue, which may not be related to this
topic because in my case a simple broker restart didn't work, so I've created a
dedicated issue: https://issues.apache.org/jira/browse/KAFKA-6129


> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This was logged for all of the topics on the affected brokers. Both brokers
> only recovered after a restart. Our own investigation yielded nothing; I was
> hoping you could shed some light on this issue. Possibly it's related to
> https://issues.apache.org/jira/browse/KAFKA-1382, however we're using
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6129) kafka issue when exposing through nodeport in kubernetes

2017-10-26 Thread Francesco vigotti (JIRA)
Francesco vigotti created KAFKA-6129:


 Summary: kafka issue when exposing through nodeport in kubernetes
 Key: KAFKA-6129
 URL: https://issues.apache.org/jira/browse/KAFKA-6129
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.10.2.1
 Environment: kubernetes
Reporter: Francesco vigotti
Priority: Critical


I started writing in this issue:
https://issues.apache.org/jira/browse/KAFKA-2729
but I'm opening this new issue because I've probably found the cause in my
Kubernetes setup. In my opinion Kubernetes did nothing wrong in its setup (all
other applications work using the same NodePort redirection, e.g. ZooKeeper).
Kafka brokers fail silently (randomly, in multi-broker setups) and with a
misleading error from the producer, so I think Kafka should be improved to
provide more robust pre-startup flight checks that identify and report the
actual issue.

After further investigation following my reply at
https://issues.apache.org/jira/browse/KAFKA-2729, using a minimum-size cluster
(1 zk + 1 kafka-broker), I've found the problem: it is with Kubernetes. (I
don't know why this issue appeared only now, whether something changed in
recent kube-proxy versions or in Kafka 0.10+, or ...) In any case, my old Kafka
cluster started being under-replicated and returned various problems.

The problem happens when Kubernetes pods are created and redirected using a
NodePort service (over a static IP in my case) to expose Kafka brokers from the
host. When using hostNetwork (so no redirection) everything works. What is
strange is that ZooKeeper works fine with NodePort (which creates a redirection
rule in iptables->nat->prerouting); the only application I've found problems
with in this Kubernetes configuration is Kafka. What is weird is that Kafka
starts correctly without errors, but on multi-broker clusters there are random
issues, while on a single-broker cluster the console producer fails with an
infinite loop of:

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation 
id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation 
id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation 
id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} 
(org.apache.kafka.clients.NetworkClient)
```
Still no errors are reported from the broker or ZooKeeper.
I also want to mention that I've come across this discussion:
https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
but the proposed solution for the host pod (to allow self-resolving of the
advertised hostname) didn't work:

```
hostAliases:
- ip: "127.0.0.1"
  hostnames:
  - "---myhosthostname---"
```







--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220248#comment-16220248
 ] 

Alastair Munro edited comment on KAFKA-6128 at 10/26/17 9:53 AM:
-

Kafka says it was not shut down cleanly and shuts down again.

I will retag the image back to Kafka 0.11.0.0 to confirm and restart the pods.


was (Author: amunro):
Kafka says it was not shutdown cleanly and shuts down again.

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220248#comment-16220248
 ] 

Alastair Munro commented on KAFKA-6128:
---

Kafka says it was not shut down cleanly and shuts down again.

> Shutdown script does not do a clean shutdown
> 
>
> Key: KAFKA-6128
> URL: https://issues.apache.org/jira/browse/KAFKA-6128
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.1
>Reporter: Alastair Munro
>Priority: Minor
>
> Shutdown script (sending term signal) does not do a clean shutdown.
> We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
> runs the shutdown script prior to stopping the pod kafka is running on:
> {code}
> lifecycle:
>   preStop:
> exec:
>   command:
>   - ./bin/kafka-server-stop.sh
> {code}
> This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the 
> same behaviour if we send a TERM signal to the kafka process (same as the 
> shutdown script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6128) Shutdown script does not do a clean shutdown

2017-10-26 Thread Alastair Munro (JIRA)
Alastair Munro created KAFKA-6128:
-

 Summary: Shutdown script does not do a clean shutdown
 Key: KAFKA-6128
 URL: https://issues.apache.org/jira/browse/KAFKA-6128
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.1
Reporter: Alastair Munro
Priority: Minor


Shutdown script (sending term signal) does not do a clean shutdown.

We are running kafka in kubernetes/openshift 0.11.0.0. The statefulset kafka 
runs the shutdown script prior to stopping the pod kafka is running on:

{code}
lifecycle:
  preStop:
exec:
  command:
  - ./bin/kafka-server-stop.sh
{code}

This worked perfectly in 0.11.0.0 but doesn't in 0.11.0.1. Also we see the same 
behaviour if we send a TERM signal to the kafka process (same as the shutdown 
script).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5609) Connect log4j should log to file by default

2017-10-26 Thread Miroslav Slivka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220170#comment-16220170
 ] 

Miroslav Slivka commented on KAFKA-5609:


Can I do this? I suppose it should be DailyRollingFileAppender, right?
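
For reference, a minimal sketch of what such a default could look like, assuming the log4j 1.x properties format used by the other Kafka log4j configs; the appender name and file path are illustrative, not a decided convention:

{code}
# Keep console output and add a daily-rolling file appender (illustrative values).
log4j.rootLogger=INFO, stdout, connectFile

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c:%L)%n

log4j.appender.connectFile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.connectFile.DatePattern='.'yyyy-MM-dd
log4j.appender.connectFile.File=${kafka.logs.dir}/connect.log
log4j.appender.connectFile.layout=org.apache.log4j.PatternLayout
log4j.appender.connectFile.layout.ConversionPattern=[%d] %p %m (%c:%L)%n
{code}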

> Connect log4j should log to file by default
> ---
>
> Key: KAFKA-5609
> URL: https://issues.apache.org/jira/browse/KAFKA-5609
> Project: Kafka
>  Issue Type: Improvement
>  Components: config, KafkaConnect
>Affects Versions: 0.11.0.0
>Reporter: Yeva Byzek
>Priority: Minor
>  Labels: easyfix
>
> {{https://github.com/apache/kafka/blob/trunk/config/connect-log4j.properties}}
> currently logs to stdout. It should also log to a file by default; otherwise
> it just writes to the console and messages can be lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)