[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821626#comment-16821626
 ] 

ASF GitHub Bot commented on KAFKA-7965:
---

hachikuji commented on pull request #6608: KAFKA-7965; Fix flaky test 
ConsumerBounceTest
URL: https://github.com/apache/kafka/pull/6608
 
 
   We suspect the problem might be a race condition after broker startup where 
the consumer has yet to find the coordinator and rebalance. The fix here rolls 
all the brokers first and then waits for the expected exception.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821620#comment-16821620
 ] 

Matthias J. Sax commented on KAFKA-8108:


[https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3876/testReport/junit/kafka.api/UserQuotaTest/testThrottledProducerConsumer/]
 with
should have been throttled
error

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Component/s: core

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Critical
> Fix For: 2.3.0
>
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Affects Version/s: 2.3.0

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Fix Version/s: 2.3.0

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Critical
> Fix For: 2.3.0
>
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Issue Type: Bug  (was: Improvement)

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Priority: Critical  (was: Major)

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Critical
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8108) Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8108:
---
Labels: flaky-test  (was: )

> Flaky Test kafka.api.ClientIdQuotaTest.testThrottledProducerConsumer
> 
>
> Key: KAFKA-8108
> URL: https://issues.apache.org/jira/browse/KAFKA-8108
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3.0
>Reporter: Guozhang Wang
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> {code}
> java.lang.AssertionError: Client with id=QuotasTestProducer-!@#$%^&*() should 
> have been throttled
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> kafka.api.QuotaTestClients.verifyThrottleTimeMetric(BaseQuotaTest.scala:229)
>   at 
> kafka.api.QuotaTestClients.verifyProduceThrottle(BaseQuotaTest.scala:215)
>   at 
> kafka.api.BaseQuotaTest.testThrottledProducerConsumer(BaseQuotaTest.scala:82)
> {code}
> https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3230/testReport/junit/kafka.api/ClientIdQuotaTest/testThrottledProducerConsumer/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-4212) Add a key-value store that is a TTL persistent cache

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821608#comment-16821608
 ] 

Matthias J. Sax commented on KAFKA-4212:


If you configures topic cleanup policy as "delete,compact" you might want to 
use a windowed store that allows you expire data. You might need to use 
Processor API instead of KTable abstraction though.

For the broker, it would be pretty difficult to send tombstones, because if 
data is expired, the deleted log segment may contain a mix of records: some 
that don't have any future updates and some that have future updates. 
Tombstones would only be correct for those records with no further updates. To 
distinguish both, the broker would need to read the whole topic. Also, there 
would be a race between expiring, sending tombstones and newly appended records.

Last but not least, the read-path is quite different and it would be difficult 
or impossible to inject additional tombstones.

> Add a key-value store that is a TTL persistent cache
> 
>
> Key: KAFKA-4212
> URL: https://issues.apache.org/jira/browse/KAFKA-4212
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.0.1
>Reporter: Elias Levy
>Priority: Major
>  Labels: api
>
> Some jobs needs to maintain as state a large set of key-values for some 
> period of time.  I.e. they need to maintain a TTL cache of values potentially 
> larger than memory. 
> Currently Kafka Streams provides non-windowed and windowed key-value stores.  
> Neither is an exact fit to this use case.  
> The {{RocksDBStore}}, a {{KeyValueStore}}, stores one value per key as 
> required, but does not support expiration.  The TTL option of RocksDB is 
> explicitly not used.
> The {{RocksDBWindowsStore}}, a {{WindowsStore}}, can expire items via segment 
> dropping, but it stores multiple items per key, based on their timestamp.  
> But this store can be repurposed as a cache by fetching the items in reverse 
> chronological order and returning the first item found.
> KAFKA-2594 introduced a fixed-capacity in-memory LRU caching store, but here 
> we desire a variable-capacity memory-overflowing TTL caching store.
> Although {{RocksDBWindowsStore}} can be repurposed as a cache, it would be 
> useful to have an official and proper TTL cache API and implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reopened KAFKA-7965:

  Assignee: Jason Gustafson  (was: huxihx)

I'm still seeing this: 
[https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3890/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/].
 

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7647) Flaky test LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821586#comment-16821586
 ] 

Matthias J. Sax commented on KAFKA-7647:


[https://builds.apache.org/blue/organizations/jenkins/kafka-2.1-jdk8/detail/kafka-2.1-jdk8/166/tests]

> Flaky test 
> LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic
> -
>
> Key: KAFKA-7647
> URL: https://issues.apache.org/jira/browse/KAFKA-7647
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core, unit tests
>Affects Versions: 2.1.1, 2.3.0
>Reporter: Dong Lin
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> {code}
> kafka.log.LogCleanerParameterizedIntegrationTest >
> testCleansCombinedCompactAndDeleteTopic[3] FAILED
>     java.lang.AssertionError: Contents of the map shouldn't change
> expected: (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
> (354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
> 2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
> (348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
> was: (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
> 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
> (299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
> (355,355))>
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failNotEquals(Assert.java:834)
>         at org.junit.Assert.assertEquals(Assert.java:118)
>         at
> kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7647) Flaky test LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-7647:
---
Affects Version/s: 2.1.1

> Flaky test 
> LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic
> -
>
> Key: KAFKA-7647
> URL: https://issues.apache.org/jira/browse/KAFKA-7647
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core, unit tests
>Affects Versions: 2.1.1, 2.3.0
>Reporter: Dong Lin
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> {code}
> kafka.log.LogCleanerParameterizedIntegrationTest >
> testCleansCombinedCompactAndDeleteTopic[3] FAILED
>     java.lang.AssertionError: Contents of the map shouldn't change
> expected: (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
> (354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
> 2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
> (348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
> was: (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
> 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
> (299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
> (355,355))>
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failNotEquals(Assert.java:834)
>         at org.junit.Assert.assertEquals(Assert.java:118)
>         at
> kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8248) Producer may fail IllegalStateException

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821561#comment-16821561
 ] 

ASF GitHub Bot commented on KAFKA-8248:
---

mjsax commented on pull request #6607: KAFKA-8248: Fix IllegalStateException in 
Producer
URL: https://github.com/apache/kafka/pull/6607
 
 
   *More detailed description of your change,
   if necessary. The PR title and PR message become
   the squashed commit message, so use a separate
   comment to ping reviewers.*
   
   *Summary of testing strategy (including rationale)
   for the feature or bug fix. Unit and/or integration
   tests are expected for any behaviour change and
   system tests should be considered for larger changes.*
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Producer may fail IllegalStateException
> ---
>
> Key: KAFKA-8248
> URL: https://issues.apache.org/jira/browse/KAFKA-8248
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> In a Kafka Streams application, we observed the following log from the 
> producer:
> {quote}2019-04-17T01:58:25.898Z 17466081 [kafka-producer-network-thread | 
> client-id-enrichment-gcoint-StreamThread-7-0_10-producer] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - [Producer 
> clientId=client-id-enrichment-gcoint-StreamThread-7-0_10-producer, 
> transactionalId=application-id-enrichment-gcoint-0_10] Uncaught error in 
> kafka producer I/O thread: 
> 2019-04-17T01:58:25.898Z java.lang.IllegalStateException: Attempt to send a 
> request to node 1 which is not ready.
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.doSend(NetworkClient.java:430)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.send(NetworkClient.java:411)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendTransactionalRequest(Sender.java:362)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:214)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
> 2019-04-17T01:58:25.898Z at java.lang.Thread.run(Thread.java:748)
> {quote}
> Later, Kafka Streams (running with EOS enabled) shuts down with a 
> `TimeoutException` that occurs during rebalance. It seem that the above error 
> results in this `TimeoutException`. However, and `IllegalStateException` seem 
> to indicate a bug in the producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6958) Allow to define custom processor names with KStreams DSL

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821548#comment-16821548
 ] 

ASF GitHub Bot commented on KAFKA-6958:
---

bbejeck commented on pull request #6410: KAFKA-6958: Allow to name operation 
using parameter classes
URL: https://github.com/apache/kafka/pull/6410
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow to define custom processor names with KStreams DSL
> 
>
> Key: KAFKA-6958
> URL: https://issues.apache.org/jira/browse/KAFKA-6958
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 1.1.0
>Reporter: Florian Hussonnois
>Assignee: Florian Hussonnois
>Priority: Minor
>  Labels: kip
>
> Currently, while building a new Topology through the KStreams DSL the 
> processors are automatically named.
> The genarated names are prefixed depending of the operation (i.e 
> KSTREAM-SOURCE, KSTREAM-FILTER, KSTREAM-MAP, etc).
> To debug/understand a topology it is possible to display the processor 
> lineage with the method Topology#describe(). However, a complex topology with 
> dozens of operations can be hard to understand if the processor names are not 
> relevant.
> It would be useful to be able to set more meaningful names. For example, a 
> processor name could describe the business rule performed by a map() 
> operation.
> [KIP-307|https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7672) The local state not fully restored after KafkaStream rebalanced, resulting in data loss

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821530#comment-16821530
 ] 

Matthias J. Sax commented on KAFKA-7672:


[~guozhang] Do  you think it might be worth to backport this to older versions, 
too? The PR was rather complex, and I am not sure how severe we think this 
issue is?

> The local state not fully restored after KafkaStream rebalanced, resulting in 
> data loss
> ---
>
> Key: KAFKA-7672
> URL: https://issues.apache.org/jira/browse/KAFKA-7672
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.1.0, 1.1.1, 2.0.0, 2.1.0
>Reporter: linyue li
>Assignee: linyue li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Normally, when a task is migrated to a new thread and no checkpoint file was 
> found under its task folder, Kafka Stream needs to restore the local state 
> for remote changelog topic completely and then resume running. However, in 
> some scenarios, we found that Kafka Stream *NOT* restore this state even no 
> checkpoint was found, but just clean the state folder and transition to 
> running state directly, resulting the historic data loss. 
> To be specific, I will give the detailed logs for Kafka Stream in our project 
> to show this scenario: 
> {quote}2018-10-23 08:27:07,684 INFO  
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer 
> clientId=AuditTrailBatch-StreamThread-1-consumer, groupId=AuditTrailBatch] 
> Revoking previously assigned partitions [AuditTrailBatch-0-5]
> 2018-10-23 08:27:07,684 INFO  
> org.apache.kafka.streams.processor.internals.StreamThread    - stream-thread 
> [AuditTrailBatch-StreamThread-1] State transition from PARTITIONS_ASSIGNED to 
> PARTITIONS_REVOKED
> 2018-10-23 08:27:10,856 INFO  
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer 
> clientId=AuditTrailBatch-StreamThread-1-consumer, groupId=AuditTrailBatch] 
> (Re-)joining group
> 2018-10-23 08:27:53,153 INFO  
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer 
> clientId=AuditTrailBatch-StreamThread-1-consumer, groupId=AuditTrailBatch] 
> Successfully joined group with generation 323
> 2018-10-23 08:27:53,153 INFO  
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer 
> clientId=AuditTrailBatch-StreamThread-1-consumer, groupId=AuditTrailBatch] 
> Setting newly assigned partitions [AuditTrailBatch-store1-repartition-1]
> 2018-10-23 08:27:53,153 INFO  
> org.apache.kafka.streams.processor.internals.StreamThread    - stream-thread 
> [AuditTrailBatch-StreamThread-1] State transition from PARTITIONS_REVOKED to 
> PARTITIONS_ASSIGNED
> 2018-10-23 08:27:53,153 INFO  
> org.apache.kafka.streams.processor.internals.StreamThread    - stream-thread 
> [AuditTrailBatch-StreamThread-1] *Creating producer client for task 1_1*
> 2018-10-23 08:27:53,622 INFO  
> org.apache.kafka.streams.processor.internals.StreamThread    - stream-thread 
> [AuditTrailBatch-StreamThread-1] partition assignment took 469 ms.
> 2018-10-23 08:27:54,357 INFO  
> org.apache.kafka.streams.processor.internals.StoreChangelogReader - 
> stream-thread [AuditTrailBatch-StreamThread-1]*No checkpoint found for task 
> 1_1 state store AuditTrailBatch-store1-changelog-1 with EOS turned on.* 
> *Reinitializing the task and restore its state from the beginning.*
> 2018-10-23 08:27:54,357 INFO  
> org.apache.kafka.clients.consumer.internals.Fetcher  - [Consumer 
> clientId=AuditTrailBatch-StreamThread-1-restore-consumer, groupId=]*Resetting 
> offset for partition AuditTrailBatch-store1-changelog-1 to offset 0.*
> 2018-10-23 08:27:54,653 INFO  
> org.apache.kafka.streams.processor.internals.StreamThread    - stream-thread 
> [AuditTrailBatch-StreamThread-1]*State transition from PARTITIONS_ASSIGNED to 
> RUNNING*
> {quote}
> From the logs above, we can get the procedure for thread 
> AuditTrailBatch-StreamThread-1:
>  # the previous running task assigned to thread 1 is task 0_5 (the 
> corresponding partition is AuditTrailBatch-0-5)
>  # group begins to rebalance, the new task 1_1 is assigned to thread 1.
>  # no checkpoint was found under 1_1 state folder, so reset the offset to 0 
> and clean the local state folder.
>  # thread 1 transitions to RUNNING state directly without the restoration for 
> task 1_1, so the historic data for state 1_1 is lost for thread 1. 
> *ThoubleShoot*
> To investigate the cause for this issue, we analysis the source code in 
> KafkaStream and found the key is the variable named "completedRestorers".
> This is the definition of the variable:
> {code:java}
> private final Set completedRestorers = new HashSet<>();{code}
> Each thread object has its own 

[jira] [Commented] (KAFKA-8153) Streaming application with state stores takes up to 1 hour to restart

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821528#comment-16821528
 ] 

Matthias J. Sax commented on KAFKA-8153:


>  Perhaps tasks are reshuffled among different instances?

That can happen. Kafka Streasm tries to assign tasks in a sticky way, however, 
it's best effort and stickyness may also be overruled by load balancing 
decisions. This issue is currently addressed via KIP-345.

It seems, we can close this ticket as "not a problem"?

> Streaming application with state stores takes up to 1 hour to restart
> -
>
> Key: KAFKA-8153
> URL: https://issues.apache.org/jira/browse/KAFKA-8153
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0
>Reporter: Michael Melsen
>Priority: Major
>
> We are using spring cloud stream with Kafka streams 2.0.1 and utilizing the 
> InteractiveQueryService to fetch data from the stores. There are 4 stores 
> that persist data on disk after aggregating data. The code for the topology 
> looks like this:
> {code:java}
> @Slf4j
> @EnableBinding(SensorMeasurementBinding.class)
> public class Consumer {
>   public static final String RETENTION_MS = "retention.ms";
>   public static final String CLEANUP_POLICY = "cleanup.policy";
>   @Value("${windowstore.retention.ms}")
>   private String retention;
> /**
>  * Process the data flowing in from a Kafka topic. Aggregate the data to:
>  * - 2 minute
>  * - 15 minutes
>  * - one hour
>  * - 12 hours
>  *
>  * @param stream
>  */
> @StreamListener(SensorMeasurementBinding.ERROR_SCORE_IN)
> public void process(KStream stream) {
> Map topicConfig = new HashMap<>();
> topicConfig.put(RETENTION_MS, retention);
> topicConfig.put(CLEANUP_POLICY, "delete");
> log.info("Changelog and local window store retention.ms: {} and 
> cleanup.policy: {}",
> topicConfig.get(RETENTION_MS),
> topicConfig.get(CLEANUP_POLICY));
> createWindowStore(LocalStore.TWO_MINUTES_STORE, topicConfig, stream);
> createWindowStore(LocalStore.FIFTEEN_MINUTES_STORE, topicConfig, stream);
> createWindowStore(LocalStore.ONE_HOUR_STORE, topicConfig, stream);
> createWindowStore(LocalStore.TWELVE_HOURS_STORE, topicConfig, stream);
>   }
>   private void createWindowStore(
> LocalStore localStore,
> Map topicConfig,
> KStream stream) {
> // Configure how the statestore should be materialized using the provide 
> storeName
> Materialized> materialized 
> = Materialized
> .as(localStore.getStoreName());
> // Set retention of changelog topic
> materialized.withLoggingEnabled(topicConfig);
> // Configure how windows looks like and how long data will be retained in 
> local stores
> TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
> localStore.getTimeUnit(), 
> Long.parseLong(topicConfig.get(RETENTION_MS)));
> // Processing description:
> // The input data are 'samples' with key 
> :::
> // 1. With the map we add the Tag to the key and we extract the error 
> score from the data
> // 2. With the groupByKey we group  the data on the new key
> // 3. With windowedBy we split up the data in time intervals depending on 
> the provided LocalStore enum
> // 4. With reduce we determine the maximum value in the time window
> // 5. Materialized will make it stored in a table
> stream
> .map(getInstallationAssetModelAlgorithmTagKeyMapper())
> .groupByKey()
> .windowedBy(configuredTimeWindows)
> .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, 
> newValue), materialized);
>   }
>   private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long 
> retentionMs) {
> TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
> timeWindows.until(retentionMs);
> return timeWindows;
>   }
>   /**
>* Determine the max error score to keep by looking at the aggregated error 
> signal and
>* freshly consumed error signal
>*
>* @param aggValue
>* @param newValue
>* @return
>*/
>   private ErrorScore getMaxErrorScore(ErrorScore aggValue, ErrorScore 
> newValue) {
> if(aggValue.getErrorSignal() > newValue.getErrorSignal()) {
> return aggValue;
> }
> return newValue;
>   }
>   private KeyValueMapper KeyValue> 
> getInstallationAssetModelAlgorithmTagKeyMapper() {
> return (s, sensorMeasurement) -> new KeyValue<>(s + "::" + 
> sensorMeasurement.getT(),
> new ErrorScore(sensorMeasurement.getTs(), 
> sensorMeasurement.getE(), sensorMeasurement.getO()));
>   }
> }
> {code}
> So we are materializing aggregated data to four different stores after 
> determining the max value within a specific 

[jira] [Updated] (KAFKA-7847) KIP-421: Automatically resolve external configurations.

2019-04-18 Thread TEJAL ADSUL (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TEJAL ADSUL updated KAFKA-7847:
---
Summary: KIP-421: Automatically resolve external configurations.  (was: 
KIP-421: Support resolving externalized secrets in AbstractConfig)

> KIP-421: Automatically resolve external configurations.
> ---
>
> Key: KAFKA-7847
> URL: https://issues.apache.org/jira/browse/KAFKA-7847
> Project: Kafka
>  Issue Type: Improvement
>  Components: config
>Reporter: TEJAL ADSUL
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This proposal intends to enhance the AbstractConfig base class to support 
> replacing variables in configurations just prior to parsing and validation. 
> This simple change will make it very easy for client applications, Kafka 
> Connect, and Kafka Streams to use shared code to easily incorporate 
> externalized secrets and other variable replacements within their 
> configurations. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8245) Flaky Test DeleteConsumerGroupsTest#testDeleteCmdAllGroups

2019-04-18 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821491#comment-16821491
 ] 

John Roesler commented on KAFKA-8245:
-

Saw this again:
https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3874/testReport/junit/kafka.admin/DeleteConsumerGroupsTest/testDeleteCmdAllGroups/

{noformat}
Error Message
java.lang.AssertionError: The group did become empty as expected.
Stacktrace
java.lang.AssertionError: The group did become empty as expected.
at kafka.utils.TestUtils$.fail(TestUtils.scala:382)
at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:792)
at 
kafka.admin.DeleteConsumerGroupsTest.testDeleteCmdAllGroups(DeleteConsumerGroupsTest.scala:148)
{noformat}

{noformat}
Standard Output
Error: Deletion of some consumer groups failed:
* Group 'test.group' could not be deleted due to: 
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.GroupNotEmptyException: The group is not empty.
[2019-04-18 18:57:09,800] WARN Unable to read additional data from client 
sessionid 0x10505a1cff70001, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376)

Error: Deletion of some consumer groups failed:
* Group 'missing.group' could not be deleted due to: 
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.GroupIdNotFoundException: The group id does not 
exist.
Deletion of requested consumer groups ('test.group') was successful.

Error: Deletion of some consumer groups failed:
* Group 'missing.group' could not be deleted due to: 
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.GroupIdNotFoundException: The group id does not 
exist.

These consumer groups were deleted successfully: 'test.group'
{noformat}

> Flaky Test DeleteConsumerGroupsTest#testDeleteCmdAllGroups
> --
>
> Key: KAFKA-8245
> URL: https://issues.apache.org/jira/browse/KAFKA-8245
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.3.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3781/testReport/junit/kafka.admin/DeleteConsumerGroupsTest/testDeleteCmdAllGroups/]
> {quote}java.lang.AssertionError: The group did become empty as expected. at 
> kafka.utils.TestUtils$.fail(TestUtils.scala:381) at 
> kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:791) at 
> kafka.admin.DeleteConsumerGroupsTest.testDeleteCmdAllGroups(DeleteConsumerGroupsTest.scala:148){quote}
> STDOUT
> {quote}Error: Deletion of some consumer groups failed: * Group 'test.group' 
> could not be deleted due to: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.GroupNotEmptyException: The group is not 
> empty. Error: Deletion of some consumer groups failed: * Group 
> 'missing.group' could not be deleted due to: 
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.GroupIdNotFoundException: The group id does 
> not exist. [2019-04-16 09:42:02,316] WARN Unable to read additional data from 
> client sessionid 0x104f958dba3, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) Deletion of requested 
> consumer groups ('test.group') was successful. Error: Deletion of some 
> consumer groups failed: * Group 'missing.group' could not be deleted due to: 
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.GroupIdNotFoundException: The group id does 
> not exist. These consumer groups were deleted successfully: 
> 'test.group'{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8260) Flaky test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread John Roesler (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Roesler updated KAFKA-8260:

Description: 
I have seen this fail again just now. See also KAFKA-7965 and KAFKA-7936.

https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3874/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/

{noformat}
Error Message
org.scalatest.junit.JUnitTestFailedError: Should have received an class 
org.apache.kafka.common.errors.GroupMaxSizeReachedException during the cluster 
roll
Stacktrace
org.scalatest.junit.JUnitTestFailedError: Should have received an class 
org.apache.kafka.common.errors.GroupMaxSizeReachedException during the cluster 
roll
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException(AssertionsForJUnit.scala:100)
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException$(AssertionsForJUnit.scala:99)
at 
org.scalatest.junit.JUnitSuite.newAssertionFailedException(JUnitSuite.scala:71)
at org.scalatest.Assertions.fail(Assertions.scala:1089)
at org.scalatest.Assertions.fail$(Assertions.scala:1085)
at org.scalatest.junit.JUnitSuite.fail(JUnitSuite.scala:71)
at 
kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:350)

{noformat}

{noformat}
Standard Output
[2019-04-18 18:26:47,487] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:47,676] ERROR [ReplicaFetcher replicaId=0, leaderId=1, 
fetcherId=0] Error for partition topic-0 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:47,677] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
fetcherId=0] Error for partition topic-1 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:47,698] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=0] Error for partition topic-1 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:48,023] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
fetcherId=0] Error for partition closetest-8 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:48,023] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
fetcherId=0] Error for partition closetest-2 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:48,023] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
fetcherId=0] Error for partition closetest-5 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:26:52,136] WARN Unable to read additional data from client 
sessionid 0x105058603f80001, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376)
[2019-04-18 18:27:00,021] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:27:00,025] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:27:00,198] ERROR [ReplicaFetcher replicaId=0, leaderId=1, 
fetcherId=0] Error for partition topic-0 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:27:00,198] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
fetcherId=0] Error for partition topic-1 at offset 0 
(kafka.server.ReplicaFetcherThread:76)
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
[2019-04-18 18:27:14,032] ERROR [Consumer clientId=ConsumerTestConsumer, 
groupId=group2] Offset commit failed on partition topic-0 at offset 10: This is 
not the 

[jira] [Assigned] (KAFKA-6819) Refactor build-in StreamsMetrics internal implementations

2019-04-18 Thread Bruno Cadonna (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Cadonna reassigned KAFKA-6819:


Assignee: Bruno Cadonna

> Refactor build-in StreamsMetrics internal implementations
> -
>
> Key: KAFKA-6819
> URL: https://issues.apache.org/jira/browse/KAFKA-6819
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Bruno Cadonna
>Priority: Major
>
> Our current internal implementations of StreamsMetrics and different layered 
> metrics like StreamMetricsThreadImpl, TaskMetrics, NodeMetrics etc are a bit 
> messy nowadays. We could improve on the current situation by doing the 
> following:
> 0. For thread-level metrics, refactor the {{StreamsMetricsThreadImpl}} class 
> to {{ThreadMetrics}} such that a) it does not extend from 
> {{StreamsMetricsImpl}} but just include the {{StreamsMetricsThreadImpl}} as 
> its constructor parameters. And make its constructor, replacing with a static 
> {{addAllSensors(threadName)}} that tries to register all the thread-level 
> sensors for the given thread name.
> 1. Add a static function for each of the built-in sensors of the thread-level 
> metrics in {{ThreadMetrics}} that relies on the internal 
> {{StreamsMetricsConventions}} to get thread level sensor names. If the sensor 
> cannot be found from the internal {{Metrics}} registry, create the sensor 
> on-the-fly.
> 2.a Add a static {{removeAllSensors(threadName)}} function in 
> {{ThreadMetrics}} that tries to de-register all the thread-level metrics for 
> this thread, if there is no sensors then it will be a no-op. In 
> {{StreamThread#close()}} we will trigger this function; and similarly in 
> `TopologyTestDriver` when we close the driver we will also call this function 
> as well. As a result, the {{ThreadMetrics}} class itself would only contain 
> static functions with no member fields at all.
> 2.b We can consider doing the same for {{TaskMetrics}}, {{NodeMetrics}} and 
> {{NamedCacheMetrics}} as well, and add a {{StoreMetrics}} following the 
> similar pattern: although these metrics are not accessed externally to their 
> enclosing class in the future this may be changed as well.
> 3. Then, we only pass {{StreamsMetricsImpl}} around between the internal 
> classes, to access the specific sensor whenever trying to record it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8260) Flaky test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread John Roesler (JIRA)
John Roesler created KAFKA-8260:
---

 Summary: Flaky test 
ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
 Key: KAFKA-8260
 URL: https://issues.apache.org/jira/browse/KAFKA-8260
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 2.3.0
Reporter: John Roesler


I have seen this fail again just now. See also KAFKA-7965 and KAFKA-7936.

https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/3874/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/

{noformat}
Error Message
org.scalatest.junit.JUnitTestFailedError: Should have received an class 
org.apache.kafka.common.errors.GroupMaxSizeReachedException during the cluster 
roll
Stacktrace
org.scalatest.junit.JUnitTestFailedError: Should have received an class 
org.apache.kafka.common.errors.GroupMaxSizeReachedException during the cluster 
roll
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException(AssertionsForJUnit.scala:100)
at 
org.scalatest.junit.AssertionsForJUnit.newAssertionFailedException$(AssertionsForJUnit.scala:99)
at 
org.scalatest.junit.JUnitSuite.newAssertionFailedException(JUnitSuite.scala:71)
at org.scalatest.Assertions.fail(Assertions.scala:1089)
at org.scalatest.Assertions.fail$(Assertions.scala:1085)
at org.scalatest.junit.JUnitSuite.fail(JUnitSuite.scala:71)
at 
kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:350)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8259) Build RPM for Kafka

2019-04-18 Thread Patrick Dignan (JIRA)
Patrick Dignan created KAFKA-8259:
-

 Summary: Build RPM for Kafka
 Key: KAFKA-8259
 URL: https://issues.apache.org/jira/browse/KAFKA-8259
 Project: Kafka
  Issue Type: Improvement
  Components: build
Reporter: Patrick Dignan


RPM packaging eases the installation and deployment of Kafka to make it much 
more standard.

I noticed in https://issues.apache.org/jira/browse/KAFKA-1324 [~jkreps] closed 
the issue because other sources provide packaging.  I think it's worthwhile for 
the standard, open source project to provide this as a base to reduce redundant 
work and provide this functionality for users.  Other similar open source 
software like Elasticsearch create an RPM [(Elasticsearch 
RPM)|[https://github.com/elastic/elasticsearch/blob/0ad3d90a36529bf369813ea6253f305e11aff2e9/distribution/packages/build.gradle]].
  This also makes forking internally more maintainable by reducing the amount 
of work to be done for each version upgrade.

I have a patch to add this functionality that I will clean up and PR on Github.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7760) Add broker configuration to set minimum value for segment.bytes and segment.ms

2019-04-18 Thread Brandt Newton (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandt Newton reassigned KAFKA-7760:


Assignee: Brandt Newton

> Add broker configuration to set minimum value for segment.bytes and segment.ms
> --
>
> Key: KAFKA-7760
> URL: https://issues.apache.org/jira/browse/KAFKA-7760
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Badai Aqrandista
>Assignee: Brandt Newton
>Priority: Major
>  Labels: kip, newbie
>
> If someone set segment.bytes or segment.ms at topic level to a very small 
> value (e.g. segment.bytes=1000 or segment.ms=1000), Kafka will generate a 
> very high number of segment files. This can bring down the whole broker due 
> to hitting the maximum open file (for log) or maximum number of mmap-ed file 
> (for index).
> To prevent that from happening, I would like to suggest adding two new items 
> to the broker configuration:
>  * min.topic.segment.bytes, defaults to 1048576: The minimum value for 
> segment.bytes. When someone sets topic configuration segment.bytes to a value 
> lower than this, Kafka throws an error INVALID VALUE.
>  * min.topic.segment.ms, defaults to 360: The minimum value for 
> segment.ms. When someone sets topic configuration segment.ms to a value lower 
> than this, Kafka throws an error INVALID VALUE.
> Thanks
> Badai



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8258) Verbose logs in org.apache.kafka.clients.consumer.internals.Fetcher

2019-04-18 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated KAFKA-8258:

Description: 
We noticed that the Spark's Kafka connector outputs a lot of following verbose 
logs:
{code:java}
19/04/18 04:31:06 INFO Fetcher: [Consumer clientId=consumer-2, groupId=...] 
Resetting offset for partition ... to offset  
{code}
It comes from 
[https://github.com/hachikuji/kafka/blob/76c796ca128c3c97231f3ebda994a07bb06b26aa/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L567]

This log was added in [https://github.com/apache/kafka/pull/4557]

In Spark, we use `seekToEnd` to discover latest offsets of a topic. If there 
are thousands of partitions in this topic, it will output thousands of INFO 
logs every call

Is it intentional? If not, can we change it to DEBUG?

  was:
We noticed that the Spark's Kafka connector outputs a lot of following verbose 
logs:
{code}
19/04/18 04:31:06 INFO Fetcher: [Consumer clientId=consumer-2, groupId=...] 
Resetting offset for partition ... to offset  
{code}

It comes from 
https://github.com/hachikuji/kafka/blob/76c796ca128c3c97231f3ebda994a07bb06b26aa/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L567

This log was added in https://github.com/apache/kafka/pull/4557

In Spark, we use `seekToEnd` to discover latest offsets of a topic. If there 
are thousands of partitions in this topic, it will output thousands of INFO 
logs.

Is it intentional? If not, can we change it to DEBUG?


> Verbose logs in org.apache.kafka.clients.consumer.internals.Fetcher
> ---
>
> Key: KAFKA-8258
> URL: https://issues.apache.org/jira/browse/KAFKA-8258
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We noticed that the Spark's Kafka connector outputs a lot of following 
> verbose logs:
> {code:java}
> 19/04/18 04:31:06 INFO Fetcher: [Consumer clientId=consumer-2, groupId=...] 
> Resetting offset for partition ... to offset  
> {code}
> It comes from 
> [https://github.com/hachikuji/kafka/blob/76c796ca128c3c97231f3ebda994a07bb06b26aa/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L567]
> This log was added in [https://github.com/apache/kafka/pull/4557]
> In Spark, we use `seekToEnd` to discover latest offsets of a topic. If there 
> are thousands of partitions in this topic, it will output thousands of INFO 
> logs every call
> Is it intentional? If not, can we change it to DEBUG?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8258) Verbose logs in org.apache.kafka.clients.consumer.internals.Fetcher

2019-04-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created KAFKA-8258:
---

 Summary: Verbose logs in 
org.apache.kafka.clients.consumer.internals.Fetcher
 Key: KAFKA-8258
 URL: https://issues.apache.org/jira/browse/KAFKA-8258
 Project: Kafka
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Shixiong Zhu


We noticed that the Spark's Kafka connector outputs a lot of following verbose 
logs:
{code}
19/04/18 04:31:06 INFO Fetcher: [Consumer clientId=consumer-2, groupId=...] 
Resetting offset for partition ... to offset  
{code}

It comes from 
https://github.com/hachikuji/kafka/blob/76c796ca128c3c97231f3ebda994a07bb06b26aa/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L567

This log was added in https://github.com/apache/kafka/pull/4557

In Spark, we use `seekToEnd` to discover latest offsets of a topic. If there 
are thousands of partitions in this topic, it will output thousands of INFO 
logs.

Is it intentional? If not, can we change it to DEBUG?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821419#comment-16821419
 ] 

Michael K. Edwards commented on KAFKA-7026:
---

Good enough for me.  Thank you.

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8248) Producer may fail IllegalStateException

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821393#comment-16821393
 ] 

Matthias J. Sax commented on KAFKA-8248:


It was reported for `2.0.0` but we assume it affects newer (and even older 
versions), too.

> Producer may fail IllegalStateException
> ---
>
> Key: KAFKA-8248
> URL: https://issues.apache.org/jira/browse/KAFKA-8248
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> In a Kafka Streams application, we observed the following log from the 
> producer:
> {quote}2019-04-17T01:58:25.898Z 17466081 [kafka-producer-network-thread | 
> client-id-enrichment-gcoint-StreamThread-7-0_10-producer] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - [Producer 
> clientId=client-id-enrichment-gcoint-StreamThread-7-0_10-producer, 
> transactionalId=application-id-enrichment-gcoint-0_10] Uncaught error in 
> kafka producer I/O thread: 
> 2019-04-17T01:58:25.898Z java.lang.IllegalStateException: Attempt to send a 
> request to node 1 which is not ready.
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.doSend(NetworkClient.java:430)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.send(NetworkClient.java:411)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendTransactionalRequest(Sender.java:362)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:214)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
> 2019-04-17T01:58:25.898Z at java.lang.Thread.run(Thread.java:748)
> {quote}
> Later, Kafka Streams (running with EOS enabled) shuts down with a 
> `TimeoutException` that occurs during rebalance. It seem that the above error 
> results in this `TimeoutException`. However, and `IllegalStateException` seem 
> to indicate a bug in the producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-8248) Producer may fail IllegalStateException

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-8248:
--

Assignee: Matthias J. Sax

> Producer may fail IllegalStateException
> ---
>
> Key: KAFKA-8248
> URL: https://issues.apache.org/jira/browse/KAFKA-8248
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> In a Kafka Streams application, we observed the following log from the 
> producer:
> {quote}2019-04-17T01:58:25.898Z 17466081 [kafka-producer-network-thread | 
> client-id-enrichment-gcoint-StreamThread-7-0_10-producer] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - [Producer 
> clientId=client-id-enrichment-gcoint-StreamThread-7-0_10-producer, 
> transactionalId=application-id-enrichment-gcoint-0_10] Uncaught error in 
> kafka producer I/O thread: 
> 2019-04-17T01:58:25.898Z java.lang.IllegalStateException: Attempt to send a 
> request to node 1 which is not ready.
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.doSend(NetworkClient.java:430)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.NetworkClient.send(NetworkClient.java:411)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendTransactionalRequest(Sender.java:362)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:214)
> 2019-04-17T01:58:25.898Z at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
> 2019-04-17T01:58:25.898Z at java.lang.Thread.run(Thread.java:748)
> {quote}
> Later, Kafka Streams (running with EOS enabled) shuts down with a 
> `TimeoutException` that occurs during rebalance. It seem that the above error 
> results in this `TimeoutException`. However, and `IllegalStateException` seem 
> to indicate a bug in the producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7866) Duplicate offsets after transaction index append failure

2019-04-18 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-7866.

   Resolution: Fixed
 Assignee: Jason Gustafson
Fix Version/s: 2.2.1
   2.1.2
   2.0.2

> Duplicate offsets after transaction index append failure
> 
>
> Key: KAFKA-7866
> URL: https://issues.apache.org/jira/browse/KAFKA-7866
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.0.2, 2.1.2, 2.2.1
>
>
> We have encountered a situation in which an ABORT marker was written 
> successfully to the log, but failed to be written to the transaction index. 
> This prevented the log end offset from being incremented. This resulted in 
> duplicate offsets when the next append was attempted. The broker was using 
> JBOD and we would normally expect IOExceptions to cause the log directory to 
> be failed. That did not seem to happen here and the duplicates continued for 
> several hours.
> Unfortunately, we are not sure what the cause of the failure was. 
> Significantly, the first duplicate was also the first ABORT marker in the 
> log. Unlike the offset and timestamp index, the transaction index is created 
> on demand after the first aborted transction. It is likely that the attempt 
> to create and open the transaction index failed. There is some suggestion 
> that the process may have bumped into the open file limit. Whatever the 
> problem was, it also prevented log collection, so we cannot confirm our 
> guesses. 
> Without knowing the underlying cause, we can still consider some potential 
> improvements:
> 1. We probably should be catching non-IO exceptions in the append process. If 
> the append to one of the indexes fails, we potentially truncate the log or 
> re-throw it as an IOException to ensure that the log directory is no longer 
> used.
> 2. Even without the unexpected exception, there is a small window during 
> which even an IOException could lead to duplicate offsets. Marking a log 
> directory offline is an asynchronous operation and there is no guarantee that 
> another append cannot happen first. Given this, we probably need to detect 
> and truncate duplicates during the log recovery process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821391#comment-16821391
 ] 

Matthias J. Sax commented on KAFKA-7026:


Pretty confident.

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821388#comment-16821388
 ] 

Matthias J. Sax commented on KAFKA-7965:


[~bbejeck] The patch was merged yesterday – can we verify if the build contains 
the patch? If yes, we should reopen this ticket.

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: huxihx
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8257) Flaky Test DynamicConnectionQuotaTest#testDynamicListenerConnectionQuota

2019-04-18 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created KAFKA-8257:
--

 Summary: Flaky Test 
DynamicConnectionQuotaTest#testDynamicListenerConnectionQuota
 Key: KAFKA-8257
 URL: https://issues.apache.org/jira/browse/KAFKA-8257
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 2.3.0
Reporter: Matthias J. Sax
 Fix For: 2.3.0


[https://builds.apache.org/blue/organizations/jenkins/kafka-trunk-jdk8/detail/kafka-trunk-jdk8/3566/tests]
{quote}java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at kafka.server.BaseRequestTest.receiveResponse(BaseRequestTest.scala:87)
at kafka.server.BaseRequestTest.sendAndReceive(BaseRequestTest.scala:148)
at 
kafka.network.DynamicConnectionQuotaTest.verifyConnection(DynamicConnectionQuotaTest.scala:229)
at 
kafka.network.DynamicConnectionQuotaTest.$anonfun$testDynamicListenerConnectionQuota$4(DynamicConnectionQuotaTest.scala:133)
at 
kafka.network.DynamicConnectionQuotaTest.$anonfun$testDynamicListenerConnectionQuota$4$adapted(DynamicConnectionQuotaTest.scala:133)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
kafka.network.DynamicConnectionQuotaTest.testDynamicListenerConnectionQuota(DynamicConnectionQuotaTest.scala:133){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8257) Flaky Test DynamicConnectionQuotaTest#testDynamicListenerConnectionQuota

2019-04-18 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-8257:
---
Issue Type: Bug  (was: Improvement)

> Flaky Test DynamicConnectionQuotaTest#testDynamicListenerConnectionQuota
> 
>
> Key: KAFKA-8257
> URL: https://issues.apache.org/jira/browse/KAFKA-8257
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> [https://builds.apache.org/blue/organizations/jenkins/kafka-trunk-jdk8/detail/kafka-trunk-jdk8/3566/tests]
> {quote}java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at kafka.server.BaseRequestTest.receiveResponse(BaseRequestTest.scala:87)
> at kafka.server.BaseRequestTest.sendAndReceive(BaseRequestTest.scala:148)
> at 
> kafka.network.DynamicConnectionQuotaTest.verifyConnection(DynamicConnectionQuotaTest.scala:229)
> at 
> kafka.network.DynamicConnectionQuotaTest.$anonfun$testDynamicListenerConnectionQuota$4(DynamicConnectionQuotaTest.scala:133)
> at 
> kafka.network.DynamicConnectionQuotaTest.$anonfun$testDynamicListenerConnectionQuota$4$adapted(DynamicConnectionQuotaTest.scala:133)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at 
> kafka.network.DynamicConnectionQuotaTest.testDynamicListenerConnectionQuota(DynamicConnectionQuotaTest.scala:133){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8030) Flaky Test TopicCommandWithAdminClientTest#testDescribeUnderMinIsrPartitionsMixed

2019-04-18 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821368#comment-16821368
 ] 

Matthias J. Sax commented on KAFKA-8030:


Failed again: 
[https://builds.apache.org/blue/organizations/jenkins/kafka-trunk-jdk11/detail/kafka-trunk-jdk11/445/tests]

> Flaky Test 
> TopicCommandWithAdminClientTest#testDescribeUnderMinIsrPartitionsMixed
> -
>
> Key: KAFKA-8030
> URL: https://issues.apache.org/jira/browse/KAFKA-8030
> Project: Kafka
>  Issue Type: Bug
>  Components: admin, unit tests
>Affects Versions: 2.3.0
>Reporter: Matthias J. Sax
>Assignee: Viktor Somogyi-Vass
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/2830/testReport/junit/kafka.admin/TopicCommandWithAdminClientTest/testDescribeUnderMinIsrPartitionsMixed/]
> {quote}java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at 
> org.junit.Assert.assertTrue(Assert.java:42) at 
> org.junit.Assert.assertTrue(Assert.java:53) at 
> kafka.admin.TopicCommandWithAdminClientTest.testDescribeUnderMinIsrPartitionsMixed(TopicCommandWithAdminClientTest.scala:602){quote}
> STDERR
> {quote}Option "[replica-assignment]" can't be used with option 
> "[partitions]"{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8235) NoSuchElementException when restoring state after a clean shutdown of a Kafka Streams application

2019-04-18 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821363#comment-16821363
 ] 

John Roesler commented on KAFKA-8235:
-

Hey [~AndrewRK],

Sorry about that. I failed to notice that nuance of the TreeMap API. The fix is 
now part of my PR. Would you like to give it a try again?

If there are any further issues, you can just leave comments on the PR.

Thanks for helping to test the change! It was good to catch this before 
releasing it!

-John

> NoSuchElementException when restoring state after a clean shutdown of a Kafka 
> Streams application
> -
>
> Key: KAFKA-8235
> URL: https://issues.apache.org/jira/browse/KAFKA-8235
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
> Environment: Linux, CentOS 7, Java 1.8.0_191, 50 partitions per 
> topic, replication factor 3
>Reporter: Andrew Klopper
>Priority: Major
>
> While performing a larger scale test of a new Kafka Streams application that 
> performs aggregation and suppression, we have discovered that we are unable 
> to restart the application after a clean shutdown. The error that is logged 
> is:
> {code:java}
> [rollupd-5ee1997b-8302-4e88-b3db-a74c8b805a6c-StreamThread-1] ERROR 
> org.apache.kafka.streams.processor.internals.StreamThread - stream-thread 
> [rollupd-5ee1997b-8302-4e88-b3db-a74c8b805a6c-StreamThread-1] Encountered the 
> following error during processing:
> java.util.NoSuchElementException
> at java.util.TreeMap.key(TreeMap.java:1327)
> at java.util.TreeMap.firstKey(TreeMap.java:290)
> at 
> org.apache.kafka.streams.state.internals.InMemoryTimeOrderedKeyValueBuffer.restoreBatch(InMemoryTimeOrderedKeyValueBuffer.java:288)
> at 
> org.apache.kafka.streams.processor.internals.CompositeRestoreListener.restoreBatch(CompositeRestoreListener.java:89)
> at 
> org.apache.kafka.streams.processor.internals.StateRestorer.restore(StateRestorer.java:92)
> at 
> org.apache.kafka.streams.processor.internals.StoreChangelogReader.processNext(StoreChangelogReader.java:317)
> at 
> org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:92)
> at 
> org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
> {code}
> The issue doesn't seem to occur for small amounts of data, but it doesn't 
> take a particularly large amount of data to trigger the problem either.
> Any assistance would be greatly appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8256) Replace Heartbeat request/response with automated protocol

2019-04-18 Thread Mickael Maison (JIRA)
Mickael Maison created KAFKA-8256:
-

 Summary: Replace Heartbeat request/response with automated protocol
 Key: KAFKA-8256
 URL: https://issues.apache.org/jira/browse/KAFKA-8256
 Project: Kafka
  Issue Type: Sub-task
Reporter: Mickael Maison
Assignee: Mickael Maison






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8185) Controller becomes stale and not able to failover the leadership for the partitions

2019-04-18 Thread Kang H Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kang H Lee updated KAFKA-8185:
--
Attachment: (was: broker9.zip)

> Controller becomes stale and not able to failover the leadership for the 
> partitions
> ---
>
> Key: KAFKA-8185
> URL: https://issues.apache.org/jira/browse/KAFKA-8185
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 1.1.1
>Reporter: Kang H Lee
>Priority: Critical
>
> Description:
> After broker 9 went offline, all partitions led by it went offline. The 
> controller attempted to move leadership but ran into an exception while doing 
> so:
> {code:java}
> // [2019-03-26 01:23:34,114] ERROR [PartitionStateMachine controllerId=12] 
> Error while moving some partitions to OnlinePartition state 
> (kafka.controller.PartitionStateMachine)
> java.util.NoSuchElementException: key not found: me-test-1
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> kafka.controller.PartitionStateMachine.initializeLeaderAndIsrForPartitions(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:167)
> at 
> kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116)
> at 
> kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:106)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onReplicasBecomeOffline(KafkaController.scala:437)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onBrokerFailure(KafkaController.scala:405)
> at 
> kafka.controller.KafkaController$BrokerChange$.process(KafkaController.scala:1246)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> The controller was unable to move leadership of partitions led by broker 9 as 
> a result. It's worth noting that the controller ran into the same exception 
> when the broker came back up online. The controller thinks `me-test-1` is a 
> new partition and when attempting to transition it to an online partition, it 
> is unable to retrieve its replica assignment from 
> ControllerContext#partitionReplicaAssignment. I need to look through the code 
> to figure out if there's a race condition or situations where we remove the 
> partition from ControllerContext#partitionReplicaAssignment but might still 
> leave it in PartitionStateMachine#partitionState.
> They had to change the controller to recover from the offline status.
> Sequential event:
> * Broker 9 got restated in between : 2019-03-26 01:22:54,236 - 2019-03-26 
> 01:27:30,967: This was unclean shutdown.
> * From 2019-03-26 01:27:30,967, broker 9 was rebuilding indexes. Broker 9 
> wasn't able to process data at this moment.
> * At 2019-03-26 01:29:36,741, broker 9 was starting to load replica.
> * [2019-03-26 01:29:36,202] ERROR [KafkaApi-9] Number of alive brokers '0' 
> does not meet the required replication factor '3' for the offsets topic 
> (configured via 'offsets.topic.replication.factor'). This error can be 
> ignored if the cluster is starting up and not all brokers are up yet. 
> (kafka.server.KafkaApis)
> * At 2019-03-26 01:29:37,270, broker 9 started report offline partitions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8185) Controller becomes stale and not able to failover the leadership for the partitions

2019-04-18 Thread Kang H Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kang H Lee updated KAFKA-8185:
--
Attachment: (was: broker12.zip)

> Controller becomes stale and not able to failover the leadership for the 
> partitions
> ---
>
> Key: KAFKA-8185
> URL: https://issues.apache.org/jira/browse/KAFKA-8185
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 1.1.1
>Reporter: Kang H Lee
>Priority: Critical
>
> Description:
> After broker 9 went offline, all partitions led by it went offline. The 
> controller attempted to move leadership but ran into an exception while doing 
> so:
> {code:java}
> // [2019-03-26 01:23:34,114] ERROR [PartitionStateMachine controllerId=12] 
> Error while moving some partitions to OnlinePartition state 
> (kafka.controller.PartitionStateMachine)
> java.util.NoSuchElementException: key not found: me-test-1
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> kafka.controller.PartitionStateMachine.initializeLeaderAndIsrForPartitions(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:167)
> at 
> kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116)
> at 
> kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:106)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onReplicasBecomeOffline(KafkaController.scala:437)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onBrokerFailure(KafkaController.scala:405)
> at 
> kafka.controller.KafkaController$BrokerChange$.process(KafkaController.scala:1246)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> The controller was unable to move leadership of partitions led by broker 9 as 
> a result. It's worth noting that the controller ran into the same exception 
> when the broker came back up online. The controller thinks `me-test-1` is a 
> new partition and when attempting to transition it to an online partition, it 
> is unable to retrieve its replica assignment from 
> ControllerContext#partitionReplicaAssignment. I need to look through the code 
> to figure out if there's a race condition or situations where we remove the 
> partition from ControllerContext#partitionReplicaAssignment but might still 
> leave it in PartitionStateMachine#partitionState.
> They had to change the controller to recover from the offline status.
> Sequential event:
> * Broker 9 got restated in between : 2019-03-26 01:22:54,236 - 2019-03-26 
> 01:27:30,967: This was unclean shutdown.
> * From 2019-03-26 01:27:30,967, broker 9 was rebuilding indexes. Broker 9 
> wasn't able to process data at this moment.
> * At 2019-03-26 01:29:36,741, broker 9 was starting to load replica.
> * [2019-03-26 01:29:36,202] ERROR [KafkaApi-9] Number of alive brokers '0' 
> does not meet the required replication factor '3' for the offsets topic 
> (configured via 'offsets.topic.replication.factor'). This error can be 
> ignored if the cluster is starting up and not all brokers are up yet. 
> (kafka.server.KafkaApis)
> * At 2019-03-26 01:29:37,270, broker 9 started report offline partitions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8185) Controller becomes stale and not able to failover the leadership for the partitions

2019-04-18 Thread Kang H Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kang H Lee updated KAFKA-8185:
--
Attachment: (was: zookeeper.zip)

> Controller becomes stale and not able to failover the leadership for the 
> partitions
> ---
>
> Key: KAFKA-8185
> URL: https://issues.apache.org/jira/browse/KAFKA-8185
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 1.1.1
>Reporter: Kang H Lee
>Priority: Critical
>
> Description:
> After broker 9 went offline, all partitions led by it went offline. The 
> controller attempted to move leadership but ran into an exception while doing 
> so:
> {code:java}
> // [2019-03-26 01:23:34,114] ERROR [PartitionStateMachine controllerId=12] 
> Error while moving some partitions to OnlinePartition state 
> (kafka.controller.PartitionStateMachine)
> java.util.NoSuchElementException: key not found: me-test-1
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine$$anonfun$14.apply(PartitionStateMachine.scala:202)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> kafka.controller.PartitionStateMachine.initializeLeaderAndIsrForPartitions(PartitionStateMachine.scala:202)
> at 
> kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:167)
> at 
> kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116)
> at 
> kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:106)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onReplicasBecomeOffline(KafkaController.scala:437)
> at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onBrokerFailure(KafkaController.scala:405)
> at 
> kafka.controller.KafkaController$BrokerChange$.process(KafkaController.scala:1246)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
> at 
> kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> The controller was unable to move leadership of partitions led by broker 9 as 
> a result. It's worth noting that the controller ran into the same exception 
> when the broker came back up online. The controller thinks `me-test-1` is a 
> new partition and when attempting to transition it to an online partition, it 
> is unable to retrieve its replica assignment from 
> ControllerContext#partitionReplicaAssignment. I need to look through the code 
> to figure out if there's a race condition or situations where we remove the 
> partition from ControllerContext#partitionReplicaAssignment but might still 
> leave it in PartitionStateMachine#partitionState.
> They had to change the controller to recover from the offline status.
> Sequential event:
> * Broker 9 got restated in between : 2019-03-26 01:22:54,236 - 2019-03-26 
> 01:27:30,967: This was unclean shutdown.
> * From 2019-03-26 01:27:30,967, broker 9 was rebuilding indexes. Broker 9 
> wasn't able to process data at this moment.
> * At 2019-03-26 01:29:36,741, broker 9 was starting to load replica.
> * [2019-03-26 01:29:36,202] ERROR [KafkaApi-9] Number of alive brokers '0' 
> does not meet the required replication factor '3' for the offsets topic 
> (configured via 'offsets.topic.replication.factor'). This error can be 
> ignored if the cluster is starting up and not all brokers are up yet. 
> (kafka.server.KafkaApis)
> * At 2019-03-26 01:29:37,270, broker 9 started report offline partitions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8255) Replica fetcher thread exits with OffsetOutOfRangeException

2019-04-18 Thread Colin P. McCabe (JIRA)
Colin P. McCabe created KAFKA-8255:
--

 Summary: Replica fetcher thread exits with 
OffsetOutOfRangeException
 Key: KAFKA-8255
 URL: https://issues.apache.org/jira/browse/KAFKA-8255
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Colin P. McCabe


Replica fetcher threads can exits with OffsetOutOfRangeException when the log 
start offset has advanced beyond the high water mark on the fetching broker.

Example stack trace:
{code}
org.apache.kafka.common.KafkaException: Error processing data for partition 
__consumer_offsets-46 offset 18761
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2$$anonfun$apply$mcV$sp$3$$anonfun$apply$10.apply(AbstractFetcherThread.scala:335)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2$$anonfun$apply$mcV$sp$3$$anonfun$apply$10.apply(AbstractFetcherThread.scala:294)
at scala.Option.foreach(Option.scala:257)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2$$anonfun$apply$mcV$sp$3.apply(AbstractFetcherThread.scala:294)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2$$anonfun$apply$mcV$sp$3.apply(AbstractFetcherThread.scala:293)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:293)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2.apply(AbstractFetcherThread.scala:293)
at 
kafka.server.AbstractFetcherThread$$anonfun$kafka$server$AbstractFetcherThread$$processFetchRequest$2.apply(AbstractFetcherThread.scala:293)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
at 
kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:292)
at 
kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:132)
at 
kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:131)
at scala.Option.foreach(Option.scala:257)
at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot 
increment the log start offset to 4808819 of partition __consumer_offsets-46 
since it is larger than the high watermark 18761
[2019-04-16 14:16:42,257] INFO [ReplicaFetcher replicaId=1001, leaderId=1003, 
fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
{code}

It seems that we should not terminate the replica fetcher thread in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2019-04-18 Thread Bill Bejeck (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821276#comment-16821276
 ] 

Bill Bejeck commented on KAFKA-7965:


Failed again, 

[https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/21017/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/]

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: huxihx
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821274#comment-16821274
 ] 

Michael K. Edwards commented on KAFKA-7026:
---

How confident are we that analogous flaws do not exist in 
StreamsPartitionAssignor?

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7652) Kafka Streams Session store performance degradation from 0.10.2.2 to 0.11.0.0

2019-04-18 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-7652.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

It should be fixed by the latest PR (details about the performance benchmarks 
can be found inside the PR).

> Kafka Streams Session store performance degradation from 0.10.2.2 to 0.11.0.0
> -
>
> Key: KAFKA-7652
> URL: https://issues.apache.org/jira/browse/KAFKA-7652
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0, 0.11.0.1, 0.11.0.2, 0.11.0.3, 1.1.1, 2.0.0, 
> 2.0.1
>Reporter: Jonathan Gordon
>Assignee: Guozhang Wang
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
> Attachments: 0.10.2.1-NamedCache.txt, 2.2.0-rc0_b-NamedCache.txt, 
> 2.3.0-7652-NamedCache.txt, kafka_10_2_1_flushes.txt, kafka_11_0_3_flushes.txt
>
>
> I'm creating this issue in response to [~guozhang]'s request on the mailing 
> list:
> [https://lists.apache.org/thread.html/97d620f4fd76be070ca4e2c70e2fda53cafe051e8fc4505dbcca0321@%3Cusers.kafka.apache.org%3E]
> We are attempting to upgrade our Kafka Streams application from 0.10.2.1 but 
> experience a severe performance degradation. The highest amount of CPU time 
> seems spent in retrieving from the local cache. Here's an example thread 
> profile with 0.11.0.0:
> [https://i.imgur.com/l5VEsC2.png]
> When things are running smoothly we're gated by retrieving from the state 
> store with acceptable performance. Here's an example thread profile with 
> 0.10.2.1:
> [https://i.imgur.com/IHxC2cZ.png]
> Some investigation reveals that it appears we're performing about 3 orders 
> magnitude more lookups on the NamedCache over a comparable time period. I've 
> attached logs of the NamedCache flush logs for 0.10.2.1 and 0.11.0.3.
> We're using session windows and have the app configured for 
> commit.interval.ms = 30 * 1000 and cache.max.bytes.buffering = 10485760
> I'm happy to share more details if they would be helpful. Also happy to run 
> tests on our data.
> I also found this issue, which seems like it may be related:
> https://issues.apache.org/jira/browse/KAFKA-4904
>  
> KIP-420: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-420%3A+Add+Single+Value+Fetch+in+Session+Stores]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6598) Kafka to support using ETCD beside Zookeeper

2019-04-18 Thread K. C. Tessarek (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821255#comment-16821255
 ] 

K. C. Tessarek commented on KAFKA-6598:
---

I'd also like to know the status of this. I really would like to use etcd 
instead of Zookeeper.

> Kafka to support using ETCD beside Zookeeper
> 
>
> Key: KAFKA-6598
> URL: https://issues.apache.org/jira/browse/KAFKA-6598
> Project: Kafka
>  Issue Type: New Feature
>  Components: clients, core
>Reporter: Sebastian Toader
>Priority: Major
>
> The current Kafka implementation is bound to {{Zookeeper}} to store its 
> metadata for forming a cluster of nodes (producer/consumer/broker). 
> As Kafka is becoming popular for streaming in various environments where 
> {{Zookeeper}} is either not easy to deploy/manage or there are better 
> alternatives to it there is a need 
> to run Kafka with other metastore implementation than {{Zookeeper}}.
> {{etcd}} can provide the same semantics as {{Zookeeper}} for Kafka and since 
> {{etcd}} is the favorable choice in certain environments (e.g. Kubernetes) 
> Kafka should be able to run with {{etcd}}.
> From the user's point of view should be straightforward to configure to use 
> {{etcd}} by just simply specifying a connection string that point to {{etcd}} 
> cluster.
> To avoid introducing instability the original interfaces should be kept and 
> only the low level {{Zookeeper}} API calls should be replaced with \{{etcd}} 
> API calls in case Kafka is configured 
> to use {{etcd}}.
> On the long run (which is out of scope of this jira) there should be an 
> abstract layer in Kafka which then various metastore implementations would 
> implement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821254#comment-16821254
 ] 

Michael K. Edwards commented on KAFKA-7026:
---

Thank you [~vahid] and reviewers!  We will be eagerly anticipating a release 
that contains this fix.  Is it a candidate for backport to the 2.2.x 
maintenance branch?

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7866) Duplicate offsets after transaction index append failure

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821238#comment-16821238
 ] 

ASF GitHub Bot commented on KAFKA-7866:
---

hachikuji commented on pull request #6570: KAFKA-7866; Ensure no duplicate 
offsets after txn index append failure
URL: https://github.com/apache/kafka/pull/6570
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Duplicate offsets after transaction index append failure
> 
>
> Key: KAFKA-7866
> URL: https://issues.apache.org/jira/browse/KAFKA-7866
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> We have encountered a situation in which an ABORT marker was written 
> successfully to the log, but failed to be written to the transaction index. 
> This prevented the log end offset from being incremented. This resulted in 
> duplicate offsets when the next append was attempted. The broker was using 
> JBOD and we would normally expect IOExceptions to cause the log directory to 
> be failed. That did not seem to happen here and the duplicates continued for 
> several hours.
> Unfortunately, we are not sure what the cause of the failure was. 
> Significantly, the first duplicate was also the first ABORT marker in the 
> log. Unlike the offset and timestamp index, the transaction index is created 
> on demand after the first aborted transction. It is likely that the attempt 
> to create and open the transaction index failed. There is some suggestion 
> that the process may have bumped into the open file limit. Whatever the 
> problem was, it also prevented log collection, so we cannot confirm our 
> guesses. 
> Without knowing the underlying cause, we can still consider some potential 
> improvements:
> 1. We probably should be catching non-IO exceptions in the append process. If 
> the append to one of the indexes fails, we potentially truncate the log or 
> re-throw it as an IOException to ensure that the log directory is no longer 
> used.
> 2. Even without the unexpected exception, there is a small window during 
> which even an IOException could lead to duplicate offsets. Marking a log 
> directory offline is an asynchronous operation and there is no guarantee that 
> another append cannot happen first. Given this, we probably need to detect 
> and truncate duplicates during the log recovery process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8254) Suppress incorrectly passes a null topic to the serdes

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821230#comment-16821230
 ] 

ASF GitHub Bot commented on KAFKA-8254:
---

vvcephei commented on pull request #6602: KAFKA-8254: Pass Changelog as Topic 
in Suppress Serdes
URL: https://github.com/apache/kafka/pull/6602
 
 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Suppress incorrectly passes a null topic to the serdes
> --
>
> Key: KAFKA-8254
> URL: https://issues.apache.org/jira/browse/KAFKA-8254
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0, 2.2.0, 2.1.1
>Reporter: John Roesler
>Assignee: John Roesler
>Priority: Major
> Fix For: 2.3.0, 2.1.2, 2.2.1
>
>
> For example, in KTableSuppressProcessor, we do:
> {noformat}
> final Bytes serializedKey = Bytes.wrap(keySerde.serializer().serialize(null, 
> key));
> {noformat}
> This violates the contract of Serializer (and Deserializer), and breaks 
> integration with known Serde implementations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-7026.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.3.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2019-04-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821228#comment-16821228
 ] 

ASF GitHub Bot commented on KAFKA-7026:
---

hachikuji commented on pull request #5291: KAFKA-7026: Sticky Assignor 
Partition Assignment Improvement (KIP-341)
URL: https://github.com/apache/kafka/pull/5291
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-8254) Suppress incorrectly passes a null topic to the serdes

2019-04-18 Thread John Roesler (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Roesler reassigned KAFKA-8254:
---

Assignee: John Roesler

> Suppress incorrectly passes a null topic to the serdes
> --
>
> Key: KAFKA-8254
> URL: https://issues.apache.org/jira/browse/KAFKA-8254
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0, 2.2.0, 2.1.1
>Reporter: John Roesler
>Assignee: John Roesler
>Priority: Major
> Fix For: 2.3.0, 2.1.2, 2.2.1
>
>
> For example, in KTableSuppressProcessor, we do:
> {noformat}
> final Bytes serializedKey = Bytes.wrap(keySerde.serializer().serialize(null, 
> key));
> {noformat}
> This violates the contract of Serializer (and Deserializer), and breaks 
> integration with known Serde implementations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8153) Streaming application with state stores takes up to 1 hour to restart

2019-04-18 Thread Michael Melsen (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820821#comment-16820821
 ] 

Michael Melsen commented on KAFKA-8153:
---

No unfortunately disabling the clean up didn't solve the problem. However I do 
have a theory what happens:

 

We run different instances of the same streaming application. Each instance is 
running in a docker container on top of kubernetes. Each container has its own 
persistence. When restarting the instances, I think that some or sometimes all 
instances got linked to other persistent volumes instead of the once they were 
previously using. Perhaps tasks are reshuffled among different instances? This 
caused the state stores to become obsolete and restoration process to kick in. 
We solved this by utilizing NFS to share the persistent volumes in a way that 
all instances would point to the same state store directory structure. This 
seems to solve the issue

> Streaming application with state stores takes up to 1 hour to restart
> -
>
> Key: KAFKA-8153
> URL: https://issues.apache.org/jira/browse/KAFKA-8153
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0
>Reporter: Michael Melsen
>Priority: Major
>
> We are using spring cloud stream with Kafka streams 2.0.1 and utilizing the 
> InteractiveQueryService to fetch data from the stores. There are 4 stores 
> that persist data on disk after aggregating data. The code for the topology 
> looks like this:
> {code:java}
> @Slf4j
> @EnableBinding(SensorMeasurementBinding.class)
> public class Consumer {
>   public static final String RETENTION_MS = "retention.ms";
>   public static final String CLEANUP_POLICY = "cleanup.policy";
>   @Value("${windowstore.retention.ms}")
>   private String retention;
> /**
>  * Process the data flowing in from a Kafka topic. Aggregate the data to:
>  * - 2 minute
>  * - 15 minutes
>  * - one hour
>  * - 12 hours
>  *
>  * @param stream
>  */
> @StreamListener(SensorMeasurementBinding.ERROR_SCORE_IN)
> public void process(KStream stream) {
> Map topicConfig = new HashMap<>();
> topicConfig.put(RETENTION_MS, retention);
> topicConfig.put(CLEANUP_POLICY, "delete");
> log.info("Changelog and local window store retention.ms: {} and 
> cleanup.policy: {}",
> topicConfig.get(RETENTION_MS),
> topicConfig.get(CLEANUP_POLICY));
> createWindowStore(LocalStore.TWO_MINUTES_STORE, topicConfig, stream);
> createWindowStore(LocalStore.FIFTEEN_MINUTES_STORE, topicConfig, stream);
> createWindowStore(LocalStore.ONE_HOUR_STORE, topicConfig, stream);
> createWindowStore(LocalStore.TWELVE_HOURS_STORE, topicConfig, stream);
>   }
>   private void createWindowStore(
> LocalStore localStore,
> Map topicConfig,
> KStream stream) {
> // Configure how the statestore should be materialized using the provide 
> storeName
> Materialized> materialized 
> = Materialized
> .as(localStore.getStoreName());
> // Set retention of changelog topic
> materialized.withLoggingEnabled(topicConfig);
> // Configure how windows looks like and how long data will be retained in 
> local stores
> TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
> localStore.getTimeUnit(), 
> Long.parseLong(topicConfig.get(RETENTION_MS)));
> // Processing description:
> // The input data are 'samples' with key 
> :::
> // 1. With the map we add the Tag to the key and we extract the error 
> score from the data
> // 2. With the groupByKey we group  the data on the new key
> // 3. With windowedBy we split up the data in time intervals depending on 
> the provided LocalStore enum
> // 4. With reduce we determine the maximum value in the time window
> // 5. Materialized will make it stored in a table
> stream
> .map(getInstallationAssetModelAlgorithmTagKeyMapper())
> .groupByKey()
> .windowedBy(configuredTimeWindows)
> .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, 
> newValue), materialized);
>   }
>   private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long 
> retentionMs) {
> TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
> timeWindows.until(retentionMs);
> return timeWindows;
>   }
>   /**
>* Determine the max error score to keep by looking at the aggregated error 
> signal and
>* freshly consumed error signal
>*
>* @param aggValue
>* @param newValue
>* @return
>*/
>   private ErrorScore getMaxErrorScore(ErrorScore aggValue, ErrorScore 
> newValue) {
> if(aggValue.getErrorSignal() > newValue.getErrorSignal()) {
> return aggValue;
> }
> return newValue;
>   }
>   

[jira] [Created] (KAFKA-8253) Unrecoverable state when a broker's INPUT access is blocked (Zombie broker)

2019-04-18 Thread Gowtham Gutha (JIRA)
Gowtham Gutha created KAFKA-8253:


 Summary: Unrecoverable state when a broker's INPUT access is 
blocked (Zombie broker)
 Key: KAFKA-8253
 URL: https://issues.apache.org/jira/browse/KAFKA-8253
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.1.1
Reporter: Gowtham Gutha


I carried out a test as mentioned in this particular SO question.

[https://stackoverflow.com/questions/55706589/what-happens-if-the-leader-is-not-dead-but-unable-to-receive-messages-in-kafka]

*Gist of the test:*

A broker's INPUT access is blocked. So it is not able to receive any messages.

But still it can send heartbeats to ZK, so that a leader election will not 
happen.

So any message produced to the partition lead by this _zombie_ broker is never 
produced leaving the system in an unrecoverable state.

 

*Possible resolution:*

There should be a 2 way communication such that, if a broker is not able to 
have any INPUT access, the ZK MUST know of it by sending some ping messages to 
the brokers.

If there is no response from the broker, elect a new one. Since, if the broker 
is not ping-able by ZK, that broker is as good as dead for its purpose.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8253) Unrecoverable state when a broker's INPUT access is blocked (Zombie broker)

2019-04-18 Thread Gowtham Gutha (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gowtham Gutha updated KAFKA-8253:
-
Description: 
I carried out a test as mentioned in my SO question

[https://stackoverflow.com/questions/55706589/what-happens-if-the-leader-is-not-dead-but-unable-to-receive-messages-in-kafka]

*Gist of the test:*

A broker's INPUT access is blocked. So it is not able to receive any messages.

But still it can send heartbeats to ZK, so that a leader election will not 
happen.

So any message produced to the partition lead by this _zombie_ broker is never 
produced leaving the system in an unrecoverable state.

 

*Possible resolution:*

There should be a 2 way communication such that, if a broker is not able to 
have any INPUT access, the ZK MUST know of it by sending some ping messages to 
the brokers.

If there is no response from the broker, elect a new one. Since, if the broker 
is not ping-able by ZK, that broker is as good as dead for its purpose.

 

 

  was:
I carried out a test as mentioned in this particular SO question.

[https://stackoverflow.com/questions/55706589/what-happens-if-the-leader-is-not-dead-but-unable-to-receive-messages-in-kafka]

*Gist of the test:*

A broker's INPUT access is blocked. So it is not able to receive any messages.

But still it can send heartbeats to ZK, so that a leader election will not 
happen.

So any message produced to the partition lead by this _zombie_ broker is never 
produced leaving the system in an unrecoverable state.

 

*Possible resolution:*

There should be a 2 way communication such that, if a broker is not able to 
have any INPUT access, the ZK MUST know of it by sending some ping messages to 
the brokers.

If there is no response from the broker, elect a new one. Since, if the broker 
is not ping-able by ZK, that broker is as good as dead for its purpose.

 

 


> Unrecoverable state when a broker's INPUT access is blocked (Zombie broker)
> ---
>
> Key: KAFKA-8253
> URL: https://issues.apache.org/jira/browse/KAFKA-8253
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Gowtham Gutha
>Priority: Critical
>
> I carried out a test as mentioned in my SO question
> [https://stackoverflow.com/questions/55706589/what-happens-if-the-leader-is-not-dead-but-unable-to-receive-messages-in-kafka]
> *Gist of the test:*
> A broker's INPUT access is blocked. So it is not able to receive any messages.
> But still it can send heartbeats to ZK, so that a leader election will not 
> happen.
> So any message produced to the partition lead by this _zombie_ broker is 
> never produced leaving the system in an unrecoverable state.
>  
> *Possible resolution:*
> There should be a 2 way communication such that, if a broker is not able to 
> have any INPUT access, the ZK MUST know of it by sending some ping messages 
> to the brokers.
> If there is no response from the broker, elect a new one. Since, if the 
> broker is not ping-able by ZK, that broker is as good as dead for its purpose.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)