[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299415#comment-16299415 ]

Matthias J. Sax commented on KAFKA-6323:
----------------------------------------

[~guozhang] I agree that we should pass in the current wall-clock time {{NOW}} when calling a punctuation. But to my understanding, that is what we are doing at the moment anyway, and it holds for wall-clock time as well as stream time. Or am I missing something?

> punctuate with WALL_CLOCK_TIME triggered immediately
> ----------------------------------------------------
>
>                 Key: KAFKA-6323
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6323
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Frederic Arno
>            Assignee: Frederic Arno
>             Fix For: 1.1.0, 1.0.1
>
> While working on a custom Processor from which I am scheduling a punctuation
> using WALL_CLOCK_TIME, I noticed that whatever punctuation interval I set, a
> call to my Punctuator is always triggered immediately.
> Having a quick look at kafka-streams' code, I found that all
> PunctuationSchedule timestamps are matched against the current time in order
> to decide whether or not to trigger the punctuator
> (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate).
> However, I have only seen code that initializes a PunctuationSchedule's
> timestamp to 0, which I guess is what causes the immediate punctuation.
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's
> timestamp be initialized to current time + interval?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
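The behavior the reporter describes, and the fix the ticket suggests, can be sketched with a toy model of the deadline check. All class and method names below are illustrative, not Kafka's actual PunctuationQueue internals:

```java
// Toy model of the mayPunctuate() deadline check described in KAFKA-6323.
// Names are illustrative, not the real kafka-streams internals.
public class PunctuationScheduleSketch {
    final long intervalMs;
    long nextFireMs;

    PunctuationScheduleSketch(long intervalMs, long firstFireMs) {
        this.intervalMs = intervalMs;
        this.nextFireMs = firstFireMs;
    }

    // Reported behavior: the first deadline defaults to 0, so any wall-clock
    // "now" is already past due and the punctuator fires immediately.
    static PunctuationScheduleSketch buggy(long intervalMs) {
        return new PunctuationScheduleSketch(intervalMs, 0L);
    }

    // Suggested fix: initialize the first deadline to current time + interval.
    static PunctuationScheduleSketch fixed(long intervalMs, long nowMs) {
        return new PunctuationScheduleSketch(intervalMs, nowMs + intervalMs);
    }

    boolean mayPunctuate(long nowMs) {
        if (nowMs < nextFireMs) {
            return false;
        }
        nextFireMs = nowMs + intervalMs;  // schedule the next punctuation
        return true;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("buggy fires immediately: " + buggy(1000).mayPunctuate(now));
        System.out.println("fixed fires immediately: " + fixed(1000, now).mayPunctuate(now));
    }
}
```

With the `fixed` initialization, the first punctuation waits a full interval before firing, which is the behavior the ticket asks for.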
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299408#comment-16299408 ]

ASF GitHub Bot commented on KAFKA-6366:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/4349

    KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits during coordinator disconnect

    When the coordinator is marked unknown, we explicitly disconnect its connection and cancel pending requests. Currently the disconnect happens before the coordinator state is set to null, which means that callbacks which inspect the coordinator state still see it as active. This can lead to further requests being sent. In pathological cases, the disconnect itself is not able to complete because new requests are sent to the coordinator before it finishes, which leads to the stack overflow error.

    To fix the problem, I have reordered the disconnect to happen after the coordinator is set to null. I have added a basic test case to verify that callbacks for in-flight or unsent requests see the coordinator as unknown, which prevents them from attempting to resend. We may need additional test cases after we determine whether this is in fact what is happening in the reported ticket. Note that I have also included some minor cleanups which I noticed along the way.
### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-6366

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4349.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4349

commit 488de3dca5be6111fd447980c8e79477259dc99a
Author: Jason Gustafson
Date:   2017-12-18T18:53:38Z

    KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits during coordinator disconnect

> StackOverflowError in kafka-coordinator-heartbeat-thread
> --------------------------------------------------------
>
>                 Key: KAFKA-6366
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6366
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 1.0.0
>            Reporter: Joerg Heinicke
>            Assignee: Jason Gustafson
>         Attachments: 6366.v1.txt, ConverterProcessor.zip, Screenshot-2017-12-19 21.35-22.10 processing.png
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing
> once a StackOverflowError has occurred in the heartbeat thread due to
> connectivity issues between the consumers and the coordinating broker.
> Immediately before the exception there are hundreds, if not thousands, of log
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | my-consumer-group] INFO - [Consumer clientId=consumer-4, groupId=my-consumer-group] Marking the coordinator : (id: 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even though at
> different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | my-consumer-group] ERROR - Uncaught exception in thread 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>         at java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>         at java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>         at java.util.Calendar.getDisplayName(Calendar.java:2110)
>         at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>         at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>         at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>         at java.text.DateFormat.format(DateFormat.java:345)
>         at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>         at org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>         at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>         at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>         at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>         at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>         at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>         at org.apache.log4j.Category.callAppenders(Category.java:206)
>         at org.ap
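The reordering described in the pull request can be illustrated with a toy model. Names below are illustrative, not the real AbstractCoordinator or ConsumerNetworkClient internals: when the coordinator reference is cleared only after the disconnect, failure callbacks still see it as active and keep resending; clearing it first makes them no-op.

```java
// Toy model of the fix in PR #4349: set the coordinator to null BEFORE
// disconnecting, so failure callbacks see it as unknown and do not resend.
import java.util.ArrayDeque;
import java.util.Deque;

public class CoordinatorDeadSketch {
    String coordinator = "node-2147483645";        // non-null means "active"
    final Deque<Runnable> inFlight = new ArrayDeque<>();
    int sends = 0;

    void send(Runnable onFailure) { sends++; inFlight.add(onFailure); }

    // Disconnecting fails every pending request, running its callback.
    // The cap keeps this toy bounded where the real code overflowed the stack.
    void disconnect() {
        while (!inFlight.isEmpty() && sends < 10) inFlight.poll().run();
    }

    void coordinatorDeadBuggy() {
        disconnect();          // callbacks still see coordinator != null
        coordinator = null;
    }

    void coordinatorDeadFixed() {
        coordinator = null;    // callbacks now see the coordinator as unknown
        disconnect();
    }

    // A callback that resends while the coordinator still looks alive,
    // mirroring the offset-commit pile-up described in the PR.
    Runnable retryingCallback() {
        return () -> { if (coordinator != null) send(retryingCallback()); };
    }

    static int sendsAfterDeath(boolean nullFirst) {
        CoordinatorDeadSketch s = new CoordinatorDeadSketch();
        s.send(s.retryingCallback());
        if (nullFirst) s.coordinatorDeadFixed(); else s.coordinatorDeadBuggy();
        return s.sends;
    }

    public static void main(String[] args) {
        System.out.println("buggy order sends: " + sendsAfterDeath(false));
        System.out.println("fixed order sends: " + sendsAfterDeath(true));
    }
}
```

In the buggy ordering the resend loop runs until the artificial cap; in the fixed ordering the single original request fails cleanly and nothing is resent.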
[jira] [Commented] (KAFKA-6126) Reduce rebalance time by not checking if created topics are available
[ https://issues.apache.org/jira/browse/KAFKA-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299406#comment-16299406 ]

ASF GitHub Bot commented on KAFKA-6126:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/4322

> Reduce rebalance time by not checking if created topics are available
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-6126
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6126
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Matthias J. Sax
>            Assignee: Matthias J. Sax
>             Fix For: 1.1.0
>
> Within {{StreamPartitionAssignor#assign}} we create new topics and afterwards
> wait in an "infinite loop" until the topic metadata has propagated throughout
> the cluster. We do this to make sure the topics are available when we start
> processing.
> However, with this approach we extend the time spent in the rebalance phase
> and are thus not responsive (there are no calls to {{poll}} for the liveness
> check, and {{KafkaStreams#close}} suffers). Thus, we might want to remove
> this check and handle potential "topic not found" exceptions in the main
> thread gracefully.
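The trade-off the ticket describes, replacing an unbounded metadata wait with a bounded one so the rebalance cannot hang indefinitely, might look like the following hypothetical helper. All names are invented for illustration; this is not the actual StreamPartitionAssignor code:

```java
// Hypothetical sketch: wait for topic metadata with a deadline instead of an
// unbounded loop, so the caller can handle "topic not found" gracefully later.
import java.util.function.BooleanSupplier;

public class BoundedMetadataWait {
    // Returns true if topicsVisible became true within timeoutMs (polling
    // every pollMs); false means the caller must cope with missing topics.
    static boolean awaitTopics(BooleanSupplier topicsVisible, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (topicsVisible.getAsBoolean()) return true;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve interrupt status
                return topicsVisible.getAsBoolean();
            }
        }
        return topicsVisible.getAsBoolean();         // one last check at the deadline
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Metadata "propagates" after ~50 ms in this toy check.
        boolean ok = awaitTopics(() -> System.currentTimeMillis() - start >= 50, 500, 10);
        System.out.println("metadata visible in time: " + ok);
        // Never propagates: the wait stays bounded instead of hanging the rebalance.
        System.out.println("bounded on missing topic: " + !awaitTopics(() -> false, 100, 10));
    }
}
```

The point is the bounded deadline: a caller that gets `false` back can surface a "topic not found" error from the main thread instead of blocking `poll` and `KafkaStreams#close`.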
[jira] [Updated] (KAFKA-5863) Potential null dereference in DistributedHerder#reconfigureConnector()
[ https://issues.apache.org/jira/browse/KAFKA-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated KAFKA-5863:
--------------------------
    Description:

Here is the call chain:
{code}
RestServer.httpRequest(reconfigUrl, "POST", taskProps, null);
{code}
In httpRequest():
{code}
} else if (responseCode >= 200 && responseCode < 300) {
    InputStream is = connection.getInputStream();
    T result = JSON_SERDE.readValue(is, responseFormat);
{code}
For readValue():
{code}
public <T> T readValue(InputStream src, TypeReference<T> valueTypeRef)
    throws IOException, JsonParseException, JsonMappingException
{
    return (T) _readMapAndClose(_jsonFactory.createParser(src),
        _typeFactory.constructType(valueTypeRef));
{code}
Then there would be an NPE in constructType():
{code}
public JavaType constructType(TypeReference<?> typeRef)
{
    // 19-Oct-2015, tatu: Simpler variant like so should work
    return _fromAny(null, typeRef.getType(), EMPTY_BINDINGS);
{code}

> Potential null dereference in DistributedHerder#reconfigureConnector()
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-5863
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5863
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Minor
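The null dereference above can be modeled without pulling in Jackson: the null passed as the last argument to {{httpRequest}} flows unchecked through {{readValue()}} into {{constructType()}}, which dereferences it. The sketch below uses simplified stand-in types, not the real Jackson or Connect code:

```java
// Stripped-down model of the KAFKA-5863 call chain. TypeRef stands in for
// Jackson's TypeReference; the method bodies are simplified for illustration.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class NullTypeRefSketch {
    interface TypeRef<T> { Class<T> getType(); }

    static <T> String constructType(TypeRef<T> typeRef) {
        return typeRef.getType().getName();  // NPE when typeRef is null
    }

    static <T> String readValue(InputStream src, TypeRef<T> typeRef) {
        return constructType(typeRef);       // null flows through untouched
    }

    static boolean throwsNpeOnNullTypeRef() {
        InputStream is = new ByteArrayInputStream("{}".getBytes());
        try {
            readValue(is, null);             // what httpRequest(..., null) amounts to
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + throwsNpeOnNullTypeRef());
    }
}
```

A null check (or an overload that skips deserialization when no response type is expected) at the `httpRequest` boundary would avoid the deep NPE.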
[jira] [Commented] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
[ https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299357#comment-16299357 ]

huxihx commented on KAFKA-6375:
-------------------------------

Disable the Windows firewall and retry to see if this exception disappears.

> Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6375
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6375
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: Windows, 23-broker Kafka cluster
>            Reporter: Rong Tang
>
> Hi, I ran into a case where, on one broker, the out-of-sync replicas never
> catch up.
> When the broker starts up, it receives LeaderAndIsr requests from the
> controller, which call createFetcherThread; the thread creation failed with
> the exceptions below. After that there is no fetcher for these follower
> replicas, and they stay out of sync forever, unless the broker later receives
> LeaderAndIsr requests with a higher leader epoch. The broker had 260 of its
> 330 replicas out of sync for one day, until I restarted it. Restarting the
> broker can mitigate the issue.
> I have two questions:
> First, why did creating a new ReplicaFetcherThread fail?
> *Second, should Kafka do something to fail over, instead of leaving the
> broker in an abnormal state?*
> It is a 23-broker Kafka cluster running on Windows; each broker has 330
> replicas.
> [2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing LeaderAndIsr request with correlationId 1 received from controller 427703487 epoch 22 (state.change.logger)
> org.apache.kafka.common.KafkaException: java.io.IOException: Unable to establish loopback connection
>         at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
>         at kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
>         at kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
>         at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
>         at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>         at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
>         at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
>         at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to establish loopback connection
>         at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
>         at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
>         at sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
>         at java.nio.channels.Pipe.open(Pipe.java:155)
>         at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
>         at sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
>         at java.nio.channels.Selector.open(Selector.java:227)
>         at org.apache.kafka.common.network.Selector.<init>(Selector.java:122)
>         ... 16 more
> Caused by: java.net.ConnectException: Connection timed out: connect
>         at sun.nio.ch.Net.connect0(Native Method)
>         at sun.nio.ch.Net.connect(Net.java:454)
>         at sun.nio.ch.Net.connect(Net.java:446)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>         at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
>         at sun.nio.ch.PipeImpl$Initializer$LoopbackConnector.run(PipeImpl.java:127)
>         at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:76)
>         ... 25 more
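For context on why a firewall matters here: the trace bottoms out in `WindowsSelectorImpl`, because on Windows `java.nio.channels.Selector.open()` is implemented over a `Pipe` that is itself backed by a loopback TCP connection, so blocked loopback traffic surfaces as "Unable to establish loopback connection". A standalone probe for that environment problem (a hypothetical diagnostic helper, not part of Kafka) could be:

```java
// Probe whether this JVM/host can create the NIO primitives that
// ReplicaFetcherThread needs. On Windows both calls require a working
// loopback TCP connection; on other platforms they use native pipes.
import java.io.IOException;
import java.nio.channels.Pipe;
import java.nio.channels.Selector;

public class LoopbackProbe {
    static boolean selectorUsable() {
        try (Selector s = Selector.open()) {   // fails when loopback is broken
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    static boolean pipeUsable() {
        try {
            Pipe p = Pipe.open();              // the loopback-backed primitive on Windows
            p.sink().close();
            p.source().close();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("selector=" + selectorUsable() + " pipe=" + pipeUsable());
    }
}
```

If either probe fails on the affected broker host, the problem is in the firewall/winsock configuration rather than in Kafka itself.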
[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299347#comment-16299347 ]

Frederic Arno commented on KAFKA-6323:
--------------------------------------

Thanks for your comments, I'll consider a KIP for the added API method when I'm back from a two-week leave.

> punctuate with WALL_CLOCK_TIME triggered immediately
> ----------------------------------------------------
>
>                 Key: KAFKA-6323
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6323
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Frederic Arno
>            Assignee: Frederic Arno
>             Fix For: 1.1.0, 1.0.1
[jira] [Comment Edited] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
[ https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299342#comment-16299342 ]

Rong Tang edited comment on KAFKA-6375 at 12/21/17 12:53 AM:
-------------------------------------------------------------

Hi [~huxi_2b], any thoughts on why the "Unable to establish loopback connection" exception happens, or any way to handle it? Another broker hit this exception again, and its replicas stayed out of sync for 2 days until I restarted it. Both brokers had been the controller before I restarted them; I am not sure whether that is related. And I only see the exception when starting the broker. Thanks.

> Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6375
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6375
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: Windows, 23 brokers KafkaCluster
>            Reporter: Rong Tang
[jira] [Commented] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
[ https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299342#comment-16299342 ]

Rong Tang commented on KAFKA-6375:
----------------------------------

Hi huxihx, any thoughts on why the "Unable to establish loopback connection" exception happens, or any way to handle it? Another broker hit this exception again, and its replicas stayed out of sync for 2 days until I restarted it. Both brokers had been the controller before I restarted them; I am not sure whether that is related. And I only see the exception when starting the broker. Thanks.

> Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6375
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6375
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: Windows, 23 brokers KafkaCluster
>            Reporter: Rong Tang
[jira] [Assigned] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson reassigned KAFKA-6366:
--------------------------------------

    Assignee: Jason Gustafson

> StackOverflowError in kafka-coordinator-heartbeat-thread
> --------------------------------------------------------
>
>                 Key: KAFKA-6366
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6366
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 1.0.0
>            Reporter: Joerg Heinicke
>            Assignee: Jason Gustafson
>         Attachments: 6366.v1.txt, ConverterProcessor.zip, Screenshot-2017-12-19 21.35-22.10 processing.png
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | my-consumer-group] ERROR - Uncaught exception in thread 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>         at java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>         at java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>         at java.util.Calendar.getDisplayName(Calendar.java:2110)
>         at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>         at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>         at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>         at java.text.DateFormat.format(DateFormat.java:345)
>         at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>         at org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>         at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>         at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>         at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>         at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>         at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>         at org.apache.log4j.Category.callAppenders(Category.java:206)
>         at org.apache.log4j.Category.forcedLog(Category.java:391)
>         at org.apache.log4j.Category.log(Category.java:856)
>         at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>         at org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>         at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>         at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>         at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>         at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>         at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around a hundred times:
> ...
>         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416)
>         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388)
>         at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653)
>         at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>         at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>         at
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299311#comment-16299311 ]

Jason Gustafson commented on KAFKA-6366:
----------------------------------------

[~joerg.heinicke] Ok, we may need to take a look at the more verbose logs (DEBUG is probably good enough). One idea I had is the following. Suppose the consumer has a large number of records buffered. It might be possible to hit a race condition in which the foreground thread is hitting {{poll()}} and {{commitAsync()}} in a tight loop because {{max.poll.records=50}} is immediately satisfied from the buffered records. At some point, the heartbeat thread might see the coordinator disconnect and attempt to mark it dead, but the new offset commit requests pile up as fast as the background thread can cancel them in {{disconnect()}}, which ultimately causes the stack overflow.

I think ultimately the fix for this issue is going to be setting the coordinator to null in {{AbstractCoordinator.coordinatorDead()}} prior to disconnecting, and I will go ahead and submit a patch to do this. It would be good to confirm from the logging whether the scenario I mentioned above is happening or whether it is something else. If we still can't figure out the cause, perhaps we can at least test with the patch.
> StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip, > Screenshot-2017-12-19 21.35-22.10 processing.png > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. > 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | > my-consumer-group] ERROR - Uncaught exception in thread > 'kafka-coordinator-heartbeat-thread | my-consumer-group': > java.lang.StackOverflowError > at > java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362) > at > java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340) > at java.util.Calendar.getDisplayName(Calendar.java:2110) > at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936) > at java.text.DateFormat.format(DateFormat.java:345) > at > org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443) > at > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > at org.apache.log4j.PatternLayout.format(PatternLayout.java:506) > at > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) > 
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) > at > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > at > org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324) > at > org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > ... > the following 9 lines are repeated around hundred times. > ... > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653) > ...
[jira] [Commented] (KAFKA-6335) SimpleAclAuthorizerTest#testHighConcurrencyModificationOfResourceAcls fails intermittently
[ https://issues.apache.org/jira/browse/KAFKA-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299303#comment-16299303 ] Ted Yu commented on KAFKA-6335: --- If needed, we can add logging into codebase so that it is easier to figure out the cause. > SimpleAclAuthorizerTest#testHighConcurrencyModificationOfResourceAcls fails > intermittently > -- > > Key: KAFKA-6335 > URL: https://issues.apache.org/jira/browse/KAFKA-6335 > Project: Kafka > Issue Type: Test >Reporter: Ted Yu >Assignee: Manikumar > Fix For: 1.1.0 > > > From > https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3045/testReport/junit/kafka.security.auth/SimpleAclAuthorizerTest/testHighConcurrencyModificationOfResourceAcls/ > : > {code} > java.lang.AssertionError: expected acls Set(User:36 has Allow permission for > operations: Read from hosts: *, User:7 has Allow permission for operations: > Read from hosts: *, User:21 has Allow permission for operations: Read from > hosts: *, User:39 has Allow permission for operations: Read from hosts: *, > User:43 has Allow permission for operations: Read from hosts: *, User:3 has > Allow permission for operations: Read from hosts: *, User:35 has Allow > permission for operations: Read from hosts: *, User:15 has Allow permission > for operations: Read from hosts: *, User:16 has Allow permission for > operations: Read from hosts: *, User:22 has Allow permission for operations: > Read from hosts: *, User:26 has Allow permission for operations: Read from > hosts: *, User:11 has Allow permission for operations: Read from hosts: *, > User:38 has Allow permission for operations: Read from hosts: *, User:8 has > Allow permission for operations: Read from hosts: *, User:28 has Allow > permission for operations: Read from hosts: *, User:32 has Allow permission > for operations: Read from hosts: *, User:25 has Allow permission for > operations: Read from hosts: *, User:41 has Allow permission for operations: > Read from hosts: *, 
User:44 has Allow permission for operations: Read from > hosts: *, User:48 has Allow permission for operations: Read from hosts: *, > User:2 has Allow permission for operations: Read from hosts: *, User:9 has > Allow permission for operations: Read from hosts: *, User:14 has Allow > permission for operations: Read from hosts: *, User:46 has Allow permission > for operations: Read from hosts: *, User:13 has Allow permission for > operations: Read from hosts: *, User:5 has Allow permission for operations: > Read from hosts: *, User:29 has Allow permission for operations: Read from > hosts: *, User:45 has Allow permission for operations: Read from hosts: *, > User:6 has Allow permission for operations: Read from hosts: *, User:37 has > Allow permission for operations: Read from hosts: *, User:23 has Allow > permission for operations: Read from hosts: *, User:19 has Allow permission > for operations: Read from hosts: *, User:24 has Allow permission for > operations: Read from hosts: *, User:17 has Allow permission for operations: > Read from hosts: *, User:34 has Allow permission for operations: Read from > hosts: *, User:12 has Allow permission for operations: Read from hosts: *, > User:42 has Allow permission for operations: Read from hosts: *, User:4 has > Allow permission for operations: Read from hosts: *, User:47 has Allow > permission for operations: Read from hosts: *, User:18 has Allow permission > for operations: Read from hosts: *, User:31 has Allow permission for > operations: Read from hosts: *, User:49 has Allow permission for operations: > Read from hosts: *, User:33 has Allow permission for operations: Read from > hosts: *, User:1 has Allow permission for operations: Read from hosts: *, > User:27 has Allow permission for operations: Read from hosts: *) but got > Set(User:36 has Allow permission for operations: Read from hosts: *, User:7 > has Allow permission for operations: Read from hosts: *, User:21 has Allow > permission for operations: Read from 
hosts: *, User:39 has Allow permission > for operations: Read from hosts: *, User:43 has Allow permission for > operations: Read from hosts: *, User:3 has Allow permission for operations: > Read from hosts: *, User:35 has Allow permission for operations: Read from > hosts: *, User:15 has Allow permission for operations: Read from hosts: *, > User:16 has Allow permission for operations: Read from hosts: *, User:22 has > Allow permission for operations: Read from hosts: *, User:26 has Allow > permission for operations: Read from hosts: *, User:11 has Allow permission > for operations: Read from hosts: *, User:38 has Allow permission for > operations: Read from hosts: *, User:8 has Allow permission for operations: > Read from
[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298915#comment-16298915 ] Ted Yu edited comment on KAFKA-6366 at 12/21/17 12:19 AM: -- Should I log another JIRA for the above? One aspect we need to pay attention to is avoiding flooding the log file, since a stack trace is much longer than the single-sentence message. To prevent repeated stack traces from flooding the log, we can keep a Map from stack trace to count (the number of times each stack trace has occurred).
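Ted's map-from-stack-trace-to-count idea could look roughly like this. A hypothetical sketch, not Kafka code; the class and method names are invented. Only the first sighting of a given trace is logged in full; later sightings just bump a counter.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TraceDedup {
    // Map from the rendered stack trace to how many times it has occurred.
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    /**
     * Returns true only on the first sighting of this exact stack trace,
     * i.e. when the full trace should be written to the log; repeats
     * should be suppressed and only counted.
     */
    public boolean shouldLogFully(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        // merge() increments atomically; the first sighting maps to 1.
        return counts.merge(sw.toString(), 1, Integer::sum) == 1;
    }

    public static void main(String[] args) {
        TraceDedup dedup = new TraceDedup();
        Throwable t = new RuntimeException("Marking the coordinator dead");
        System.out.println(dedup.shouldLogFully(t)); // true  (log full trace)
        System.out.println(dedup.shouldLogFully(t)); // false (suppress, count only)
    }
}
```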
[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278 ] Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:10 AM: -- Simply for the quantity structure: our system has a throughput of about 100k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. With theoretically steady processing (in this particular incident processing seemed steady enough, even though it often starts fluctuating strongly), this means around 3k messages per minute per thread, or about 50 messages per second. The batch size is also rather small at just 50 messages, so roughly 1 batch and thereby one async commit per second. The number of async commit failures is still off: e.g. > 5,000 failures/log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high as expected even if all commits failed within that time. [^Screenshot-2017-12-19 21.35-22.10 processing.png] (Timings here are UTC + 1 while the log file is UTC.) Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput.
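The rate arithmetic in Joerg's comment can be checked mechanically. All inputs below are figures quoted from the comment (nothing here is measured), and the class name is invented:

```java
public class CommitRateCheck {
    public static void main(String[] args) {
        double msgsPerMin = 100_000;          // quoted throughput
        int threads = 5 * 6;                  // 5 instances x 6 KafkaConsumers = 30 (matches 30 partitions)
        double msgsPerSecPerThread = msgsPerMin / threads / 60;   // ~55.6 msgs/s/thread
        double commitsPerSecPerThread = msgsPerSecPerThread / 50; // batch size 50 -> ~1.1 commits/s
        double expected = commitsPerSecPerThread * 25 * 60;       // 25-minute window -> ~1,667 commits
        System.out.printf("expected commits/thread in window: %.0f%n", expected);
        // >5,000 observed failure log entries vs ~1,667 expected commits:
        // about 3x, matching the "more than 3 times as high" observation.
        System.out.printf("observed/expected: %.1fx%n", 5000 / expected);
    }
}
```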
[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Heinicke updated KAFKA-6366: -- Attachment: (was: Screenshot-2017-12-21 processing.png)
[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Heinicke updated KAFKA-6366: -- Attachment: Screenshot-2017-12-19 21.35-22.10 processing.png > StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip, > Screenshot-2017-12-19 21.35-22.10 processing.png > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. 
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | > my-consumer-group] ERROR - Uncaught exception in thread > 'kafka-coordinator-heartbeat-thread | my-consumer-group': > java.lang.StackOverflowError > at > java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362) > at > java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340) > at java.util.Calendar.getDisplayName(Calendar.java:2110) > at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936) > at java.text.DateFormat.format(DateFormat.java:345) > at > org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443) > at > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > at org.apache.log4j.PatternLayout.format(PatternLayout.java:506) > at > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) > at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) > at > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > at > org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324) > at > org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > 
org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > ... > the following 9 lines are repeated around hundred times. > ... > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > org.apach
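The recursion in the trace above can be reproduced in miniature. The following is a hypothetical, self-contained sketch — these are not Kafka's real classes; the names only mirror the methods in the stack. Failing pending requests while the coordinator still looks "known" lets each failure callback re-enter `coordinatorDead()`, so recursion depth grows with the number of pending requests (the real StackOverflowError). Clearing the coordinator reference before failing the requests, the reordering proposed in the PR for this ticket, keeps the depth constant.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CoordinatorLoopSketch {
    Object coordinator = new Object();          // non-null = "coordinator known"
    final Deque<Runnable> unsent = new ArrayDeque<>();
    int depth = 0, maxDepth = 0;
    final boolean clearStateFirst;              // true = fixed ordering

    CoordinatorLoopSketch(boolean clearStateFirst) {
        this.clearStateFirst = clearStateFirst;
        // 50 pending requests whose failure callbacks mark the coordinator dead again
        for (int i = 0; i < 50; i++)
            unsent.add(() -> { if (coordinator != null) coordinatorDead(); });
    }

    void coordinatorDead() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        if (clearStateFirst)
            coordinator = null;                 // fix: callbacks now see it as unknown
        while (!unsent.isEmpty())
            unsent.poll().run();                // "disconnect": fail all pending requests
        if (!clearStateFirst)
            coordinator = null;                 // buggy: cleared only after callbacks ran
        depth--;
    }

    public static void main(String[] args) {
        CoordinatorLoopSketch buggy = new CoordinatorLoopSketch(false);
        buggy.coordinatorDead();
        CoordinatorLoopSketch fixed = new CoordinatorLoopSketch(true);
        fixed.coordinatorDead();
        // buggy depth tracks the queue size; fixed depth stays at 1
        System.out.println("buggy maxDepth=" + buggy.maxDepth
                + ", fixed maxDepth=" + fixed.maxDepth);
    }
}
```

With the buggy ordering the depth scales with the number of in-flight requests, which is why the real error only appears under heavy commit-failure load.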
[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278 ] Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:08 AM: -- Simply for the quantity structure: Our system has a throughput of about 100 k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. Eventually with a theoretically steady processing (in this particular incident processing seemed steady enough even though often it starts fluctuating strongly) this means around 3k messages per minute per thread or 50 per messages per second. The batch size is also rather small with just 50 messages, so 1 batch and thereby one async commit per second. The number of async commit failures is slightly off: e.g. > 5,000 failures/ log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high than expected in case all commits fail within that time. [^Screenshot-2017-12-19 21.35-22.10 processing.png] (Timings here are UTC + 1 while in the log file it's UTC.) Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput. was (Author: joerg.heinicke): Simply for the quantity structure: Our system has a throughput of about 100 k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. Eventually with a theoretically steady processing (in this particular incident processing seemed steady enough even though often it starts fluctuating strongly) this means around 3k messages per minute per thread or 50 per messages per second. The batch size is also rather small with just 50 messages, so 1 batch and thereby one async commit per second. The number of async commit failures is slightly off: e.g. 
> 5,000 failures/ log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high than expected in case all commits fail within that time. [^Screenshot-2017-12-19 processing.png] (Timings here are UTC + 1 while in the log file it's UTC.) Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput.
[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278 ] Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:07 AM: -- Simply for the quantity structure: Our system has a throughput of about 100 k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. Eventually with a theoretically steady processing (in this particular incident processing seemed steady enough even though often it starts fluctuating strongly) this means around 3k messages per minute per thread or 50 per messages per second. The batch size is also rather small with just 50 messages, so 1 batch and thereby one async commit per second. The number of async commit failures is slightly off: e.g. > 5,000 failures/ log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high than expected in case all commits fail within that time. [^Screenshot-2017-12-19 processing.png] (Timings here are UTC + 1 while in the log file it's UTC.) Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput. was (Author: joerg.heinicke): Simply for the quantity structure: Our system has a throughput of about 100 k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. Eventually with a theoretical steady processing (unfortunately it's not, but usually in this erroneous case the processing throughput is highly fluctuating) this means around 3k messages per minute per thread or 50 per messages per second. The batch size is also rather small with just 50 messages, so 1 batch and thereby one async commit per second. The number of async commit failures is slightly off: e.g. 
> 5,000 failures/ log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high than expected in case all commits fail within that time. Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput.
[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Heinicke updated KAFKA-6366: -- Attachment: Screenshot-2017-12-21 processing.png
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278 ] Joerg Heinicke commented on KAFKA-6366: --- Simply for the quantity structure: Our system has a throughput of about 100 k messages per minute. The topic has 30 partitions. The consumer group matches those and consists of 5 service instances with 6 KafkaConsumers each. With theoretically steady processing (unfortunately it's not; in this erroneous case the processing throughput usually fluctuates heavily) this means around 3 k messages per minute per thread, or about 50 messages per second. The batch size is also rather small at just 50 messages, so one batch and thereby one async commit per second. The number of async commit failures is slightly off: e.g. > 5,000 failures/log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times as high as expected even if all commits failed within that time. Btw., we are aware of the underlying issue with the infrastructure: heavily over-committed VMs in terms of CPU and rather low storage throughput.
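The arithmetic in the comment above can be checked directly. A small sketch (all numbers taken from the comment; one consumer thread per partition as described):

```java
class CommitRateEstimate {
    // Upper bound on commit-failure log entries, if every single commit fails.
    static double expectedCommitFailures(double msgsPerMin, int threads,
                                         int batchSize, int windowSec) {
        double perThreadPerSec = msgsPerMin / threads / 60.0; // ~55.6 msg/s per consumer thread
        double commitsPerSec = perThreadPerSec / batchSize;   // one async commit per 50-message batch
        return commitsPerSec * windowSec;
    }

    public static void main(String[] args) {
        // 100k msg/min, 30 partitions/threads, batch size 50, 25-minute window
        double expected = expectedCommitFailures(100_000, 30, 50, 25 * 60);
        System.out.printf("expected failures in 25 min if every commit fails: %.0f%n", expected);
        // The ticket reports > 5,000 log entries in the same window -- roughly 3x this
        // bound, matching the comment's "more than 3 times as high" observation.
    }
}
```

The ~1,667 bound versus >5,000 observed entries suggests each failed commit produced several log entries, consistent with the re-fired callbacks in the stack trace.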
[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299273#comment-16299273 ] Guozhang Wang commented on KAFKA-6323: -- My concern with aligning to wall-clock time is that we would intentionally trigger {{punctuate(T2)}} where the passed-in parameter {{T2}} is actually not the current system wall-clock time {{NOW}}, but smaller than it. For user punctuation logic that relies on the accuracy of the passed-in system time, that might be a real problem (on the other hand, for stream time I consider this much less of an issue since it is data driven anyway, and hence stream time is defined on record timestamps only, not on when a record is being processed). That being said, I do not feel strongly against aligning with wall-clock time, just throwing in my two cents here. If people are in favor of doing this I'm OK as well. Just a reminder that we need to document the behavior clearly in the javadoc. > punctuate with WALL_CLOCK_TIME triggered immediately > > > Key: KAFKA-6323 > URL: https://issues.apache.org/jira/browse/KAFKA-6323 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Frederic Arno >Assignee: Frederic Arno > Fix For: 1.1.0, 1.0.1 > > > When working on a custom Processor from which I am scheduling a punctuation > using WALL_CLOCK_TIME, I've noticed that whatever the punctuation interval I > set, a call to my Punctuator is always triggered immediately. > Having a quick look at kafka-streams' code, I could find that all > PunctuationSchedule's timestamps are matched against the current time in > order to decide whether or not to trigger the punctuator > (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). > However, I've only seen code that initializes PunctuationSchedule's timestamp > to 0, which I guess is what is causing an immediate punctuation. 
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's > timestamp be initialized to current time + interval? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
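The behavior described in the ticket can be sketched with a toy schedule. This is a hypothetical model, not Kafka's actual `PunctuationQueue`/`PunctuationSchedule`: with the timestamp initialized to 0, any current time satisfies the due check, so the first punctuation fires immediately; seeding it with now + interval, as the report suggests, defers the first call by one interval.

```java
class PunctuationSketch {
    static class Schedule {
        long timestamp;        // next time the punctuator is due
        final long interval;

        Schedule(long firstDue, long interval) {
            this.timestamp = firstDue;
            this.interval = interval;
        }

        // Fire when due, then reschedule one interval ahead.
        boolean mayPunctuate(long now) {
            if (now >= timestamp) {
                timestamp = now + interval;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        long now = 1_000_000L, interval = 100L;
        Schedule zeroInit = new Schedule(0L, interval);              // pre-fix behaviour
        Schedule deferred = new Schedule(now + interval, interval);  // suggested fix
        System.out.println(zeroInit.mayPunctuate(now));              // true: fires at once
        System.out.println(deferred.mayPunctuate(now));              // false: waits one interval
        System.out.println(deferred.mayPunctuate(now + interval));   // true
    }
}
```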
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299258#comment-16299258 ] Joerg Heinicke commented on KAFKA-6366: --- This is our commit async code block: {code:java} this.kafkaConsumer.commitAsync((offsets, exception) -> { offsets.forEach((k, v) -> { log.debug(k + "\t" + v); }); if (exception != null) { log.error(KafkaConsumer.class.getSimpleName() + " failed committing offsets asynchronously! ", exception); } else { log.debug("Committing Consumer Offset succeeded!"); } }); {code} I don't see any explicit retry logic, nor can I imagine an implicit one.
[jira] [Commented] (KAFKA-4263) QueryableStateIntegrationTest.concurrentAccess is failing occasionally in jenkins builds
[ https://issues.apache.org/jira/browse/KAFKA-4263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299251#comment-16299251 ] ASF GitHub Bot commented on KAFKA-4263: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/4342 > QueryableStateIntegrationTest.concurrentAccess is failing occasionally in > jenkins builds > > > Key: KAFKA-4263 > URL: https://issues.apache.org/jira/browse/KAFKA-4263 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 0.10.1.0 >Reporter: Damian Guy >Assignee: Matthias J. Sax > Fix For: 1.1.0, 1.0.1 > > > We are seeing occasional failures of this test in jenkins, however it isn't > failing when running locally (confirmed by multiple people). Needs > investigating
[jira] [Resolved] (KAFKA-4263) QueryableStateIntegrationTest.concurrentAccess is failing occasionally in jenkins builds
[ https://issues.apache.org/jira/browse/KAFKA-4263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang resolved KAFKA-4263. -- Resolution: Fixed Fix Version/s: (was: 0.10.1.1) 1.0.1 1.1.0 Issue resolved by pull request 4342 [https://github.com/apache/kafka/pull/4342]
[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299140#comment-16299140 ] Matthias J. Sax commented on KAFKA-6323: [~frederica] We cannot simply change the interface of the `schedule` method -- this is a public API change and requires a KIP (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals). If you think it's a valuable addition, feel free to do a KIP on it (including a new JIRA to cover the change). [~guozhang] I see your point that aligning wall-clock time punctuations might not be as valuable as aligning stream-time ones. However, I agree with [~frederica] that if we move from `T2 (T2 >= T1)` to `T2 + T`, punctuations shift into the future, and I think this would be undesired behavior. For long GC pauses etc., we would just skip the corresponding punctuations, similar to the skipping behavior for stream time in case stream time makes a larger advance than 2x the punctuation interval.
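The "skip instead of shift" behavior Matthias describes can be sketched with a hypothetical helper (not Kafka code): after a late punctuation, e.g. due to a long GC pause, the next due time advances along the original grid first + k * interval rather than being rescheduled at now + interval, so punctuation times never drift into the future.

```java
class AlignedPunctuation {
    // Next due time on the original grid, skipping any intervals already missed.
    static long nextDue(long currentDue, long interval, long now) {
        if (now < currentDue)
            return currentDue;                        // not yet due
        long missed = (now - currentDue) / interval;  // whole intervals skipped
        return currentDue + (missed + 1) * interval;  // next point on the grid
    }

    public static void main(String[] args) {
        long first = 1_000L, interval = 100L;
        // A punctuation due at 1,000 actually runs late, at 1,350 (3.5 intervals late):
        System.out.println(nextDue(first, interval, 1_350L)); // 1400, still on the 1,000 + k*100 grid
    }
}
```

Rescheduling at now + interval instead would yield 1,450 here, and every late firing would push the schedule further off its grid.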
[jira] [Comment Edited] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299140#comment-16299140 ] Matthias J. Sax edited comment on KAFKA-6323 at 12/20/17 9:40 PM: -- [~frederica] We cannot simply change the interface of {{schedule()}} method -- this is a public API change and requires a KIP (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals). If you think it's a valuable addition, feel free to do a KIP on it (including a new JIRA to cover the change). [~guozhang] I see your point that aligning start wall-clock time punctuations might not be as valuable as aligning stream-time ones. However, I agree with [~frederica] that if we move from {{T2 (T2 >= T1)}} to {{T2 + T}} punctuation shift into the future and I think this would be undesired behavior. For long GC pauses etc, we would just skip the corresponding punctuation similarly to the skipping behavior for stream-time in case stream-time make a larger advance that 2x punctuation interval. was (Author: mjsax): [~frederica] We cannot simply change the interface of `schedule` method -- this is a public API change and requires a KIP (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals). If you think it's a valuable addition, feel free to do a KIP on it (including a new JIRA to cover the change). [~guozhang] I see your point that aligning start wall-clock time punctuations might not be as valuable as aligning stream-time ones. However, I agree with [~frederica] that if we move from `T2 (T2 >= T1)` to `T2 + T` punctuation shift into the future and I think this would be undesired behavior. For long GC pauses etc, we would just skip the corresponding punctuation similarly to the skipping behavior for stream-time in case stream-time make a larger advance that 2x punctuation interval. 
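The skipping behavior Matthias describes can be sketched with a small model (illustrative only, not Kafka code): after a long pause, scheduling the next punctuation relative to the current time skips the missed firings, whereas advancing the schedule step by step would fire a backlog of punctuations.

```java
// Illustrative model of the two catch-up policies discussed above,
// for a schedule that fell far behind (e.g. after a long GC pause).
final class CatchUpModel {
    // Skip policy: fire once now and schedule the next punctuation
    // relative to "now", dropping punctuations missed during the pause.
    static long nextWithSkip(long now, long interval) {
        return now + interval;
    }

    // No-skip policy: fire repeatedly until the schedule catches up,
    // i.e. deliver the whole backlog of missed punctuations.
    static int firingsWithoutSkip(long scheduled, long now, long interval) {
        int firings = 0;
        while (scheduled <= now) {
            firings++;
            scheduled += interval;
        }
        return firings;
    }
}
```

For a 1-second interval and a 10-second pause, the no-skip policy fires eleven times in a burst, while the skip policy fires once and resumes one interval later.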
[jira] [Commented] (KAFKA-5849) Add process stop faults, round trip workload, partitioned produce-consume test
[ https://issues.apache.org/jira/browse/KAFKA-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299131#comment-16299131 ] ASF GitHub Bot commented on KAFKA-5849: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/4323 > Add process stop faults, round trip workload, partitioned produce-consume test > -- > > Key: KAFKA-5849 > URL: https://issues.apache.org/jira/browse/KAFKA-5849 > Project: Kafka > Issue Type: Sub-task >Reporter: Colin P. McCabe >Assignee: Colin P. McCabe > Fix For: 1.1.0 > > > Add partitioned produce consume test -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-5849) Add process stop faults, round trip workload, partitioned produce-consume test
[ https://issues.apache.org/jira/browse/KAFKA-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-5849. --- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 4323 [https://github.com/apache/kafka/pull/4323] > Add process stop faults, round trip workload, partitioned produce-consume test > -- > > Key: KAFKA-5849 > URL: https://issues.apache.org/jira/browse/KAFKA-5849 > Project: Kafka > Issue Type: Sub-task >Reporter: Colin P. McCabe >Assignee: Colin P. McCabe > Fix For: 1.1.0 > > > Add partitioned produce consume test -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6394) Prevent misconfiguration of advertised listeners
Jason Gustafson created KAFKA-6394: -- Summary: Prevent misconfiguration of advertised listeners Key: KAFKA-6394 URL: https://issues.apache.org/jira/browse/KAFKA-6394 Project: Kafka Issue Type: Bug Reporter: Jason Gustafson We don't really have any protection from misconfiguration of the advertised listeners. Sometimes users will copy the config from one host to another during an upgrade. They may remember to update the broker id, but forget about the advertised listeners. It can be surprisingly difficult to detect this unless you know to look for it (e.g. you might just see a lot of NotLeaderForPartition errors as the fetchers connect to the wrong broker). It may not be totally foolproof, but it's probably enough for the common misconfiguration case to check existing brokers to see whether there are any which have already registered the advertised listener. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
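The proposed sanity check could look roughly like the following sketch (purely illustrative; the real broker registration path is different, and the class name and hostnames below are made up): before accepting a registration, compare the new broker's advertised listener against those already registered and reject a duplicate under a different broker id.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposed startup check: reject registration
// when another broker id has already registered the same advertised listener.
final class ListenerRegistry {
    private final Map<String, Integer> listenerToBroker = new HashMap<>();

    // Returns true if the registration is accepted.
    boolean register(int brokerId, String advertisedListener) {
        Integer owner = listenerToBroker.get(advertisedListener);
        if (owner != null && owner != brokerId) {
            return false; // likely a copied config: same listener, different broker id
        }
        listenerToBroker.put(advertisedListener, brokerId);
        return true;
    }
}
```

This catches the scenario described above, where a config is copied to a new host with the broker id updated but the advertised listeners left pointing at the old broker.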
[jira] [Commented] (KAFKA-5647) Use async ZookeeperClient for Admin operations
[ https://issues.apache.org/jira/browse/KAFKA-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299024#comment-16299024 ] ASF GitHub Bot commented on KAFKA-5647: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/4260 > Use async ZookeeperClient for Admin operations > -- > > Key: KAFKA-5647 > URL: https://issues.apache.org/jira/browse/KAFKA-5647 > Project: Kafka > Issue Type: Sub-task >Reporter: Ismael Juma >Assignee: Manikumar > Fix For: 1.1.0 > > > Since we will be removing the ZK dependency in most of the admin clients, we > only need to change the admin operations used on the server side. This > includes converting AdminManager and the remaining usage of zkUtils in > KafkaApis to use ZookeeperClient/KafkaZkClient. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6393) Add tool to view active brokers
Jason Gustafson created KAFKA-6393: -- Summary: Add tool to view active brokers Key: KAFKA-6393 URL: https://issues.apache.org/jira/browse/KAFKA-6393 Project: Kafka Issue Type: Improvement Reporter: Jason Gustafson It would be helpful to have a tool to view the active brokers in the cluster. For example, it could include the following: 1. Broker id and version (maybe detected through ApiVersions request) 2. Broker listener information 3. Whether broker is online 4. Which broker is the active controller 5. Maybe some key configs (e.g. inter-broker version and message format version) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298915#comment-16298915 ] Ted Yu commented on KAFKA-6366: --- Should I log another JIRA for the above? One aspect we need to pay attention to is avoiding flooding the log file, since the stack trace is much longer than the single-sentence message. > StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. 
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | > my-consumer-group] ERROR - Uncaught exception in thread > 'kafka-coordinator-heartbeat-thread | my-consumer-group': > java.lang.StackOverflowError > at > java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362) > at > java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340) > at java.util.Calendar.getDisplayName(Calendar.java:2110) > at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936) > at java.text.DateFormat.format(DateFormat.java:345) > at > org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443) > at > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > at org.apache.log4j.PatternLayout.format(PatternLayout.java:506) > at > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) > at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) > at > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > at > org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324) > at > org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > 
org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > ... > the following 9 lines are repeated around hundred times. > ... > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298908#comment-16298908 ] Jason Gustafson commented on KAFKA-6366: [~tedyu] Agreed. I think we should just include the cause when we throw. > StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. 
[jira] [Created] (KAFKA-6392) Do not permit message down-conversion for replicas
Jason Gustafson created KAFKA-6392: -- Summary: Do not permit message down-conversion for replicas Key: KAFKA-6392 URL: https://issues.apache.org/jira/browse/KAFKA-6392 Project: Kafka Issue Type: Improvement Reporter: Jason Gustafson We have seen several cases where down-conversion caused replicas to diverge from the leader in subtle ways. Generally speaking, even if we addressed all of the edge cases so that down-conversion worked correctly as far as consistency of offsets, it would probably still be a bad idea to permit down-conversion. For example, message timestamps can be lost if down-converting from v1 to v0, and transactional data could be lost if down-converting from v2 to v1 or v0. With that in mind, it would be better to forbid down-conversion for replica fetches. Following the normal upgrade procedure, down-conversion is not needed anyway, but users often skip updating the inter-broker version. It is probably better in these cases to let the ISR shrink until the replicas have been updated as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298880#comment-16298880 ] Ted Yu commented on KAFKA-6366: --- {code} completedOffsetCommits.add(new OffsetCommitCompletion(callback, offsets, RetriableCommitFailedException.withUnderlyingMessage(e.getMessage()))); {code} In Joerg's case, e.getMessage() was null. I wonder if we can provide more information to the user when e.getMessage() is null, e.g. by logging e. > StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. 
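The fix discussed in this thread -- keeping the underlying exception as the cause rather than copying only its possibly-null message -- can be sketched as follows (illustrative class name; the real class is RetriableCommitFailedException in the consumer client):

```java
// Illustrative sketch of the fix discussed above: wrap the underlying
// exception as the cause instead of copying only its (possibly null) message.
final class RetriableCommitFailed extends RuntimeException {
    RetriableCommitFailed(Throwable cause) {
        super("Offset commit failed with a retriable exception. " +
              "You should retry committing offsets.", cause);
    }
}
```

With the cause attached, the full underlying stack trace is available to the user even when, as in Joerg's case, the underlying exception's message is null.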
[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists
[ https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298798#comment-16298798 ] David Hay commented on KAFKA-6388: -- Ok...I'll look for that the next time we attempt to upgrade to 1.0. For now we've rolled back to 0.8.2.2 and things seem to be (mostly) stable. (We have one cluster that refuses to spin up the ReplicaFetcherThreads...and no exceptions or log messages to indicate why) > Error while trying to roll a segment that already exists > > > Key: KAFKA-6388 > URL: https://issues.apache.org/jira/browse/KAFKA-6388 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: David Hay >Priority: Blocker > > Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in > our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2). > After spending 30 min or more spewing log messages like this: > {noformat} > [2017-12-19 16:44:28,998] INFO Replica loaded for partition > screening.save.results.screening.save.results.processor.error-43 with initial > high watermark 0 (kafka.cluster.Replica) > {noformat} > Eventually, the replica thread throws the error below (also referenced in the > original issue). If I remove that partition from the data directory and > bounce the broker, it eventually rebalances (assuming it doesn't hit a > different partition with the same error). 
> {noformat} > 2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.log already exists; deleting it first (kafka.log.Log) > [2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.index already exists; deleting it first (kafka.log.Log) > [2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.timeindex already exists; deleting it first > (kafka.log.Log) > [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed > fetcher for partitions __consumer_offsets-20 > (kafka.server.ReplicaFetcherManager) > [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, > fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread) > kafka.common.KafkaException: Error processing data for partition > sr.new.sr.new.processor.error-38 offset 2 > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217) > at > 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167) > at > kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64) > Caused by: kafka.common.KafkaException: Trying to roll a new log segment for > topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it > already exists. > at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338) > at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297) > at kafka.log.Log.maybeHandleIOException(Log.scala:1669) > at kafka.log.Log.roll(Log.scala:1297) > at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284) > at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710) > at kafka.log.Log$$anonfun$append$2.apply(Log.scala:624) > at kafka.log.Log.maybeHandleIOException(Log.scala:1669) > at kafka.log.Log.append(Log.scala:624) > at kafka.log.Log.appendAsFollower(Log.scala:607) > at > kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:102) > at > kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scal
[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists
[ https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298791#comment-16298791 ] Jason Gustafson commented on KAFKA-6388: [~dhay] Let me explain in a little more detail. The specific case is when the latest log segment (i.e. the one with the largest offset) and its index are both empty. There is a bug in the append logic which causes the broker to treat the zero-sized index file as if it were full, which triggers the log rolling logic. But since the log segment is empty, the new rolled log segment will have the same offset and we'll get the exception you saw initially. To fix the problem, you should shutdown the broker, remove the latest index file, and restart. I am not sure that is what is happening here, but it's the first thing I would check for. > Error while trying to roll a segment that already exists > > > Key: KAFKA-6388 > URL: https://issues.apache.org/jira/browse/KAFKA-6388 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: David Hay >Priority: Blocker > > Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in > our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2). > After spending 30 min or more spewing log messages like this: > {noformat} > [2017-12-19 16:44:28,998] INFO Replica loaded for partition > screening.save.results.screening.save.results.processor.error-43 with initial > high watermark 0 (kafka.cluster.Replica) > {noformat} > Eventually, the replica thread throws the error below (also referenced in the > original issue). If I remove that partition from the data directory and > bounce the broker, it eventually rebalances (assuming it doesn't hit a > different partition with the same error). 
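Jason's explanation boils down to a faulty roll condition. A minimal model (illustrative only, not the actual kafka.log.Log code) shows why an empty active segment with a zero-sized index triggers the collision:

```java
// Illustrative model of the roll decision described above (not Kafka code).
// The bug: a zero-sized index reports no free entries, so it is treated as
// "full"; rolling an empty segment then produces a new segment with the
// same base offset, which collides with the existing files.
final class RollModel {
    static boolean indexLooksFull(int indexEntries, int maxEntries) {
        return indexEntries >= maxEntries; // 0 >= 0 holds for a zero-sized index
    }

    // Safe variant: never roll an empty segment, since the rolled segment
    // would start at the same base offset as the current one.
    static boolean shouldRoll(int segmentRecords, int indexEntries, int maxEntries) {
        return segmentRecords > 0 && indexLooksFull(indexEntries, maxEntries);
    }
}
```

This also motivates the workaround in the comment above: removing the latest (empty) index file and restarting lets the broker rebuild it instead of tripping over the bogus "full" check.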
[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread
[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298761#comment-16298761 ] Jason Gustafson commented on KAFKA-6366: [~joerg.heinicke] Thanks for sharing the logs. One thing that immediately stands out is the large number of async offset commit failures. I counted 13,359 instances. Considering the "Marking coordinator dead" messages, there are about 10,862 instances. This is just a guess, but do you have any retry logic implemented for when async offset commits fail? That would explain the large number of "Marking coordinator dead" messages as well as the stack overflow. > StackOverflowError in kafka-coordinator-heartbeat-thread > > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 1.0.0 >Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator : (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. 
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | > my-consumer-group] ERROR - Uncaught exception in thread > 'kafka-coordinator-heartbeat-thread | my-consumer-group': > java.lang.StackOverflowError > at > java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362) > at > java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340) > at java.util.Calendar.getDisplayName(Calendar.java:2110) > at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936) > at java.text.DateFormat.format(DateFormat.java:345) > at > org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443) > at > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > at org.apache.log4j.PatternLayout.format(PatternLayout.java:506) > at > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) > at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) > at > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > at > org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324) > at > org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > 
org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > ... > the following 9 lines are repeated around hundred times. > ... > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388) >
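The failure loop visible in this trace — onFailure → coordinatorDead → disconnect → fail more requests → onFailure — is what the reordering in PR #4349 breaks. Below is a minimal, self-contained model of the mechanism; the class and method names are illustrative stand-ins, not Kafka's actual classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the recursion above (names are illustrative, not Kafka's).
// With the old ordering, each failed request's handler still sees a live
// coordinator, marks it dead again, and re-enters disconnect() on the same
// stack: one extra frame per pending request, hence the StackOverflowError.
class CoordinatorDisconnectModel {
    final Deque<Runnable> pendingFailureHandlers = new ArrayDeque<>();
    Object coordinator = new Object(); // non-null means "coordinator known"
    int depth = 0, maxDepth = 0;

    // Plays the role of marking the coordinator unknown.
    void markCoordinatorDead(boolean clearBeforeDisconnect) {
        if (coordinator == null) return;               // already unknown: no-op
        if (clearBeforeDisconnect) coordinator = null; // reordering from PR #4349
        disconnect();
        coordinator = null;                            // old (buggy) ordering
    }

    // Plays the role of the client disconnect: pending request callbacks
    // fire synchronously, on this same stack.
    void disconnect() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        while (!pendingFailureHandlers.isEmpty()) {
            pendingFailureHandlers.poll().run();
        }
        depth--;
    }

    // Each in-flight request's failure handler reacts by marking the
    // coordinator dead, like the coordinator response handler's onFailure.
    void addPendingRequests(int n, boolean clearBeforeDisconnect) {
        for (int i = 0; i < n; i++) {
            pendingFailureHandlers.add(() -> markCoordinatorDead(clearBeforeDisconnect));
        }
    }
}
```

With 50 pending requests, the old ordering nests disconnect() 51 frames deep, while clearing the coordinator reference first keeps the depth at 1; scale the pending-request count up (e.g. the thousands of failed async commits counted above) and the real consumer exhausts the stack the same way.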
[jira] [Commented] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them
[ https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298661#comment-16298661 ] ASF GitHub Bot commented on KAFKA-6391: --- GitHub user cvaliente opened a pull request: https://github.com/apache/kafka/pull/4347 KAFKA-6391 ensure topics are created with correct partitions BEFORE building the… ensure topics are created with correct partitions BEFORE building the metadata for our stream tasks First ensureCoPartitioning() on repartitionTopicMetadata before creating allRepartitionTopicPartitions ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) You can merge this pull request into a Git repository by running: $ git pull https://github.com/cvaliente/kafka KAFKA-6391 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/4347.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4347 commit bda1803d50d984ef4860579d508c37487df9781a Author: Clemens Valiente Date: 2017-12-20T15:45:41Z ensure topics are created with correct partitions BEFORE building the metadata for our stream tasks > output from ensure copartitioning is not used for Cluster metadata, resulting > in partitions without tasks working on them > - > > Key: KAFKA-6391 > URL: https://issues.apache.org/jira/browse/KAFKA-6391 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Clemens Valiente >Assignee: Clemens Valiente > Original Estimate: 20m > Remaining Estimate: 20m > > https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394 > Map allRepartitionTopicPartitions is created > from repartitionTopicMetadata > THEN we do 
ensureCoPartitioning on repartitionTopicMetadata > THEN we create topics and partitions according to repartitionTopicMetadata > THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata > THEN we use fullMetadata to assign the tasks and no longer use > repartitionTopicMetadata > This results in any change to repartitionTopicMetadata in > ensureCoPartitioning to be used for creating partitions but no tasks are ever > created for any partition added by ensureCoPartitioning() > the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata > before creating allRepartitionTopicPartitions -- This message was sent by Atlassian JIRA (v6.4.14#64029)
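The ordering described above boils down to snapshotting a map before a later step mutates it. A tiny sketch of the effect, using plain maps of topic name to partition count instead of the assignor's real metadata types (all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the StreamPartitionAssignor ordering bug. adjust() plays
// the role of ensureCoPartitioning(), bumping a repartition topic's
// partition count to match its co-partition group.
class RepartitionOrdering {
    static void adjust(Map<String, Integer> topicPartitions) {
        topicPartitions.put("repartition-topic", 8); // co-partition group max
    }

    // Buggy order: snapshot first, then adjust. Topics are created with 8
    // partitions, but task assignment works from the stale snapshot that
    // still says 4 -- so partitions 4..7 never get tasks.
    static Map<String, Integer> buggyOrder(Map<String, Integer> meta) {
        Map<String, Integer> snapshotForAssignment = new HashMap<>(meta);
        adjust(meta);
        return snapshotForAssignment;
    }

    // Fixed order (the reordering proposed in PR #4347): adjust first,
    // then build the metadata used for assignment.
    static Map<String, Integer> fixedOrder(Map<String, Integer> meta) {
        adjust(meta);
        return new HashMap<>(meta);
    }
}
```

The snapshot in the buggy order corresponds to allRepartitionTopicPartitions, and the mutated original to repartitionTopicMetadata, which is what the brokers' topic creation actually follows.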
[jira] [Commented] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them
[ https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298657#comment-16298657 ] Clemens Valiente commented on KAFKA-6391: - I don't know if I'm just misunderstanding something here, because it seems like a relatively basic thing to get wrong. > output from ensure copartitioning is not used for Cluster metadata, resulting > in partitions without tasks working on them > - > > Key: KAFKA-6391 > URL: https://issues.apache.org/jira/browse/KAFKA-6391 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Clemens Valiente >Assignee: Clemens Valiente > Original Estimate: 20m > Remaining Estimate: 20m > > https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394 > Map allRepartitionTopicPartitions is created > from repartitionTopicMetadata > THEN we do ensureCoPartitioning on repartitionTopicMetadata > THEN we create topics and partitions according to repartitionTopicMetadata > THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata > THEN we use fullMetadata to assign the tasks and no longer use > repartitionTopicMetadata > This results in any change to repartitionTopicMetadata in > ensureCoPartitioning to be used for creating partitions but no tasks are ever > created for any partition added by ensureCoPartitioning() > the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata > before creating allRepartitionTopicPartitions -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata
[ https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Clemens Valiente updated KAFKA-6391: Description: https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394 Map allRepartitionTopicPartitions is created from repartitionTopicMetadata THEN we do ensureCoPartitioning on repartitionTopicMetadata THEN we create topics and partitions according to repartitionTopicMetadata THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata THEN we use fullMetadata to assign the tasks and no longer use repartitionTopicMetadata This results in any change to repartitionTopicMetadata in ensureCoPartitioning to be used for creating partitions but no tasks are ever created for any partition added by ensureCoPartitioning() the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata before creating allRepartitionTopicPartitions was: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L366 Map allRepartitionTopicPartitions is created from repartitionTopicMetadata THEN we do ensureCoPartitioning on repartitionTopicMetadata THEN we create topics and partitions according to repartitionTopicMetadata THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata THEN we use fullMetadata to assign the tasks and no longer use repartitionTopicMetadata This results in any change to repartitionTopicMetadata in ensureCoPartitioning to be used for creating partitions but no tasks are ever created for any partition added by ensureCoPartitioning() the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata before creating allRepartitionTopicPartitions > output from ensure copartitioning is not used for Cluster metadata > -- > > Key: KAFKA-6391 > URL: https://issues.apache.org/jira/browse/KAFKA-6391 > Project: 
Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Clemens Valiente >Assignee: Clemens Valiente > Original Estimate: 20m > Remaining Estimate: 20m > > https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394 > Map allRepartitionTopicPartitions is created > from repartitionTopicMetadata > THEN we do ensureCoPartitioning on repartitionTopicMetadata > THEN we create topics and partitions according to repartitionTopicMetadata > THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata > THEN we use fullMetadata to assign the tasks and no longer use > repartitionTopicMetadata > This results in any change to repartitionTopicMetadata in > ensureCoPartitioning to be used for creating partitions but no tasks are ever > created for any partition added by ensureCoPartitioning() > the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata > before creating allRepartitionTopicPartitions -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them
[ https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Clemens Valiente updated KAFKA-6391: Summary: output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them (was: output from ensure copartitioning is not used for Cluster metadata) > output from ensure copartitioning is not used for Cluster metadata, resulting > in partitions without tasks working on them > - > > Key: KAFKA-6391 > URL: https://issues.apache.org/jira/browse/KAFKA-6391 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Clemens Valiente >Assignee: Clemens Valiente > Original Estimate: 20m > Remaining Estimate: 20m > > https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394 > Map allRepartitionTopicPartitions is created > from repartitionTopicMetadata > THEN we do ensureCoPartitioning on repartitionTopicMetadata > THEN we create topics and partitions according to repartitionTopicMetadata > THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata > THEN we use fullMetadata to assign the tasks and no longer use > repartitionTopicMetadata > This results in any change to repartitionTopicMetadata in > ensureCoPartitioning to be used for creating partitions but no tasks are ever > created for any partition added by ensureCoPartitioning() > the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata > before creating allRepartitionTopicPartitions -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata
Clemens Valiente created KAFKA-6391: --- Summary: output from ensure copartitioning is not used for Cluster metadata Key: KAFKA-6391 URL: https://issues.apache.org/jira/browse/KAFKA-6391 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 1.0.0 Reporter: Clemens Valiente Assignee: Clemens Valiente https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L366 Map allRepartitionTopicPartitions is created from repartitionTopicMetadata THEN we do ensureCoPartitioning on repartitionTopicMetadata THEN we create topics and partitions according to repartitionTopicMetadata THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata THEN we use fullMetadata to assign the tasks and no longer use repartitionTopicMetadata This results in any change to repartitionTopicMetadata in ensureCoPartitioning to be used for creating partitions but no tasks are ever created for any partition added by ensureCoPartitioning() the fix is easy: First ensureCoPartitioning() on repartitionTopicMetadata before creating allRepartitionTopicPartitions -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists
[ https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298571#comment-16298571 ] David Hay commented on KAFKA-6388: -- [~hachikuji], The exception in the description is the new stack trace. That is, I copied it from our logs; it's not the stack trace from the original issue. It doesn't seem to be limited to the last partition (assuming you mean the partition with the largest id). Most of our topics have a default 53 partitions, and the id of the partition that has problems is all over the place. If I'm understanding correctly, when we see this error we should be able to recover by shutting down the node, deleting the index files, and restarting? > Error while trying to roll a segment that already exists > > > Key: KAFKA-6388 > URL: https://issues.apache.org/jira/browse/KAFKA-6388 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: David Hay >Priority: Blocker > > Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in > our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2). > After spending 30 min or more spewing log messages like this: > {noformat} > [2017-12-19 16:44:28,998] INFO Replica loaded for partition > screening.save.results.screening.save.results.processor.error-43 with initial > high watermark 0 (kafka.cluster.Replica) > {noformat} > Eventually, the replica thread throws the error below (also referenced in the > original issue). If I remove that partition from the data directory and > bounce the broker, it eventually rebalances (assuming it doesn't hit a > different partition with the same error). 
> {noformat} > 2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.log already exists; deleting it first (kafka.log.Log) > [2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.index already exists; deleting it first (kafka.log.Log) > [2017-12-19 15:16:24,227] WARN Newly rolled segment file > 0002.timeindex already exists; deleting it first > (kafka.log.Log) > [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed > fetcher for partitions __consumer_offsets-20 > (kafka.server.ReplicaFetcherManager) > [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, > fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread) > kafka.common.KafkaException: Error processing data for partition > sr.new.sr.new.processor.error-38 offset 2 > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169) > at > kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217) > at > 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167) > at > kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64) > Caused by: kafka.common.KafkaException: Trying to roll a new log segment for > topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it > already exists. > at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338) > at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297) > at kafka.log.Log.maybeHandleIOException(Log.scala:1669) > at kafka.log.Log.roll(Log.scala:1297) > at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284) > at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710) > at kafka.log.Log$$anonfun$append$2.apply(Log.scala:624) > at kafka.log.Log.maybeHandleIOException(Log.scala:1669) > at kafka.log.Log.append(Log
[jira] [Commented] (KAFKA-5746) Add new metrics to support health checks
[ https://issues.apache.org/jira/browse/KAFKA-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298520#comment-16298520 ] ASF GitHub Bot commented on KAFKA-5746: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/4026 > Add new metrics to support health checks > > > Key: KAFKA-5746 > URL: https://issues.apache.org/jira/browse/KAFKA-5746 > Project: Kafka > Issue Type: New Feature > Components: metrics >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram > Fix For: 1.0.0 > > > It will be useful to have some additional metrics to support health checks. > Details are in > [KIP-188|https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+Add+new+metrics+to+support+health+checks] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-5460) Documentation on website uses word-breaks resulting in confusion
[ https://issues.apache.org/jira/browse/KAFKA-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuri Khan updated KAFKA-5460: - Attachment: 20171220-kafka-doc-tables.png I’d like to add my vote to the dl/dt/dd style. I use a 24" monitor and tile my browser window with an editor, and here’s what the docs look like for me. That is _not_ readable. !20171220-kafka-doc-tables.png|thumbnail! > Documentation on website uses word-breaks resulting in confusion > > > Key: KAFKA-5460 > URL: https://issues.apache.org/jira/browse/KAFKA-5460 > Project: Kafka > Issue Type: Bug >Reporter: Karel Vervaeke > Attachments: 20171220-kafka-doc-tables.png, Screen Shot 2017-06-16 at > 14.45.40.png, Screenshot from 2017-06-23 14-45-02.png > > > Documentation seems to suggest there is a configuration property > auto.off-set.reset but it really is auto.offset.reset. > We should look into disabling the word-break css properties (globally or at > least in the configuration reference tables) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-6331) Transient failure in kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs
[ https://issues.apache.org/jira/browse/KAFKA-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma resolved KAFKA-6331. Resolution: Fixed Fix Version/s: 1.1.0 > Transient failure in > kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs > -- > > Key: KAFKA-6331 > URL: https://issues.apache.org/jira/browse/KAFKA-6331 > Project: Kafka > Issue Type: Bug > Components: unit tests >Reporter: Guozhang Wang >Assignee: Dong Lin > Fix For: 1.1.0 > > > Saw this error once on Jenkins: > https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3025/testReport/junit/kafka.api/AdminClientIntegrationTest/testAlterReplicaLogDirs/ > {code} > Stacktrace > java.lang.AssertionError: timed out waiting for message produce > at kafka.utils.TestUtils$.fail(TestUtils.scala:347) > at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:861) > at > kafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs(AdminClientIntegrationTest.scala:357) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:564) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at java.base/java.lang.Thread.run(Thread.java:844) > Standard Output > [2017-12-07 19:22:56,297] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:22:59,447] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:22:59,453] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:01,335] ERROR Error while creating ephemeral at > /controller, node already exists and owner '99134641238966279' does not match > current session '99134641238966277' > (kafka.zk.KafkaZkClient$CheckedEphemeral:71) > [2017-12-07 19:23:04,695] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:04,760] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:06,764] ERROR Error while creating ephemeral at > /controller, node already exists and owner '99134641586700293' does not match > current session '99134641586700295' > (kafka.zk.KafkaZkClient$CheckedEphemeral:71) > [2017-12-07 19:23:09,379] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN 
server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:09,387] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:11,533] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:11,539] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN serve
[jira] [Commented] (KAFKA-6331) Transient failure in kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs
[ https://issues.apache.org/jira/browse/KAFKA-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298357#comment-16298357 ] ASF GitHub Bot commented on KAFKA-6331: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/4306 > Transient failure in > kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs > -- > > Key: KAFKA-6331 > URL: https://issues.apache.org/jira/browse/KAFKA-6331 > Project: Kafka > Issue Type: Bug > Components: unit tests >Reporter: Guozhang Wang >Assignee: Dong Lin > > Saw this error once on Jenkins: > https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3025/testReport/junit/kafka.api/AdminClientIntegrationTest/testAlterReplicaLogDirs/ > {code} > Stacktrace > java.lang.AssertionError: timed out waiting for message produce > at kafka.utils.TestUtils$.fail(TestUtils.scala:347) > at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:861) > at > kafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs(AdminClientIntegrationTest.scala:357) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:564) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at java.base/java.lang.Thread.run(Thread.java:844) > Standard Output > [2017-12-07 19:22:56,297] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:22:59,447] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:22:59,453] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:01,335] ERROR Error while creating ephemeral at > /controller, node already exists and owner '99134641238966279' does not match > current session '99134641238966277' > (kafka.zk.KafkaZkClient$CheckedEphemeral:71) > [2017-12-07 19:23:04,695] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:04,760] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:06,764] ERROR Error while creating ephemeral at > /controller, node already exists and owner '99134641586700293' does not match > current session '99134641586700295' > (kafka.zk.KafkaZkClient$CheckedEphemeral:71) > [2017-12-07 19:23:09,379] ERROR ZKShutdownHandler is 
not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:09,387] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:11,533] ERROR ZKShutdownHandler is not registered, so > ZooKeeper server won't take any action on ERROR or SHUTDOWN server state > changes (org.apache.zookeeper.server.ZooKeeperServer:472) > [2017-12-07 19:23:11,539] ERROR ZKShutdownHandler is not registe
[jira] [Commented] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed
[ https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298267#comment-16298267 ] Ismael Juma commented on KAFKA-6188: [~chubao], would you be able to test kafka trunk? https://github.com/apache/kafka/commit/a5cd34d7962ff5da9b99d0229ef5a9a5fcb3f318 fixes a case where we would try to delete files that were still open (which only fails in some file systems). > Broker fails with FATAL Shutdown - log dirs have failed > --- > > Key: KAFKA-6188 > URL: https://issues.apache.org/jira/browse/KAFKA-6188 > Project: Kafka > Issue Type: Bug > Components: clients, log >Affects Versions: 1.0.0 > Environment: Windows 10 >Reporter: Valentina Baljak >Priority: Blocker > > Just started with version 1.0.0 after a 4-5 months of using 0.10.2.1. The > test environment is very simple, with only one producer and one consumer. > Initially, everything started fine, stand alone tests worked as expected. > However, running my code, Kafka clients fail after approximately 10 minutes. > Kafka won't start after that and it fails with the same error. > Deleting logs helps to start again, and the same problem occurs. > Here is the error traceback: > [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 30 > ms. (kafka.log.LogManager) > [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of > 9223372036854775807 ms. (kafka.log.LogManager) > [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
> (kafka.network.Acceptor) > [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor > threads (kafka.network.SocketServer) > [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting > (kafka.server.ReplicaManager$LogDirFailureHandler) > [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving > replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs > (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions are > offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs > (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed > fetcher for partitions (kafka.server.ReplicaFetcherManager) > [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped > fetcher for partitions because they are in the failed log dir > C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir > C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager) > [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in > C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11, Gradle and other minor updates
[ https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6390: --- Fix Version/s: 1.1.0 > Update ZooKeeper to 3.4.11, Gradle and other minor updates > -- > > Key: KAFKA-6390 > URL: https://issues.apache.org/jira/browse/KAFKA-6390 > Project: Kafka > Issue Type: Bug >Reporter: Ismael Juma >Assignee: Ismael Juma > Fix For: 1.1.0 > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates
[ https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298252#comment-16298252 ] ASF GitHub Bot commented on KAFKA-6390: --- GitHub user ijuma opened a pull request: https://github.com/apache/kafka/pull/4345 KAFKA-6390: Update ZooKeeper to 3.4.11, Gradle and other minor updates Updates: - Gradle, gradle plugins and maven artifact updated - Bug fix updates for ZooKeeper, Jackson, EasyMock and Snappy Not updated: - RocksDB as it often causes issues, so better done separately - args4j as our test coverage is weak and the update was a feature release Release notes for ZooKeeper 3.4.11: https://zookeeper.apache.org/doc/r3.4.11/releasenotes.html Notable fix is improved handling of UnknownHostException: https://issues.apache.org/jira/browse/ZOOKEEPER-2614 Manually tested that IntelliJ import and build still works. Relying on existing test suite otherwise. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) You can merge this pull request into a Git repository by running: $ git pull https://github.com/ijuma/kafka kafka-6390-zk-3.4.11-and-other-updates Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/4345.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4345 commit 5df853f9ea5425e08794507d1d104d050b56dde2 Author: Ismael Juma Date: 2017-12-20T10:27:57Z KAFKA-6390: Update ZooKeeper to 3.4.11, Gradle and other minor updates Updates: - Gradle, gradle plugins and maven artifact updated - Bug fix updates for ZooKeeper, Jackson, EasyMock and Snappy Release notes for ZooKeeper 3.4.11: https://zookeeper.apache.org/doc/r3.4.11/releasenotes.html Notable fix is improved handling of UnknownHostException: 
https://issues.apache.org/jira/browse/ZOOKEEPER-2614 > Update ZooKeeper to 3.4.11 and other minor updates > -- > > Key: KAFKA-6390 > URL: https://issues.apache.org/jira/browse/KAFKA-6390 > Project: Kafka > Issue Type: Bug >Reporter: Ismael Juma >Assignee: Ismael Juma > > https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11, Gradle and other minor updates
[ https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6390: --- Summary: Update ZooKeeper to 3.4.11, Gradle and other minor updates (was: Update ZooKeeper to 3.4.11 and other minor updates) > Update ZooKeeper to 3.4.11, Gradle and other minor updates > -- > > Key: KAFKA-6390 > URL: https://issues.apache.org/jira/browse/KAFKA-6390 > Project: Kafka > Issue Type: Bug >Reporter: Ismael Juma >Assignee: Ismael Juma > > https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates
[ https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismael Juma updated KAFKA-6390: --- Description: https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix. > Update ZooKeeper to 3.4.11 and other minor updates > -- > > Key: KAFKA-6390 > URL: https://issues.apache.org/jira/browse/KAFKA-6390 > Project: Kafka > Issue Type: Bug >Reporter: Ismael Juma >Assignee: Ismael Juma > > https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates
Ismael Juma created KAFKA-6390: -- Summary: Update ZooKeeper to 3.4.11 and other minor updates Key: KAFKA-6390 URL: https://issues.apache.org/jira/browse/KAFKA-6390 Project: Kafka Issue Type: Bug Reporter: Ismael Juma Assignee: Ismael Juma -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed
[ https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298188#comment-16298188 ] David Cheung edited comment on KAFKA-6188 at 12/20/17 9:54 AM: --- Hi, I am facing exactly the same problem here. My stack: Kafka (4.0 tag from https://hub.docker.com/r/confluentinc/cp-enterprise-kafka/) running on Docker Swarm on Amazon EC2 instances. The storage I use is Amazon EFS. In my case, some log files cannot be deleted, which triggers this bug: {code} Caused by: java.nio.file.FileSystemException: /var/lib/kafka/data/ksql_transient_8376289768731246768_1513675960541-KSTREAM-REDUCE-STATE-STORE-03-changelog-1.a9edc755278d425e9227bb03eb0cd55f-delete/.nfs937861751206a94a0fa2: Device or resource busy ... ... [2017-12-19 10:56:37,681] INFO Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager) [2017-12-19 10:56:37,682] FATAL Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogManager) {code} was (Author: chubao): Hi, I am facing exactly the same problem here. My stack: Kafka (4.0 tag from https://hub.docker.com/r/confluentinc/cp-enterprise-kafka/) running on Docker Swarm on Amazon EC2 instances. The storage I use is Amazon EFS. In my case, some log files cannot be deleted, which triggers this bug: {code} Caused by: java.nio.file.FileSystemException: /var/lib/kafka/data/ksql_transient_8376289768731246768_1513675960541-KSTREAM-REDUCE-STATE-STORE-03-changelog-1.a9edc755278d425e9227bb03eb0cd55f-delete/.nfs937861751206a94a0fa2: Device or resource busy {code} > Broker fails with FATAL Shutdown - log dirs have failed > --- > > Key: KAFKA-6188 > URL: https://issues.apache.org/jira/browse/KAFKA-6188 > Project: Kafka > Issue Type: Bug > Components: clients, log >Affects Versions: 1.0.0 > Environment: Windows 10 >Reporter: Valentina Baljak >Priority: Blocker > > Just started with version 1.0.0 after 4-5 months of using 0.10.2.1.
The > test environment is very simple, with only one producer and one consumer. > Initially, everything started fine, stand alone tests worked as expected. > However, running my code, Kafka clients fail after approximately 10 minutes. > Kafka won't start after that and it fails with the same error. > Deleting logs helps to start again, and the same problem occurs. > Here is the error traceback: > [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 30 > ms. (kafka.log.LogManager) > [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of > 9223372036854775807 ms. (kafka.log.LogManager) > [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. > (kafka.network.Acceptor) > [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor > threads (kafka.network.SocketServer) > [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting > (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) > [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting > (kafka.server.ReplicaManager$LogDirFailureHandler) > [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving > replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs > (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions are > offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs > (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed > fetcher for partitions (kafka.server.ReplicaFetcherManager) > [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped > fetcher for partitions because they are in the 
failed log dir > C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager) > [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir > C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager) > [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in > C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
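For context on the {{.nfs937861751206a94a0fa2}} path in the stack trace above: when a file on NFS is deleted while another process still holds it open, the NFS client renames it to a hidden {{.nfs*}} placeholder ("silly rename"), and the containing directory cannot be removed until the last handle is closed. A hypothetical diagnostic helper, not part of Kafka (the class and method names are invented for illustration), that lists such placeholders before a directory delete is attempted:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NfsPlaceholderCheck {

    // Lists NFS "silly rename" placeholders (.nfsXXXX...) left behind when a
    // deleted file is still held open somewhere. While any of these exist,
    // removing the containing directory fails with "Device or resource busy",
    // as in the stack trace above.
    static List<Path> nfsPlaceholders(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().startsWith(".nfs"))
                    .collect(Collectors.toList());
        }
    }
}
```

Running something like this against a log segment directory on EFS before cleanup would show whether another process is still keeping a deleted segment open.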
[jira] [Created] (KAFKA-6389) Expose transaction metrics via JMX
Florent Ramière created KAFKA-6389: -- Summary: Expose transaction metrics via JMX Key: KAFKA-6389 URL: https://issues.apache.org/jira/browse/KAFKA-6389 Project: Kafka Issue Type: Improvement Components: metrics Affects Versions: 1.0.0 Reporter: Florent Ramière Expose various metrics from https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka Especially * number of transactions * number of current transactions * timeout -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298169#comment-16298169 ] ASF GitHub Bot commented on KAFKA-6323: --- Github user fredfp closed the pull request at: https://github.com/apache/kafka/pull/4304 > punctuate with WALL_CLOCK_TIME triggered immediately > > > Key: KAFKA-6323 > URL: https://issues.apache.org/jira/browse/KAFKA-6323 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Frederic Arno >Assignee: Frederic Arno > Fix For: 1.1.0, 1.0.1 > > > When working on a custom Processor from which I am scheduling a punctuation > using WALL_CLOCK_TIME. I've noticed that whatever the punctuation interval I > set, a call to my Punctuator is always triggered immediately. > Having a quick look at kafka-streams' code, I could find that all > PunctuationSchedule's timestamps are matched against the current time in > order to decide whether or not to trigger the punctuator > (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). > However, I've only seen code that initializes PunctuationSchedule's timestamp > to 0, which I guess is what is causing an immediate punctuation. > At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's > timestamp be initialized to current time + interval? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
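The bug described in the quote above can be boiled down to a few lines. This is a simplified model, not the actual org.apache.kafka.streams.processor.internals.PunctuationQueue code: a schedule whose timestamp starts at 0 satisfies the {{now >= timestamp}} check on the very first evaluation, so it fires immediately regardless of the interval, while one initialized to {{now + interval}} waits a full interval first.

```java
// Simplified model of a punctuation schedule; an illustration of the reported
// bug and the proposed fix, not the real Kafka Streams internals.
public class PunctuationScheduleModel {
    long timestamp;       // next time (ms) at which the punctuator should fire
    final long interval;  // punctuation interval in ms

    PunctuationScheduleModel(long firstTimestamp, long interval) {
        this.timestamp = firstTimestamp;
        this.interval = interval;
    }

    // Mirrors the mayPunctuate check: fire once current time reaches the
    // schedule's timestamp, then push the timestamp one interval ahead.
    boolean mayPunctuate(long now) {
        if (now >= timestamp) {
            timestamp = now + interval;
            return true;
        }
        return false;
    }
}
```

With {{firstTimestamp = 0}} (the reported behavior) the first check fires no matter how large the interval is; with {{firstTimestamp = now + interval}} the first firing happens only after a full interval has elapsed.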
[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately
[ https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298166#comment-16298166 ] Frederic Arno commented on KAFKA-6323: -- I'm fine with most of that; my only doubt is about not aligning wall-clock punctuations on {{now + N*interval}}, which could effectively let punctuation calls drift. Do you have use cases where spacing punctuations by at least the interval is critical and requires that behavior? I've pushed updated code, in which I do not allow punctuation time drift (this makes the behavior more coherent between stream-time and wall-clock-time punctuation). By default, the new code aligns punctuations as discussed above. I've also added an overload that lets users choose the first punctuation time; that first punctuation time then becomes the reference on which further punctuations are aligned. {code:java} public Cancellable schedule(final long startTime, final long interval, final PunctuationType type, final Punctuator punctuator) {code} In my use case, I use wall-clock-time punctuation to punctuate every day at 2am. I would use the new API the following way, allowing me to call {{context.schedule()}} once instead of twice as currently required: {code:java} context.schedule(timeUntil2Am, 24 * 60 * 60 * 1000, WALL_CLOCK_TIME, (callTime) -> doStuffRightAfter2am(callTime)) {code} > punctuate with WALL_CLOCK_TIME triggered immediately > > > Key: KAFKA-6323 > URL: https://issues.apache.org/jira/browse/KAFKA-6323 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.0 >Reporter: Frederic Arno >Assignee: Frederic Arno > Fix For: 1.1.0, 1.0.1 > > > When working on a custom Processor from which I am scheduling a punctuation > using WALL_CLOCK_TIME. I've noticed that whatever the punctuation interval I > set, a call to my Punctuator is always triggered immediately. 
> Having a quick look at kafka-streams' code, I could find that all > PunctuationSchedule's timestamps are matched against the current time in > order to decide whether or not to trigger the punctuator > (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). > However, I've only seen code that initializes PunctuationSchedule's timestamp > to 0, which I guess is what is causing an immediate punctuation. > At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's > timestamp be initialized to current time + interval? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
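{{timeUntil2Am}} is used but never defined in the comment above. A sketch of one way to compute such a delay with java.time; the class and method names here are invented for illustration, and it assumes the proposed {{startTime}} parameter is a delay in milliseconds, as the usage example suggests:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZonedDateTime;

public class FirstPunctuation {

    // Milliseconds from 'now' until the next occurrence of the given
    // wall-clock time (e.g. 02:00), suitable as the first-punctuation delay
    // for the proposed schedule(startTime, interval, type, punctuator) overload.
    static long millisUntilNext(ZonedDateTime now, LocalTime target) {
        ZonedDateTime next = now.with(target); // today at the target time
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // target time already passed today
        }
        return Duration.between(now, next).toMillis();
    }
}
```

Note that computing the delay from a zoned clock rather than raw epoch arithmetic keeps the "every day at 2am" intent correct across DST transitions of the first firing.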