[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299415#comment-16299415
 ] 

Matthias J. Sax commented on KAFKA-6323:


[~guozhang] I agree that we should pass in the current wall-clock time {{NOW}} when 
calling a punctuation. But from my understanding, that is what we are doing at 
the moment anyway, and this holds for wall-clock as well as stream-time. Or am 
I missing something?

> punctuate with WALL_CLOCK_TIME triggered immediately
> 
>
> Key: KAFKA-6323
> URL: https://issues.apache.org/jira/browse/KAFKA-6323
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Frederic Arno
>Assignee: Frederic Arno
> Fix For: 1.1.0, 1.0.1
>
>
> While working on a custom Processor from which I am scheduling a punctuation 
> using WALL_CLOCK_TIME, I've noticed that whatever punctuation interval I set, 
> a call to my Punctuator is always triggered immediately.
> Having a quick look at kafka-streams' code, I found that all 
> PunctuationSchedules' timestamps are matched against the current time in 
> order to decide whether or not to trigger the punctuator 
> (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). 
> However, I've only seen code that initializes a PunctuationSchedule's timestamp 
> to 0, which I guess is what is causing the immediate punctuation.
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's 
> timestamp be initialized to current time + interval?
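
For illustration, a minimal sketch of the proposed initialization; the class and 
field names here are hypothetical, not the actual kafka-streams internals:

{code}
// Hypothetical sketch (illustrative names, not the real PunctuationSchedule):
// schedule the first wall-clock punctuation one interval in the future
// instead of at timestamp 0, which fires immediately.
class WallClockSchedule {
    private final long intervalMs;
    private long nextPunctuationMs;

    WallClockSchedule(long intervalMs, long nowMs) {
        this.intervalMs = intervalMs;
        // before: effectively nextPunctuationMs = 0 -> punctuates at once
        this.nextPunctuationMs = nowMs + intervalMs;  // proposed behavior
    }

    // mirrors the check described for PunctuationQueue#mayPunctuate
    boolean mayPunctuate(long nowMs) {
        if (nowMs >= nextPunctuationMs) {
            nextPunctuationMs = nowMs + intervalMs;
            return true;
        }
        return false;
    }
}
{code}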



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299408#comment-16299408
 ] 

ASF GitHub Bot commented on KAFKA-6366:
---

GitHub user hachikuji opened a pull request:

https://github.com/apache/kafka/pull/4349

KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits 
during coordinator disconnect

When the coordinator is marked unknown, we explicitly disconnect its 
connection and cancel pending requests. Currently the disconnect happens before 
the coordinator state is set to null, which means that callbacks which inspect 
the coordinator state will see it still as active. This can lead to further 
requests being sent. In pathological cases, the disconnect itself is not able 
to return because new requests are sent to the coordinator before the 
disconnect can complete, which leads to the stack overflow error. To fix the 
problem, I have reordered the disconnect to happen after the coordinator is set 
to null.

I have added a basic test case to verify that callbacks for in-flight or 
unsent requests see the coordinator as unknown, which prevents them from 
attempting to resend. We may need additional test cases after we determine 
whether this is in fact what is happening in the reported ticket.

Note that I have also included some minor cleanups which I noticed along 
the way.
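
As a rough sketch of the reordering, with simplified, hypothetical types rather 
than the actual AbstractCoordinator internals:

{code}
// Simplified, hypothetical sketch of the fix; NetworkClient stands in for
// the real ConsumerNetworkClient.
class CoordinatorStateSketch {
    interface NetworkClient { void disconnect(String node); }

    private final NetworkClient client;
    private String coordinator;  // null means "coordinator unknown"

    CoordinatorStateSketch(NetworkClient client, String coordinator) {
        this.client = client;
        this.coordinator = coordinator;
    }

    synchronized void coordinatorDead() {
        if (coordinator != null) {
            String oldCoordinator = coordinator;
            // Clear the coordinator *before* disconnecting, so the failure
            // callbacks fired while cancelling pending requests observe an
            // unknown coordinator and do not resend; pre-fix they could
            // recurse back into coordinatorDead() until the stack overflowed.
            coordinator = null;
            client.disconnect(oldCoordinator);
        }
    }
}
{code}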

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation 
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hachikuji/kafka KAFKA-6366

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4349.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4349


commit 488de3dca5be6111fd447980c8e79477259dc99a
Author: Jason Gustafson 
Date:   2017-12-18T18:53:38Z

KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits 
during coordinator disconnect




> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
>Assignee: Jason Gustafson
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.ap

[jira] [Commented] (KAFKA-6126) Reduce rebalance time by not checking if created topics are available

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299406#comment-16299406
 ] 

ASF GitHub Bot commented on KAFKA-6126:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4322


> Reduce rebalance time by not checking if created topics are available
> -
>
> Key: KAFKA-6126
> URL: https://issues.apache.org/jira/browse/KAFKA-6126
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
> Fix For: 1.1.0
>
>
> Within {{StreamPartitionAssignor#assign}} we create new topics and afterwards 
> wait in an "infinite loop" until the topic metadata has propagated throughout 
> the cluster. We do this to make sure topics are available when we start 
> processing.
> However, with this approach we "extend" the time spent in the rebalance phase 
> and thus are not responsive (there are no calls to `poll` for the liveness 
> check, and {{KafkaStreams#close}} suffers). Thus, we might want to remove this 
> check and handle potential "topic not found" exceptions in the main thread 
> gracefully, as sketched below.
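
A conceptual sketch of the change; the helper names are hypothetical stand-ins, 
not the actual StreamPartitionAssignor code:

{code}
import java.util.Set;

// Hypothetical sketch contrasting the two approaches; createTopics() and
// allTopicsInMetadata() are illustrative stubs.
class AssignSketch {
    static void createTopics(Set<String> topics) { /* admin call */ }
    static boolean allTopicsInMetadata(Set<String> topics) { return true; }

    // Before: assign() blocks the rebalance until metadata has propagated,
    // so the thread cannot call poll() and KafkaStreams#close() stalls.
    static void assignBlocking(Set<String> newTopics) throws InterruptedException {
        createTopics(newTopics);
        while (!allTopicsInMetadata(newTopics)) {
            Thread.sleep(100);  // the "infinite loop" during the rebalance
        }
    }

    // After: assign() returns immediately; a transient "topic not found"
    // error in the main thread is then treated as retriable, not fatal.
    static void assignNonBlocking(Set<String> newTopics) {
        createTopics(newTopics);  // no metadata wait here
    }
}
{code}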



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5863) Potential null dereference in DistributedHerder#reconfigureConnector()

2017-12-20 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated KAFKA-5863:
--
Description: 
Here is the call chain:

{code}
RestServer.httpRequest(reconfigUrl, "POST", 
taskProps, null);
{code}

In httpRequest():
{code}
} else if (responseCode >= 200 && responseCode < 300) {
InputStream is = connection.getInputStream();
T result = JSON_SERDE.readValue(is, responseFormat);
{code}
For readValue():
{code}
public <T> T readValue(InputStream src, TypeReference valueTypeRef)
throws IOException, JsonParseException, JsonMappingException
{
return (T) _readMapAndClose(_jsonFactory.createParser(src), 
_typeFactory.constructType(valueTypeRef));
{code}
Then there would be an NPE in constructType():
{code}
public JavaType constructType(TypeReference<?> typeRef)
{
// 19-Oct-2015, tatu: Simpler variant like so should work
return _fromAny(null, typeRef.getType(), EMPTY_BINDINGS);
{code}
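
A minimal illustration of a guard at the call site (a hypothetical helper, not 
the actual Connect code):

{code}
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: skip deserialization when no response format was
// supplied, since passing a null TypeReference into readValue() ends with
// constructType() dereferencing null via typeRef.getType().
class ResponseReader {
    private static final ObjectMapper JSON_SERDE = new ObjectMapper();

    static <T> T readBody(InputStream is, TypeReference<T> responseFormat) throws IOException {
        if (responseFormat == null) {
            return null;  // caller did not ask for a deserialized body
        }
        return JSON_SERDE.readValue(is, responseFormat);
    }
}
{code}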

  was:
Here is the call chain:
{code}
RestServer.httpRequest(reconfigUrl, "POST", 
taskProps, null);
{code}

In httpRequest():
{code}
} else if (responseCode >= 200 && responseCode < 300) {
InputStream is = connection.getInputStream();
T result = JSON_SERDE.readValue(is, responseFormat);
{code}
For readValue():
{code}
public <T> T readValue(InputStream src, TypeReference valueTypeRef)
throws IOException, JsonParseException, JsonMappingException
{
return (T) _readMapAndClose(_jsonFactory.createParser(src), 
_typeFactory.constructType(valueTypeRef));
{code}
Then there would be an NPE in constructType():
{code}
public JavaType constructType(TypeReference<?> typeRef)
{
// 19-Oct-2015, tatu: Simpler variant like so should work
return _fromAny(null, typeRef.getType(), EMPTY_BINDINGS);
{code}


> Potential null dereference in DistributedHerder#reconfigureConnector()
> --
>
> Key: KAFKA-5863
> URL: https://issues.apache.org/jira/browse/KAFKA-5863
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> Here is the call chain:
> {code}
> RestServer.httpRequest(reconfigUrl, "POST", 
> taskProps, null);
> {code}
> In httpRequest():
> {code}
> } else if (responseCode >= 200 && responseCode < 300) {
> InputStream is = connection.getInputStream();
> T result = JSON_SERDE.readValue(is, responseFormat);
> {code}
> For readValue():
> {code}
> public <T> T readValue(InputStream src, TypeReference valueTypeRef)
> throws IOException, JsonParseException, JsonMappingException
> {
> return (T) _readMapAndClose(_jsonFactory.createParser(src), 
> _typeFactory.constructType(valueTypeRef));
> {code}
> Then there would be an NPE in constructType():
> {code}
> public JavaType constructType(TypeReference<?> typeRef)
> {
> // 19-Oct-2015, tatu: Simpler variant like so should work
> return _fromAny(null, typeRef.getType(), EMPTY_BINDINGS);
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.

2017-12-20 Thread huxihx (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299357#comment-16299357
 ] 

huxihx commented on KAFKA-6375:
---

Disable the Windows firewall and retry to see if this exception disappears.
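
On Windows, {{java.nio.channels.Selector.open()}} internally sets up a loopback 
socket pair (see the {{sun.nio.ch.PipeImpl}} frames in the trace below), so a 
minimal probe like the following, which is only a diagnostic sketch, should fail 
with the same "Unable to establish loopback connection" error if the firewall is 
the culprit:

{code}
import java.io.IOException;
import java.nio.channels.Selector;

public class LoopbackSelectorProbe {
    public static void main(String[] args) throws IOException {
        // On Windows this opens a loopback connection under the hood; if a
        // firewall blocks it, this fails the same way as the broker's
        // ReplicaFetcherThread creation in the trace below.
        try (Selector selector = Selector.open()) {
            System.out.println("Loopback selector opened successfully.");
        }
    }
}
{code}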

> Follower replicas can never catch up to be ISR due to creating 
> ReplicaFetcherThread failed.
> ---
>
> Key: KAFKA-6375
> URL: https://issues.apache.org/jira/browse/KAFKA-6375
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: Windows,  23 brokers KafkaCluster
>Reporter: Rong Tang
>
> Hi, I ran into a case where, on one broker, the out-of-sync replicas never 
> catch up.
> When the broker starts up, it receives LeaderAndISR requests from the 
> controller, which call createFetcherThread; the thread creation failed with 
> the exceptions below.
> After that, there is no fetcher for these follower replicas, and they stay 
> out of sync forever, unless the broker later receives LeaderAndISR requests 
> with a higher leader EPOCH. The broker had 260 out of 330 replicas out of 
> sync for one day, until I restarted it.
> Restarting the broker can mitigate the issue.
> I have 2 questions.
> First, why did creating the new ReplicaFetcherThread fail?
> *Second, should Kafka do something to fail over, instead of leaving the 
> broker in an abnormal state?*
> It is a 23-broker Kafka cluster running on Windows; each broker has 330 
> replicas.
> [2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing 
> LeaderAndIsr request with correlationId 1 received from controller 427703487 
> epoch 22 (state.change.logger)
> org.apache.kafka.common.KafkaException: java.io.IOException: Unable to 
> establish loopback connection
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
>   at 
> kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
>   at 
> kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
>   at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
>   at 
> kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
>   at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to establish loopback connection
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
>   at 
> sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
>   at java.nio.channels.Pipe.open(Pipe.java:155)
>   at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
>   at 
> sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
>   at java.nio.channels.Selector.open(Selector.java:227)
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:122)
>   ... 16 more
> Caused by: java.net.ConnectException: Connection timed out: connect
>   at sun.nio.ch.Net.connect0(Native Method)
>   at sun.nio.ch.Net.connect(Net.java:454)
>   at sun.nio.ch.Net.connect(Net.java:446)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>   at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
>   at 
> sun.nio.ch.PipeImpl$Initializer$LoopbackConnector.run(PipeImpl.java:127)
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:76)
>   ... 25 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Frederic Arno (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299347#comment-16299347
 ] 

Frederic Arno commented on KAFKA-6323:
--

Thanks for your comments, I'll consider a KIP for the added API method when I'm 
back from a 2-week leave.

> punctuate with WALL_CLOCK_TIME triggered immediately
> 
>
> Key: KAFKA-6323
> URL: https://issues.apache.org/jira/browse/KAFKA-6323
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Frederic Arno
>Assignee: Frederic Arno
> Fix For: 1.1.0, 1.0.1
>
>
> While working on a custom Processor from which I am scheduling a punctuation 
> using WALL_CLOCK_TIME, I've noticed that whatever punctuation interval I set, 
> a call to my Punctuator is always triggered immediately.
> Having a quick look at kafka-streams' code, I found that all 
> PunctuationSchedules' timestamps are matched against the current time in 
> order to decide whether or not to trigger the punctuator 
> (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). 
> However, I've only seen code that initializes a PunctuationSchedule's timestamp 
> to 0, which I guess is what is causing the immediate punctuation.
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's 
> timestamp be initialized to current time + interval?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.

2017-12-20 Thread Rong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299342#comment-16299342
 ] 

Rong Tang edited comment on KAFKA-6375 at 12/21/17 12:53 AM:
-

Hi, [~huxi_2b]

Any thoughts on why the "exception: Unable to establish loopback connection" 
happens, or is there any way to handle this exception?

Another broker hit this exception again, and its replicas stayed out of sync 
for 2 days until I restarted it.

Both brokers had been the controller before I restarted them; I am not sure if 
that is related.

And I only see the exception when starting the broker.

Thanks.


was (Author: trjianjianjiao):
Hi, huxihx

Any thoughts on why the "exception: Unable to establish loopback connection" 
happens, or is there any way to handle this exception?

Another broker hit this exception again, and its replicas stayed out of sync 
for 2 days until I restarted it.

Both brokers had been the controller before I restarted them; I am not sure if 
that is related.

And I only see the exception when starting the broker.

Thanks.

> Follower replicas can never catch up to be ISR due to creating 
> ReplicaFetcherThread failed.
> ---
>
> Key: KAFKA-6375
> URL: https://issues.apache.org/jira/browse/KAFKA-6375
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: Windows,  23 brokers KafkaCluster
>Reporter: Rong Tang
>
> Hi, I ran into a case where, on one broker, the out-of-sync replicas never 
> catch up.
> When the broker starts up, it receives LeaderAndISR requests from the 
> controller, which call createFetcherThread; the thread creation failed with 
> the exceptions below.
> After that, there is no fetcher for these follower replicas, and they stay 
> out of sync forever, unless the broker later receives LeaderAndISR requests 
> with a higher leader EPOCH. The broker had 260 out of 330 replicas out of 
> sync for one day, until I restarted it.
> Restarting the broker can mitigate the issue.
> I have 2 questions.
> First, why did creating the new ReplicaFetcherThread fail?
> *Second, should Kafka do something to fail over, instead of leaving the 
> broker in an abnormal state?*
> It is a 23-broker Kafka cluster running on Windows; each broker has 330 
> replicas.
> [2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing 
> LeaderAndIsr request with correlationId 1 received from controller 427703487 
> epoch 22 (state.change.logger)
> org.apache.kafka.common.KafkaException: java.io.IOException: Unable to 
> establish loopback connection
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
>   at 
> kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
>   at 
> kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
>   at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
>   at 
> kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
>   at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to establish loopback connection
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
>   at 
> sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
>   at java.nio.channels.Pipe.open(Pipe.java:155)
>   at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
>   at 
> sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
>   at java.nio.channels.Selector.open(Selector.java:227)
>   at org.apache.kafka.common.network.Selector.(Select

[jira] [Commented] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.

2017-12-20 Thread Rong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299342#comment-16299342
 ] 

Rong Tang commented on KAFKA-6375:
--

Hi, huxihx

Any thoughts on why the "exception: Unable to establish loopback connection" 
happens, or is there any way to handle this exception?

Another broker hit this exception again, and its replicas stayed out of sync 
for 2 days until I restarted it.

Both brokers had been the controller before I restarted them; I am not sure if 
that is related.

And I only see the exception when starting the broker.

Thanks.

> Follower replicas can never catch up to be ISR due to creating 
> ReplicaFetcherThread failed.
> ---
>
> Key: KAFKA-6375
> URL: https://issues.apache.org/jira/browse/KAFKA-6375
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: Windows,  23 brokers KafkaCluster
>Reporter: Rong Tang
>
> Hi, I ran into a case where, on one broker, the out-of-sync replicas never 
> catch up.
> When the broker starts up, it receives LeaderAndISR requests from the 
> controller, which call createFetcherThread; the thread creation failed with 
> the exceptions below.
> After that, there is no fetcher for these follower replicas, and they stay 
> out of sync forever, unless the broker later receives LeaderAndISR requests 
> with a higher leader EPOCH. The broker had 260 out of 330 replicas out of 
> sync for one day, until I restarted it.
> Restarting the broker can mitigate the issue.
> I have 2 questions.
> First, why did creating the new ReplicaFetcherThread fail?
> *Second, should Kafka do something to fail over, instead of leaving the 
> broker in an abnormal state?*
> It is a 23-broker Kafka cluster running on Windows; each broker has 330 
> replicas.
> [2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing 
> LeaderAndIsr request with correlationId 1 received from controller 427703487 
> epoch 22 (state.change.logger)
> org.apache.kafka.common.KafkaException: java.io.IOException: Unable to 
> establish loopback connection
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
>   at 
> kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
>   at 
> kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
>   at 
> kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
>   at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
>   at 
> kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
>   at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to establish loopback connection
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
>   at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
>   at 
> sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
>   at java.nio.channels.Pipe.open(Pipe.java:155)
>   at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
>   at 
> sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
>   at java.nio.channels.Selector.open(Selector.java:227)
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:122)
>   ... 16 more
> Caused by: java.net.ConnectException: Connection timed out: connect
>   at sun.nio.ch.Net.connect0(Native Method)
>   at sun.nio.ch.Net.connect(Net.java:454)
>   at sun.nio.ch.Net.connect(Net.java:446)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>   at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
>   at 
> sun.nio.ch.PipeImpl$Initializer$LoopbackConnector.run(PipeIm

[jira] [Assigned] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-6366:
--

Assignee: Jason Gustafson

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
>Assignee: Jason Gustafson
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around hundred times.
> ...
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at

[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299311#comment-16299311
 ] 

Jason Gustafson commented on KAFKA-6366:


[~joerg.heinicke] Ok, we may need to take a look at the more verbose logs 
(DEBUG is probably good enough). One idea I had is the following. Suppose the 
consumer has a large number of records buffered. It might be possible to hit a 
race condition in which the foreground thread is hitting {{poll()}} and 
{{commitAsync()}} in a tight loop because {{max.poll.records=50}} is 
immediately satisfied with the buffered records. At some point, the heartbeat 
thread might see the coordinator disconnect and attempt to mark it dead, but 
the new offset commit requests are piling up as fast as the background thread 
can cancel them in {{disconnect()}}, which ultimately causes the stack 
overflow. 

I think ultimately the fix for this issue is going to be setting the 
coordinator to null in {{AbstractCoordinator.coordinatorDead()}} prior to 
disconnecting. I will go ahead and submit a patch to do this. It would be good 
to confirm from the logging if the scenario I mentioned above is happening or 
if it's something else. If we still can't figure out the cause, perhaps we can 
at least test with the patch.
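
For reference, the repeating frames in the attached stack trace spell out 
exactly this cycle (schematic only, abbreviated from the trace below):

{code}
// Abbreviated from the repeating frames in the stack trace:
//
//   AbstractCoordinator.coordinatorDead()
//     -> ConsumerNetworkClient.disconnect()
//       -> ConsumerNetworkClient.failUnsentRequests()
//         -> ConsumerNetworkClient.firePendingCompletedRequests()
//           -> RequestFutureCompletionHandler.fireCompletion()
//             -> ... -> CoordinatorResponseHandler.onFailure()
//               -> AbstractCoordinator.coordinatorDead()   // and so on
//
// Nulling out the coordinator before the disconnect makes onFailure() see
// the coordinator as already unknown, which breaks the cycle.
{code}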

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureComplet

[jira] [Commented] (KAFKA-6335) SimpleAclAuthorizerTest#testHighConcurrencyModificationOfResourceAcls fails intermittently

2017-12-20 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299303#comment-16299303
 ] 

Ted Yu commented on KAFKA-6335:
---

If needed, we can add logging to the codebase so that it is easier to figure 
out the cause.

> SimpleAclAuthorizerTest#testHighConcurrencyModificationOfResourceAcls fails 
> intermittently
> --
>
> Key: KAFKA-6335
> URL: https://issues.apache.org/jira/browse/KAFKA-6335
> Project: Kafka
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Manikumar
> Fix For: 1.1.0
>
>
> From 
> https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3045/testReport/junit/kafka.security.auth/SimpleAclAuthorizerTest/testHighConcurrencyModificationOfResourceAcls/
>  :
> {code}
> java.lang.AssertionError: expected acls Set(User:36 has Allow permission for 
> operations: Read from hosts: *, User:7 has Allow permission for operations: 
> Read from hosts: *, User:21 has Allow permission for operations: Read from 
> hosts: *, User:39 has Allow permission for operations: Read from hosts: *, 
> User:43 has Allow permission for operations: Read from hosts: *, User:3 has 
> Allow permission for operations: Read from hosts: *, User:35 has Allow 
> permission for operations: Read from hosts: *, User:15 has Allow permission 
> for operations: Read from hosts: *, User:16 has Allow permission for 
> operations: Read from hosts: *, User:22 has Allow permission for operations: 
> Read from hosts: *, User:26 has Allow permission for operations: Read from 
> hosts: *, User:11 has Allow permission for operations: Read from hosts: *, 
> User:38 has Allow permission for operations: Read from hosts: *, User:8 has 
> Allow permission for operations: Read from hosts: *, User:28 has Allow 
> permission for operations: Read from hosts: *, User:32 has Allow permission 
> for operations: Read from hosts: *, User:25 has Allow permission for 
> operations: Read from hosts: *, User:41 has Allow permission for operations: 
> Read from hosts: *, User:44 has Allow permission for operations: Read from 
> hosts: *, User:48 has Allow permission for operations: Read from hosts: *, 
> User:2 has Allow permission for operations: Read from hosts: *, User:9 has 
> Allow permission for operations: Read from hosts: *, User:14 has Allow 
> permission for operations: Read from hosts: *, User:46 has Allow permission 
> for operations: Read from hosts: *, User:13 has Allow permission for 
> operations: Read from hosts: *, User:5 has Allow permission for operations: 
> Read from hosts: *, User:29 has Allow permission for operations: Read from 
> hosts: *, User:45 has Allow permission for operations: Read from hosts: *, 
> User:6 has Allow permission for operations: Read from hosts: *, User:37 has 
> Allow permission for operations: Read from hosts: *, User:23 has Allow 
> permission for operations: Read from hosts: *, User:19 has Allow permission 
> for operations: Read from hosts: *, User:24 has Allow permission for 
> operations: Read from hosts: *, User:17 has Allow permission for operations: 
> Read from hosts: *, User:34 has Allow permission for operations: Read from 
> hosts: *, User:12 has Allow permission for operations: Read from hosts: *, 
> User:42 has Allow permission for operations: Read from hosts: *, User:4 has 
> Allow permission for operations: Read from hosts: *, User:47 has Allow 
> permission for operations: Read from hosts: *, User:18 has Allow permission 
> for operations: Read from hosts: *, User:31 has Allow permission for 
> operations: Read from hosts: *, User:49 has Allow permission for operations: 
> Read from hosts: *, User:33 has Allow permission for operations: Read from 
> hosts: *, User:1 has Allow permission for operations: Read from hosts: *, 
> User:27 has Allow permission for operations: Read from hosts: *) but got 
> Set(User:36 has Allow permission for operations: Read from hosts: *, User:7 
> has Allow permission for operations: Read from hosts: *, User:21 has Allow 
> permission for operations: Read from hosts: *, User:39 has Allow permission 
> for operations: Read from hosts: *, User:43 has Allow permission for 
> operations: Read from hosts: *, User:3 has Allow permission for operations: 
> Read from hosts: *, User:35 has Allow permission for operations: Read from 
> hosts: *, User:15 has Allow permission for operations: Read from hosts: *, 
> User:16 has Allow permission for operations: Read from hosts: *, User:22 has 
> Allow permission for operations: Read from hosts: *, User:26 has Allow 
> permission for operations: Read from hosts: *, User:11 has Allow permission 
> for operations: Read from hosts: *, User:38 has Allow permission for 
> operations: Read from hosts: *, User:8 has Allow permission for operations: 
> Read from

[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298915#comment-16298915
 ] 

Ted Yu edited comment on KAFKA-6366 at 12/21/17 12:19 AM:
--

Should I log another JIRA for the above ?

One aspect we need to pay attention to is avoiding flooding the log file, since 
the stack trace is much longer than the single-sentence log message.
To prevent repeated stack traces from flooding the log, we can keep a Map from 
stack trace to count (the number of times each stack trace has occurred), as 
sketched below.
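
A minimal sketch of that idea (a hypothetical helper, not existing Kafka code):

{code}
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: log the full stack trace only the first time it is
// seen; afterwards just bump a counter keyed by the trace text.
class DedupingTraceLogger {
    private final Map<String, Integer> seen = new ConcurrentHashMap<>();

    void log(Throwable t) {
        String key = Arrays.toString(t.getStackTrace());
        int count = seen.merge(key, 1, Integer::sum);
        if (count == 1) {
            t.printStackTrace();  // first occurrence: emit the full trace
        } else {
            // repeat occurrence: one summary line instead of the whole trace
            System.err.println(t + " (seen " + count + " times)");
        }
    }
}
{code}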


was (Author: yuzhih...@gmail.com):
Should I log another JIRA for the above ?

One aspect we need to pay attention to is avoiding flooding the log file, since 
the stack trace is much longer than the single-sentence log message.

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around hundred times.
> ...
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRe

[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:10 AM:
--

Simply for the quantity structure: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
those and consists of 5 service instances with 6 KafkaConsumers each. 
Assuming theoretically steady processing (in this particular incident 
processing seemed steady enough, even though it often starts fluctuating 
strongly), this means around 3k messages per minute per thread, or 50 
messages per second. The batch size is also rather small with just 50 messages, 
so 1 batch and thereby one async commit per second. The number of async commit 
failures is slightly off: e.g. > 5,000 failures/log entries between 20:38 and 
21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times 
as high as expected even if all commits failed within that time.
[^Screenshot-2017-12-19 21.35-22.10 processing.png]
(Timings here are UTC + 1 while in the log file it's UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.
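
The arithmetic spelled out (all numbers taken from the comment above):

{code}
// 100,000 msgs/min / 30 consumer threads ≈ 3,333 msgs/min/thread ≈ 55 msgs/s
// batch size 50            -> roughly 1 async commit per second per thread
// window 20:38-21:03       =  25 min = 1,500 s
// upper bound if every single commit failed: ~1,500 failures per thread
// observed: > 5,000 failures/log entries -> more than 3x that bound
{code}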


was (Author: joerg.heinicke):
Simply for the quantity structure: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
those and consists of 5 service instances with 6 KafkaConsumers each. 
Assuming theoretically steady processing (in this particular incident 
processing seemed steady enough, even though it often starts fluctuating 
strongly), this means around 3k messages per minute per thread, or 50 
messages per second. The batch size is also rather small with just 50 messages, 
so 1 batch and thereby one async commit per second. The number of async commit 
failures is slightly off: e.g. > 5,000 failures/log entries between 20:38 and 
21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times 
as high as expected even if all commits failed within that time.
!Screenshot-2017-12-19 21.35-22.10 processing.png|thumbnail!
(Timings here are UTC + 1 while in the log file it's UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.l

[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:10 AM:
--

Simply for the quantity structure: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
those and consists of 5 service instances with 6 KafkaConsumers each. 
Assuming theoretically steady processing (in this particular incident 
processing seemed steady enough ([^Screenshot-2017-12-19 21.35-22.10 
processing.png]; timings there are UTC + 1 while the log file is UTC), 
while it often starts fluctuating strongly), this means around 3k messages per 
minute per thread, or 50 messages per second. The batch size is also rather 
small with just 50 messages, so 1 batch and thereby one async commit per 
second. The number of async commit failures is slightly off: e.g. > 5,000 
failures/log entries between 20:38 and 21:03, i.e. within 25 mins or 1,500 s. 
So the number is still more than 3 times as high as expected even if all 
commits failed within that time.

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


was (Author: joerg.heinicke):
Simply for the quantity structure: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
those and consists of 5 service instances with 6 KafkaConsumers each. 
Assuming theoretically steady processing (in this particular incident 
processing seemed steady enough, even though it often starts fluctuating 
strongly), this means around 3k messages per minute per thread, or 50 
messages per second. The batch size is also rather small with just 50 messages, 
so 1 batch and thereby one async commit per second. The number of async commit 
failures is slightly off: e.g. > 5,000 failures/log entries between 20:38 and 
21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times 
as high as expected even if all commits failed within that time.
[^Screenshot-2017-12-19 21.35-22.10 processing.png]
(Timings here are UTC + 1 while in the log file it's UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip, 
> Screenshot-2017-12-19 21.35-22.10 processing.png
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError occurs in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands, of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers

[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:09 AM:
--

Simply for the quantity structure: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
those and consists of 5 service instances with 6 KafkaConsumers each. 
Assuming theoretically steady processing (in this particular incident 
processing seemed steady enough, even though it often starts fluctuating 
strongly), this means around 3k messages per minute per thread, or 50 
messages per second. The batch size is also rather small with just 50 messages, 
so 1 batch and thereby one async commit per second. The number of async commit 
failures is slightly off: e.g. > 5,000 failures/log entries between 20:38 and 
21:03, i.e. within 25 mins or 1,500 s. So the number is still more than 3 times 
as high as expected even if all commits failed within that time.
!Screenshot-2017-12-19 21.35-22.10 processing.png|thumbnail!
(Timings here are UTC + 1 while in the log file it's UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


was (Author: joerg.heinicke):
Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (in this particular incident processing seemed 
steady enough, even though it often starts fluctuating strongly), this means 
around 3k messages per minute per thread, or about 50 messages per second. The 
batch size is also rather small at just 50 messages, so one batch and thereby 
one async commit per second. The number of async commit failures is slightly 
off: e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 
25 mins or 1,500 s. So the number is still more than 3 times as high as 
expected, even if all commits failed within that time.
[^Screenshot-2017-12-19 21.35-22.10 processing.png]
(Timings here are UTC+1, while the log file is in UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Heinicke updated KAFKA-6366:
--
Attachment: (was: Screenshot-2017-12-21 processing.png)


[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Heinicke updated KAFKA-6366:
--
Attachment: Screenshot-2017-12-19 21.35-22.10 processing.png


[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:08 AM:
--

Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (in this particular incident processing seemed 
steady enough, even though it often starts fluctuating strongly), this means 
around 3k messages per minute per thread, or about 50 messages per second. The 
batch size is also rather small at just 50 messages, so one batch and thereby 
one async commit per second. The number of async commit failures is slightly 
off: e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 
25 mins or 1,500 s. So the number is still more than 3 times as high as 
expected, even if all commits failed within that time.
[^Screenshot-2017-12-19 21.35-22.10 processing.png]
(Timings here are UTC+1, while the log file is in UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


was (Author: joerg.heinicke):
Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (in this particular incident processing seemed 
steady enough, even though it often starts fluctuating strongly), this means 
around 3k messages per minute per thread, or about 50 messages per second. The 
batch size is also rather small at just 50 messages, so one batch and thereby 
one async commit per second. The number of async commit failures is slightly 
off: e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 
25 mins or 1,500 s. So the number is still more than 3 times as high as 
expected, even if all commits failed within that time.
[^Screenshot-2017-12-19 processing.png]
(Timings here are UTC+1, while the log file is in UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


[jira] [Comment Edited] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke edited comment on KAFKA-6366 at 12/21/17 12:07 AM:
--

Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (in this particular incident processing seemed 
steady enough, even though it often starts fluctuating strongly), this means 
around 3k messages per minute per thread, or about 50 messages per second. The 
batch size is also rather small at just 50 messages, so one batch and thereby 
one async commit per second. The number of async commit failures is slightly 
off: e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 
25 mins or 1,500 s. So the number is still more than 3 times as high as 
expected, even if all commits failed within that time.
[^Screenshot-2017-12-19 processing.png]
(Timings here are UTC+1, while the log file is in UTC.)

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


was (Author: joerg.heinicke):
Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (unfortunately it's not; in this erroneous 
case the processing throughput usually fluctuates strongly), this means around 
3k messages per minute per thread, or about 50 messages per second. The batch 
size is also rather small at just 50 messages, so one batch and thereby one 
async commit per second. The number of async commit failures is slightly off: 
e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 25 
mins or 1,500 s. So the number is still more than 3 times as high as expected, 
even if all commits failed within that time.

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.


[jira] [Updated] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Heinicke updated KAFKA-6366:
--
Attachment: Screenshot-2017-12-21 processing.png


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299278#comment-16299278
 ] 

Joerg Heinicke commented on KAFKA-6366:
---

Just to give a sense of scale: our system has a throughput of about 100k 
messages per minute. The topic has 30 partitions. The consumer group matches 
that and consists of 5 service instances with 6 KafkaConsumers each. Assuming 
theoretically steady processing (unfortunately it's not; in this erroneous 
case the processing throughput usually fluctuates strongly), this means around 
3k messages per minute per thread, or about 50 messages per second. The batch 
size is also rather small at just 50 messages, so one batch and thereby one 
async commit per second. The number of async commit failures is slightly off: 
e.g. > 5,000 failures / log entries between 20:38 and 21:03, i.e. within 25 
mins or 1,500 s. So the number is still more than 3 times as high as expected, 
even if all commits failed within that time.

Btw., we are aware of the underlying issue with the infrastructure: heavily 
over-committed VMs in terms of CPU and rather low storage throughput.
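A back-of-the-envelope check of these figures, using the rounded inputs stated 
above (a sketch only; the class name is arbitrary, and the integer arithmetic 
mirrors the rounding in the comment):

{code:java}
// Sanity check of the figures above, using the rounded inputs as stated.
public class CommitFailureCheck {
    public static void main(String[] args) {
        int threads = 5 * 6;                              // 5 instances x 6 KafkaConsumers = 30
        int msgsPerSecPerThread = 100_000 / threads / 60; // ~55, rounded to ~50 above
        int commitsPerSec = msgsPerSecPerThread / 50;     // batch size 50 -> ~1 commit/s per thread
        int windowSec = 25 * 60;                          // 20:38 to 21:03 = 1,500 s
        int expectedMax = commitsPerSec * windowSec;      // ~1,500 even if every commit failed
        System.out.printf("expected <= %d failures, observed > 5,000 (~%.1fx)%n",
                expectedMax, 5000.0 / expectedMax);
    }
}
{code}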


[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299273#comment-16299273
 ] 

Guozhang Wang commented on KAFKA-6323:
--

My concern with aligning to wall-clock time is that we would intentionally 
trigger {{punctuate(T2)}} where the passed-in parameter {{T2}} is actually not 
the current system wall-clock time {{NOW}}, but smaller than it. For user 
punctuation logic that relies on the accuracy of the passed-in system time, 
that might be a real problem (on the other hand, for stream time I consider 
this much less of an issue, since stream time is data driven anyway and hence 
is defined on record timestamps only, not on when a record is being 
processed).

That being said, I do not feel strongly against aligning with wall-clock time, 
just throwing in my two cents here. If people are in favor of doing this, I'm 
OK as well. Just a reminder that we need to document the behavior clearly in 
the javadoc.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Joerg Heinicke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299258#comment-16299258
 ] 

Joerg Heinicke commented on KAFKA-6366:
---

This is our async commit code block:

{code:java}
this.kafkaConsumer.commitAsync((offsets, exception) -> {
    // Log each committed offset at debug level.
    offsets.forEach((k, v) -> log.debug(k + "\t" + v));
    if (exception != null) {
        log.error(KafkaConsumer.class.getSimpleName()
                + " failed committing offsets asynchronously!", exception);
    } else {
        log.debug("Committing consumer offsets succeeded!");
    }
});
{code}

I don't see any explicit retry logic there, and can't even imagine an implicit one.
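Indeed, the loop in the stack trace needs no user-side retry at all: two 
methods that re-enter each other through a synchronous callback chain are 
enough. A minimal, self-contained sketch of that pattern (hypothetical names, 
not the actual consumer internals):

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: mimics the coordinatorDead() -> disconnect() -> onFailure()
// cycle visible in the stack trace. Running main() throws StackOverflowError.
public class RecursionSketch {
    private final Deque<Runnable> pendingCallbacks = new ArrayDeque<>();

    void coordinatorDead() {
        // Failing an in-flight request re-marks the coordinator dead ...
        pendingCallbacks.add(this::coordinatorDead);
        disconnect();
    }

    void disconnect() {
        // ... and disconnecting fires the failure callbacks synchronously.
        while (!pendingCallbacks.isEmpty()) {
            pendingCallbacks.poll().run();
        }
    }

    public static void main(String[] args) {
        new RecursionSketch().coordinatorDead();
    }
}
{code}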


[jira] [Commented] (KAFKA-4263) QueryableStateIntegrationTest.concurrentAccess is failing occasionally in jenkins builds

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299251#comment-16299251
 ] 

ASF GitHub Bot commented on KAFKA-4263:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4342


> QueryableStateIntegrationTest.concurrentAccess is failing occasionally in 
> jenkins builds
> 
>
> Key: KAFKA-4263
> URL: https://issues.apache.org/jira/browse/KAFKA-4263
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Damian Guy
>Assignee: Matthias J. Sax
> Fix For: 1.1.0, 1.0.1
>
>
> We are seeing occasional failures of this test in Jenkins; however, it isn't 
> failing when running locally (confirmed by multiple people). Needs 
> investigating.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-4263) QueryableStateIntegrationTest.concurrentAccess is failing occasionally in jenkins builds

2017-12-20 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-4263.
--
   Resolution: Fixed
Fix Version/s: (was: 0.10.1.1)
   1.0.1
   1.1.0

Issue resolved by pull request 4342
[https://github.com/apache/kafka/pull/4342]

> QueryableStateIntegrationTest.concurrentAccess is failing occasionally in 
> jenkins builds
> 
>
> Key: KAFKA-4263
> URL: https://issues.apache.org/jira/browse/KAFKA-4263
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Damian Guy
>Assignee: Matthias J. Sax
> Fix For: 1.1.0, 1.0.1
>
>
> We are seeing occasional failures of this test in Jenkins; however, it isn't 
> failing when running locally (confirmed by multiple people). Needs 
> investigating.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299140#comment-16299140
 ] 

Matthias J. Sax commented on KAFKA-6323:


[~frederica]

We cannot simply change the interface of the `schedule` method -- this is a 
public API change and requires a KIP 
(https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals).
If you think it's a valuable addition, feel free to do a KIP for it (including 
a new JIRA to cover the change).

[~guozhang]

I see your point that aligning the start of wall-clock time punctuations might 
not be as valuable as aligning stream-time ones. However, I agree with 
[~frederica] that if we move from `T2 (T2 >= T1)` to `T2 + T`, punctuations 
shift into the future, and I think this would be undesired behavior. For long 
GC pauses etc., we would just skip the corresponding punctuations, similar to 
the skipping behavior for stream time when stream time advances by more than 
2x the punctuation interval.
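To make this concrete, here is a minimal sketch (a hypothetical scheduler, not 
the actual PunctuationQueue code, and assuming the fix that initializes the 
first deadline to start + interval): missed slots are skipped rather than 
shifted, and the punctuator still receives the actual current time, which also 
addresses the accuracy concern raised above.

{code:java}
import org.apache.kafka.streams.processor.Punctuator;

// Sketch only: grid-aligned wall-clock schedule that skips missed
// punctuations instead of drifting into the future.
public class PunctuationSketch {
    private final long interval;
    private final Punctuator punctuator;
    private long next;  // next deadline on the grid T1, T1 + T, ...

    public PunctuationSketch(long startMs, long intervalMs, Punctuator punctuator) {
        this.interval = intervalMs;
        this.punctuator = punctuator;
        this.next = startMs + intervalMs;  // first firing one interval out, not immediately
    }

    public void mayPunctuate(long nowMs) {
        if (nowMs >= next) {
            punctuator.punctuate(nowMs);             // pass NOW, not the grid time
            long missed = (nowMs - next) / interval; // whole intervals missed (e.g. GC pause)
            next += (missed + 1) * interval;         // skip them, stay on the original grid
        }
    }
}
{code}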




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299140#comment-16299140
 ] 

Matthias J. Sax edited comment on KAFKA-6323 at 12/20/17 9:40 PM:
--

[~frederica]

We cannot simply change the interface of the {{schedule()}} method -- this is 
a public API change and requires a KIP 
(https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals).
If you think it's a valuable addition, feel free to do a KIP for it (including 
a new JIRA to cover the change).

[~guozhang]

I see your point that aligning the start of wall-clock time punctuations might 
not be as valuable as aligning stream-time ones. However, I agree with 
[~frederica] that if we move from {{T2 (T2 >= T1)}} to {{T2 + T}}, 
punctuations shift into the future, and I think this would be undesired 
behavior. For long GC pauses etc., we would just skip the corresponding 
punctuations, similar to the skipping behavior for stream time when stream 
time advances by more than 2x the punctuation interval.


was (Author: mjsax):
[~frederica]

We cannot simply change the interface of the `schedule` method -- this is a 
public API change and requires a KIP 
(https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals).
If you think it's a valuable addition, feel free to do a KIP for it (including 
a new JIRA to cover the change).

[~guozhang]

I see your point that aligning the start of wall-clock time punctuations might 
not be as valuable as aligning stream-time ones. However, I agree with 
[~frederica] that if we move from `T2 (T2 >= T1)` to `T2 + T`, punctuations 
shift into the future, and I think this would be undesired behavior. For long 
GC pauses etc., we would just skip the corresponding punctuations, similar to 
the skipping behavior for stream time when stream time advances by more than 
2x the punctuation interval.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5849) Add process stop faults, round trip workload, partitioned produce-consume test

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299131#comment-16299131
 ] 

ASF GitHub Bot commented on KAFKA-5849:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4323


> Add process stop faults, round trip workload, partitioned produce-consume test
> --
>
> Key: KAFKA-5849
> URL: https://issues.apache.org/jira/browse/KAFKA-5849
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
> Fix For: 1.1.0
>
>
> Add partitioned produce consume test



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-5849) Add process stop faults, round trip workload, partitioned produce-consume test

2017-12-20 Thread Rajini Sivaram (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-5849.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 4323
[https://github.com/apache/kafka/pull/4323]

> Add process stop faults, round trip workload, partitioned produce-consume test
> --
>
> Key: KAFKA-5849
> URL: https://issues.apache.org/jira/browse/KAFKA-5849
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
> Fix For: 1.1.0
>
>
> Add partitioned produce consume test



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6394) Prevent misconfiguration of advertised listeners

2017-12-20 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-6394:
--

 Summary: Prevent misconfiguration of advertised listeners
 Key: KAFKA-6394
 URL: https://issues.apache.org/jira/browse/KAFKA-6394
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


We don't really have any protection against misconfiguration of the advertised 
listeners. Sometimes users will copy the config from one host to another during 
an upgrade. They may remember to update the broker id, but forget about the 
advertised listeners. It can be surprisingly difficult to detect this unless 
you know to look for it (e.g. you might just see a lot of NotLeaderForPartition 
errors as the fetchers connect to the wrong broker). It may not be totally 
foolproof, but for the common misconfiguration case it's probably enough to 
check existing brokers to see whether any of them has already registered the 
same advertised listener.
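A rough sketch of what such a check might look like (hypothetical helper, not 
actual broker code; it assumes the registered endpoints have already been read 
from the cluster metadata at startup):

{code:java}
import java.util.Map;
import java.util.Objects;

// Sketch only: fail fast at startup if another broker has already
// registered the same advertised endpoint (a likely copied config).
public class AdvertisedListenerCheck {
    /** registered: brokerId -> advertised "host:port" read from the cluster metadata. */
    public static void validate(int brokerId, String myEndpoint, Map<Integer, String> registered) {
        for (Map.Entry<Integer, String> entry : registered.entrySet()) {
            if (entry.getKey() != brokerId && Objects.equals(entry.getValue(), myEndpoint)) {
                throw new IllegalStateException("Advertised listener " + myEndpoint
                        + " is already registered by broker " + entry.getKey()
                        + "; was the config copied from another host?");
            }
        }
    }
}
{code}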



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5647) Use async ZookeeperClient for Admin operations

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299024#comment-16299024
 ] 

ASF GitHub Bot commented on KAFKA-5647:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4260


> Use async ZookeeperClient for Admin operations
> --
>
> Key: KAFKA-5647
> URL: https://issues.apache.org/jira/browse/KAFKA-5647
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Ismael Juma
>Assignee: Manikumar
> Fix For: 1.1.0
>
>
> Since we will be removing the ZK dependency in most of the admin clients, we 
> only need to change the admin operations used on the server side. This 
> includes converting AdminManager and the remaining usage of zkUtils in 
> KafkaApis to use ZookeeperClient/KafkaZkClient. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6393) Add tool to view active brokers

2017-12-20 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-6393:
--

 Summary: Add tool to view active brokers
 Key: KAFKA-6393
 URL: https://issues.apache.org/jira/browse/KAFKA-6393
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson


It would be helpful to have a tool to view the active brokers in the cluster. 
For example, it could include the following (see the sketch after the list):

1. Broker id and version (maybe detected through an ApiVersions request)
2. Broker listener information
3. Whether the broker is online
4. Which broker is the active controller
5. Maybe some key configs (e.g. inter-broker version and message format version)
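A sketch of a possible starting point using the public AdminClient: it covers 
broker ids, listener host/port/rack, and the controller, while versions and 
configs would need additional requests.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

// Sketch only: list the brokers currently registered in the cluster.
public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            Node controller = cluster.controller().get();
            for (Node node : cluster.nodes().get()) {
                System.out.printf("broker %d at %s:%d (rack=%s)%s%n",
                        node.id(), node.host(), node.port(), node.rack(),
                        node.id() == controller.id() ? " [controller]" : "");
            }
        }
    }
}
{code}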





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298915#comment-16298915
 ] 

Ted Yu commented on KAFKA-6366:
---

Should I log another JIRA for the above?

One aspect we need to pay attention to is avoiding flooding the log file, 
since a stack trace is much longer than the single-sentence message.


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298908#comment-16298908
 ] 

Jason Gustafson commented on KAFKA-6366:


[~tedyu] Agreed. I think we should just include the cause when we throw.
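For illustration, a minimal sketch of that suggestion (hypothetical call site; 
KafkaException does accept a cause):

{code:java}
import org.apache.kafka.common.KafkaException;

// Sketch only: rethrow with the original exception attached as the cause,
// so the full chain shows up in the logs instead of a single sentence.
public class RethrowWithCause {
    static void handleFailure(Exception original) {
        throw new KafkaException("Unexpected error in heartbeat thread", original);
    }
}
{code}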

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError in the heartbeat thread occurred due to 
> connectivity issues of the consumers to the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands of log 
> entries of following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, even 
> though at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around a hundred times.
> ...
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  

[jira] [Created] (KAFKA-6392) Do not permit message down-conversion for replicas

2017-12-20 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-6392:
--

 Summary: Do not permit message down-conversion for replicas
 Key: KAFKA-6392
 URL: https://issues.apache.org/jira/browse/KAFKA-6392
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson


We have seen several cases where down-conversion caused replicas to diverge 
from the leader in subtle ways. Generally speaking, even if we addressed all of 
the edge cases so that down-conversion worked correctly as far as consistency 
of offsets, it would probably still be a bad idea to permit down-conversion. 
For example, this can cause message timestamps to be lost if down-converting 
from v1 to v0, or transactional data could be lost if down-converting from v2 
to v1 or v0. 

> With that in mind, it would be better to forbid down-conversion for replica 
fetches. Following the normal upgrade procedure, down-conversion is not needed 
anyway, but users often skip updating the inter-broker version. It is probably 
better in these cases to let the ISR shrink until the replicas have been 
updated as well.
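
For illustration only, here is a simplified sketch of the proposed guard; the names (isReplicaFetch, logMessageFormatVersion) are assumptions for this sketch and do not mirror the broker's actual fetch path:

{code}
// Simplified sketch of "forbid down-conversion for replica fetches".
// All names here are illustrative, not Kafka's internal API.
public class DownConversionGuard {
    static final class FetchContext {
        final boolean isReplicaFetch;  // true when the fetcher is a follower
        final byte requestedVersion;   // message format the fetcher understands
        FetchContext(boolean isReplicaFetch, byte requestedVersion) {
            this.isReplicaFetch = isReplicaFetch;
            this.requestedVersion = requestedVersion;
        }
    }

    static void checkFetch(FetchContext ctx, byte logMessageFormatVersion) {
        // Consumers may still get down-converted batches, but replicas must
        // read the log as-is so they cannot silently diverge from the leader.
        if (ctx.isReplicaFetch && ctx.requestedVersion < logMessageFormatVersion) {
            throw new UnsupportedOperationException(
                "Refusing to down-convert from v" + logMessageFormatVersion
                    + " to v" + ctx.requestedVersion + " for a replica fetch;"
                    + " update the follower and inter-broker version first");
        }
    }

    public static void main(String[] args) {
        checkFetch(new FetchContext(false, (byte) 1), (byte) 2); // consumer: allowed
        try {
            checkFetch(new FetchContext(true, (byte) 1), (byte) 2); // replica: rejected
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
{code}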



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298880#comment-16298880
 ] 

Ted Yu commented on KAFKA-6366:
---

{code}
completedOffsetCommits.add(new OffsetCommitCompletion(callback, offsets,
    RetriableCommitFailedException.withUnderlyingMessage(e.getMessage())));
{code}
In Joerg's case, e.getMessage() was null.

I wonder if we can provide more information to the user when e.getMessage() is 
null, e.g. by logging e.

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError has occurred in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, albeit 
> at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around a hundred times.
> ...
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653)
>  

[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists

2017-12-20 Thread David Hay (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298798#comment-16298798
 ] 

David Hay commented on KAFKA-6388:
--

Ok... I'll look for that the next time we attempt to upgrade to 1.0. For now 
we've rolled back to 0.8.2.2 and things seem to be (mostly) stable. (We have 
one cluster that refuses to spin up the ReplicaFetcherThreads, and no 
exceptions or log messages indicate why.)


> Error while trying to roll a segment that already exists
> 
>
> Key: KAFKA-6388
> URL: https://issues.apache.org/jira/browse/KAFKA-6388
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: David Hay
>Priority: Blocker
>
> Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in 
> our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2).
> After the broker spends 30 minutes or more spewing log messages like this:
> {noformat}
> [2017-12-19 16:44:28,998] INFO Replica loaded for partition 
> screening.save.results.screening.save.results.processor.error-43 with initial 
> high watermark 0 (kafka.cluster.Replica)
> {noformat}
> Eventually, the replica thread throws the error below (also referenced in the 
> original issue).  If I remove that partition from the data directory and 
> bounce the broker, it eventually rebalances (assuming it doesn't hit a 
> different partition with the same error).
> {noformat}
> 2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.log already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.index already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.timeindex already exists; deleting it first 
> (kafka.log.Log)
> [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed 
> fetcher for partitions __consumer_offsets-20 
> (kafka.server.ReplicaFetcherManager)
> [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
> fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition 
> sr.new.sr.new.processor.error-38 offset 2
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
> at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for 
> topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it 
> already exists.
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338)
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.roll(Log.scala:1297)
> at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:624)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.append(Log.scala:624)
> at kafka.log.Log.appendAsFollower(Log.scala:607)
> at 
> kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:102)
> at 
> kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scal

[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists

2017-12-20 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298791#comment-16298791
 ] 

Jason Gustafson commented on KAFKA-6388:


[~dhay] Let me explain in a little more detail. The specific case is when the 
latest log segment (i.e. the one with the largest offset) and its index are 
both empty. There is a bug in the append logic which causes the broker to treat 
the zero-sized index file as if it were full, which triggers the log rolling 
logic. But since the log segment is empty, the new rolled log segment will have 
the same offset and we'll get the exception you saw initially. To fix the 
> problem, you should shut down the broker, remove the latest index file, and 
restart. I am not sure that is what is happening here, but it's the first thing 
I would check for. 
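
To make the failure mode concrete, here is a self-contained toy model of it; Segment and the index-full check are simplified stand-ins for kafka.log.Log, not the real broker code:

{code}
// Toy model of the bug: a zero-sized index is treated as "full", which
// triggers a roll, but because the segment is empty the rolled segment
// would start at the same base offset.
public class RollCollisionSketch {
    static final class Segment {
        final long baseOffset;
        long records = 0;        // nothing has been appended
        long indexCapacity = 0;  // zero-sized index file
        Segment(long baseOffset) { this.baseOffset = baseOffset; }
        boolean indexLooksFull() {
            // The buggy check: zero capacity leaves no free slots, so an
            // empty index is reported as full.
            return indexCapacity == 0;
        }
    }

    static Segment roll(Segment active, long nextOffset) {
        if (nextOffset == active.baseOffset) {
            throw new IllegalStateException("Trying to roll a new log segment"
                + " with start offset " + nextOffset + " while it already exists");
        }
        return new Segment(nextOffset);
    }

    public static void main(String[] args) {
        Segment active = new Segment(2);                       // latest segment, empty
        long nextOffset = active.baseOffset + active.records;  // still 2
        try {
            if (active.indexLooksFull()) {
                roll(active, nextOffset); // mirrors the reported exception
            }
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
{code}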

> Error while trying to roll a segment that already exists
> 
>
> Key: KAFKA-6388
> URL: https://issues.apache.org/jira/browse/KAFKA-6388
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: David Hay
>Priority: Blocker
>
> Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in 
> our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2).
> After the broker spends 30 minutes or more spewing log messages like this:
> {noformat}
> [2017-12-19 16:44:28,998] INFO Replica loaded for partition 
> screening.save.results.screening.save.results.processor.error-43 with initial 
> high watermark 0 (kafka.cluster.Replica)
> {noformat}
> Eventually, the replica thread throws the error below (also referenced in the 
> original issue).  If I remove that partition from the data directory and 
> bounce the broker, it eventually rebalances (assuming it doesn't hit a 
> different partition with the same error).
> {noformat}
> 2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.log already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.index already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.timeindex already exists; deleting it first 
> (kafka.log.Log)
> [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed 
> fetcher for partitions __consumer_offsets-20 
> (kafka.server.ReplicaFetcherManager)
> [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
> fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition 
> sr.new.sr.new.processor.error-38 offset 2
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
> at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for 
> topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it 
> already exists.
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338)
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.roll(Log.scala:1297)
> at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710)
> at kafka.log.Log$$anonfun$append$2.apply(L

[jira] [Commented] (KAFKA-6366) StackOverflowError in kafka-coordinator-heartbeat-thread

2017-12-20 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298761#comment-16298761
 ] 

Jason Gustafson commented on KAFKA-6366:


[~joerg.heinicke] Thanks for sharing the logs. One thing that immediately 
stands out is the large number of async offset commit failures. I counted 
13,359 instances. Considering the "Marking coordinator dead" messages, there 
are about 10,862 instances. This is just a guess, but do you have any retry 
logic implemented for when async offset commits fail? That would explain the 
large number of "Marking coordinator dead" messages as well as the stack 
overflow.
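
For illustration, this is the kind of retry-in-callback pattern being asked about; it uses the public consumer API, but the broker address, topic, and group are placeholders, and it is not necessarily what Joerg's code does:

{code}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetCommitCallback;
import org.apache.kafka.common.TopicPartition;

public class RetryInCallbackSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-consumer-group");       // placeholder
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic")); // placeholder

        // Anti-pattern: while the coordinator is unreachable, every failed
        // commit immediately queues another one, with no backoff and no cap,
        // so failed commits (and their callbacks) pile up by the thousands.
        OffsetCommitCallback retryForever = new OffsetCommitCallback() {
            @Override
            public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets,
                                   Exception exception) {
                if (exception != null) {
                    consumer.commitAsync(this); // unbounded retry
                }
            }
        };
        consumer.commitAsync(retryForever);
    }
}
{code}

A bounded retry count, a backoff, or falling back to a synchronous commitSync() on failure would avoid this pile-up.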

> StackOverflowError in kafka-coordinator-heartbeat-thread
> 
>
> Key: KAFKA-6366
> URL: https://issues.apache.org/jira/browse/KAFKA-6366
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 1.0.0
>Reporter: Joerg Heinicke
> Attachments: 6366.v1.txt, ConverterProcessor.zip
>
>
> With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing 
> once a StackOverflowError has occurred in the heartbeat thread due to 
> connectivity issues between the consumers and the coordinating broker:
> Immediately before the exception there are hundreds, if not thousands of log 
> entries of the following type:
> 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] INFO  - [Consumer clientId=consumer-4, 
> groupId=my-consumer-group] Marking the coordinator : (id: 
> 2147483645 rack: null) dead
> The exceptions always happen somewhere in the DateFormat code, albeit 
> at different lines.
> 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | 
> my-consumer-group] ERROR - Uncaught exception in thread 
> 'kafka-coordinator-heartbeat-thread | my-consumer-group':
> java.lang.StackOverflowError
>  at 
> java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362)
>  at 
> java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)
>  at java.util.Calendar.getDisplayName(Calendar.java:2110)
>  at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)
>  at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)
>  at java.text.DateFormat.format(DateFormat.java:345)
>  at 
> org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443)
>  at 
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>  at org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
>  at 
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
>  at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
>  at 
> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>  at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>  at org.apache.log4j.Category.callAppenders(Category.java:206)
>  at org.apache.log4j.Category.forcedLog(Category.java:391)
>  at org.apache.log4j.Category.log(Category.java:856)
>  at 
> org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
>  at 
> org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
> ...
> the following 9 lines are repeated around a hundred times.
> ...
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388)
>  

[jira] [Commented] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298661#comment-16298661
 ] 

ASF GitHub Bot commented on KAFKA-6391:
---

GitHub user cvaliente opened a pull request:

https://github.com/apache/kafka/pull/4347

KAFKA-6391 ensure topics are created with correct partitions BEFORE 
building the…

ensure topics are created with correct partitions BEFORE building the 
metadata for our stream tasks

First ensureCoPartitioning() on repartitionTopicMetadata before creating 
allRepartitionTopicPartitions
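
As a rough, self-contained sketch of why the ordering matters (simplified types, not the actual StreamPartitionAssignor code): ensureCoPartitioning can grow partition counts, so it has to run before the partition list used for task assignment is materialized.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AssignOrderSketch {
    // Stand-in: bump co-partitioned topics to the maximum partition count.
    static void ensureCoPartitioning(Map<String, Integer> repartitionTopicMetadata) {
        int max = repartitionTopicMetadata.values().stream()
            .max(Integer::compare).orElse(0);
        repartitionTopicMetadata.replaceAll((topic, count) -> max);
    }

    static List<String> buildPartitionList(Map<String, Integer> metadata) {
        List<String> partitions = new ArrayList<>();
        metadata.forEach((topic, count) -> {
            for (int p = 0; p < count; p++) {
                partitions.add(topic + "-" + p);
            }
        });
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, Integer> repartitionTopicMetadata = new HashMap<>();
        repartitionTopicMetadata.put("repartition-a", 4);
        repartitionTopicMetadata.put("repartition-b", 8);

        // Fixed order: adjust partition counts first...
        ensureCoPartitioning(repartitionTopicMetadata);
        // ...then build the partition list that task assignment will see.
        List<String> allRepartitionTopicPartitions =
            buildPartitionList(repartitionTopicMetadata);
        // Prints 16; building the list first would have yielded 12 and left
        // the 4 added partitions without any tasks working on them.
        System.out.println(allRepartitionTopicPartitions.size());
    }
}
{code}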

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation 
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cvaliente/kafka KAFKA-6391

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4347


commit bda1803d50d984ef4860579d508c37487df9781a
Author: Clemens Valiente 
Date:   2017-12-20T15:45:41Z

ensure topics are created with correct partitions BEFORE building the 
metadata for our stream tasks




> output from ensure copartitioning is not used for Cluster metadata, resulting 
> in partitions without tasks working on them
> -
>
> Key: KAFKA-6391
> URL: https://issues.apache.org/jira/browse/KAFKA-6391
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Clemens Valiente
>Assignee: Clemens Valiente
>   Original Estimate: 20m
>  Remaining Estimate: 20m
>
> https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394
> Map allRepartitionTopicPartitions is created 
> from repartitionTopicMetadata
> THEN we do ensureCoPartitioning on repartitionTopicMetadata
> THEN we create topics and partitions according to repartitionTopicMetadata
> THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
> THEN we use fullMetadata to assign the tasks and no longer use 
> repartitionTopicMetadata
> This results in any change made to repartitionTopicMetadata in 
> ensureCoPartitioning being used for creating partitions, but no tasks are ever 
> created for any partition added by ensureCoPartitioning().
> The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
> before creating allRepartitionTopicPartitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them

2017-12-20 Thread Clemens Valiente (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298657#comment-16298657
 ] 

Clemens Valiente commented on KAFKA-6391:
-

I dunno if I'm too dumb to understand it correctly, because it seems to be a 
relatively basic thing to get wrong.

> output from ensure copartitioning is not used for Cluster metadata, resulting 
> in partitions without tasks working on them
> -
>
> Key: KAFKA-6391
> URL: https://issues.apache.org/jira/browse/KAFKA-6391
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Clemens Valiente
>Assignee: Clemens Valiente
>   Original Estimate: 20m
>  Remaining Estimate: 20m
>
> https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394
> Map allRepartitionTopicPartitions is created 
> from repartitionTopicMetadata
> THEN we do ensureCoPartitioning on repartitionTopicMetadata
> THEN we create topics and partitions according to repartitionTopicMetadata
> THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
> THEN we use fullMetadata to assign the tasks and no longer use 
> repartitionTopicMetadata
> This results in any change made to repartitionTopicMetadata in 
> ensureCoPartitioning being used for creating partitions, but no tasks are ever 
> created for any partition added by ensureCoPartitioning().
> The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
> before creating allRepartitionTopicPartitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata

2017-12-20 Thread Clemens Valiente (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clemens Valiente updated KAFKA-6391:

Description: 
https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394


Map allRepartitionTopicPartitions is created 
from repartitionTopicMetadata
THEN we do ensureCoPartitioning on repartitionTopicMetadata
THEN we create topics and partitions according to repartitionTopicMetadata
THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
THEN we use fullMetadata to assign the tasks and no longer use 
repartitionTopicMetadata

This results in any change made to repartitionTopicMetadata in ensureCoPartitioning 
being used for creating partitions, but no tasks are ever created for any 
partition added by ensureCoPartitioning().


The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
before creating allRepartitionTopicPartitions.


  was:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L366


Map allRepartitionTopicPartitions is created 
from repartitionTopicMetadata
THEN we do ensureCoPartitioning on repartitionTopicMetadata
THEN we create topics and partitions according to repartitionTopicMetadata
THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
THEN we use fullMetadata to assign the tasks and no longer use 
repartitionTopicMetadata

This results in any change made to repartitionTopicMetadata in ensureCoPartitioning 
being used for creating partitions, but no tasks are ever created for any 
partition added by ensureCoPartitioning().


The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
before creating allRepartitionTopicPartitions.



> output from ensure copartitioning is not used for Cluster metadata
> --
>
> Key: KAFKA-6391
> URL: https://issues.apache.org/jira/browse/KAFKA-6391
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Clemens Valiente
>Assignee: Clemens Valiente
>   Original Estimate: 20m
>  Remaining Estimate: 20m
>
> https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394
> Map allRepartitionTopicPartitions is created 
> from repartitionTopicMetadata
> THEN we do ensureCoPartitioning on repartitionTopicMetadata
> THEN we create topics and partitions according to repartitionTopicMetadata
> THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
> THEN we use fullMetadata to assign the tasks and no longer use 
> repartitionTopicMetadata
> This results in any change made to repartitionTopicMetadata in 
> ensureCoPartitioning being used for creating partitions, but no tasks are ever 
> created for any partition added by ensureCoPartitioning().
> The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
> before creating allRepartitionTopicPartitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata, resulting in partitions without tasks working on them

2017-12-20 Thread Clemens Valiente (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clemens Valiente updated KAFKA-6391:

Summary: output from ensure copartitioning is not used for Cluster 
metadata, resulting in partitions without tasks working on them  (was: output 
from ensure copartitioning is not used for Cluster metadata)

> output from ensure copartitioning is not used for Cluster metadata, resulting 
> in partitions without tasks working on them
> -
>
> Key: KAFKA-6391
> URL: https://issues.apache.org/jira/browse/KAFKA-6391
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Clemens Valiente
>Assignee: Clemens Valiente
>   Original Estimate: 20m
>  Remaining Estimate: 20m
>
> https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L394
> Map allRepartitionTopicPartitions is created 
> from repartitionTopicMetadata
> THEN we do ensureCoPartitioning on repartitionTopicMetadata
> THEN we create topics and partitions according to repartitionTopicMetadata
> THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
> THEN we use fullMetadata to assign the tasks and no longer use 
> repartitionTopicMetadata
> This results in any change made to repartitionTopicMetadata in 
> ensureCoPartitioning being used for creating partitions, but no tasks are ever 
> created for any partition added by ensureCoPartitioning().
> The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
> before creating allRepartitionTopicPartitions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6391) output from ensure copartitioning is not used for Cluster metadata

2017-12-20 Thread Clemens Valiente (JIRA)
Clemens Valiente created KAFKA-6391:
---

 Summary: output from ensure copartitioning is not used for Cluster 
metadata
 Key: KAFKA-6391
 URL: https://issues.apache.org/jira/browse/KAFKA-6391
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 1.0.0
Reporter: Clemens Valiente
Assignee: Clemens Valiente


https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java#L366


Map allRepartitionTopicPartitions is created 
from repartitionTopicMetadata
THEN we do ensureCoPartitioning on repartitionTopicMetadata
THEN we create topics and partitions according to repartitionTopicMetadata
THEN we use allRepartitionTopicPartitions to create our Cluster fullMetadata
THEN we use fullMetadata to assign the tasks and no longer use 
repartitionTopicMetadata

This results in any change made to repartitionTopicMetadata in ensureCoPartitioning 
being used for creating partitions, but no tasks are ever created for any 
partition added by ensureCoPartitioning().


The fix is easy: first call ensureCoPartitioning() on repartitionTopicMetadata 
before creating allRepartitionTopicPartitions.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists

2017-12-20 Thread David Hay (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298571#comment-16298571
 ] 

David Hay commented on KAFKA-6388:
--

[~hachikuji], the exception in the description is the new stack trace. That 
is, I copied it from our logs; it's not the stack trace from the original 
issue. It doesn't seem to be limited to the last partition (assuming you mean 
the partition with the largest id). Most of our topics have the default of 53 
partitions, and the id of the partition that has problems is all over the place.

If I'm understanding correctly, when we see this error we should be able to 
recover by shutting down the node, deleting the index files, and restarting?

> Error while trying to roll a segment that already exists
> 
>
> Key: KAFKA-6388
> URL: https://issues.apache.org/jira/browse/KAFKA-6388
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: David Hay
>Priority: Blocker
>
> Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in 
> our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2).
> After the broker spends 30 minutes or more spewing log messages like this:
> {noformat}
> [2017-12-19 16:44:28,998] INFO Replica loaded for partition 
> screening.save.results.screening.save.results.processor.error-43 with initial 
> high watermark 0 (kafka.cluster.Replica)
> {noformat}
> Eventually, the replica thread throws the error below (also referenced in the 
> original issue).  If I remove that partition from the data directory and 
> bounce the broker, it eventually rebalances (assuming it doesn't hit a 
> different partition with the same error).
> {noformat}
> 2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.log already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.index already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.timeindex already exists; deleting it first 
> (kafka.log.Log)
> [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed 
> fetcher for partitions __consumer_offsets-20 
> (kafka.server.ReplicaFetcherManager)
> [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
> fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition 
> sr.new.sr.new.processor.error-38 offset 2
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
> at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for 
> topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it 
> already exists.
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338)
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.roll(Log.scala:1297)
> at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:624)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.append(Log

[jira] [Commented] (KAFKA-5746) Add new metrics to support health checks

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298520#comment-16298520
 ] 

ASF GitHub Bot commented on KAFKA-5746:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4026


> Add new metrics to support health checks
> 
>
> Key: KAFKA-5746
> URL: https://issues.apache.org/jira/browse/KAFKA-5746
> Project: Kafka
>  Issue Type: New Feature
>  Components: metrics
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
> Fix For: 1.0.0
>
>
> It will be useful to have some additional metrics to support health checks.
> Details are in 
> [KIP-188|https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+Add+new+metrics+to+support+health+checks]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5460) Documentation on website uses word-breaks resulting in confusion

2017-12-20 Thread Yuri Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuri Khan updated KAFKA-5460:
-
Attachment: 20171220-kafka-doc-tables.png

I’d like to add my vote to the dl/dt/dd style. I use a 24" monitor and tile my 
browser window with an editor, and here’s what the docs look like for me. That 
is _not_ readable.

!20171220-kafka-doc-tables.png|thumbnail!

> Documentation on website uses word-breaks resulting in confusion
> 
>
> Key: KAFKA-5460
> URL: https://issues.apache.org/jira/browse/KAFKA-5460
> Project: Kafka
>  Issue Type: Bug
>Reporter: Karel Vervaeke
> Attachments: 20171220-kafka-doc-tables.png, Screen Shot 2017-06-16 at 
> 14.45.40.png, Screenshot from 2017-06-23 14-45-02.png
>
>
> Documentation seems to suggest there is a configuration property 
> auto.off-set.reset but it really is auto.offset.reset.
> We should look into disabling the word-break CSS properties (globally or at 
> least in the configuration reference tables)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-6331) Transient failure in kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs

2017-12-20 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma resolved KAFKA-6331.

   Resolution: Fixed
Fix Version/s: 1.1.0

> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs
> --
>
> Key: KAFKA-6331
> URL: https://issues.apache.org/jira/browse/KAFKA-6331
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Reporter: Guozhang Wang
>Assignee: Dong Lin
> Fix For: 1.1.0
>
>
> Saw this error once on Jenkins: 
> https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3025/testReport/junit/kafka.api/AdminClientIntegrationTest/testAlterReplicaLogDirs/
> {code}
> Stacktrace
> java.lang.AssertionError: timed out waiting for message produce
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:347)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:861)
>   at 
> kafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs(AdminClientIntegrationTest.scala:357)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.lang.Thread.run(Thread.java:844)
> Standard Output
> [2017-12-07 19:22:56,297] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:22:59,447] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:22:59,453] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:01,335] ERROR Error while creating ephemeral at 
> /controller, node already exists and owner '99134641238966279' does not match 
> current session '99134641238966277' 
> (kafka.zk.KafkaZkClient$CheckedEphemeral:71)
> [2017-12-07 19:23:04,695] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:04,760] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:06,764] ERROR Error while creating ephemeral at 
> /controller, node already exists and owner '99134641586700293' does not match 
> current session '99134641586700295' 
> (kafka.zk.KafkaZkClient$CheckedEphemeral:71)
> [2017-12-07 19:23:09,379] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:09,387] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:11,533] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:11,539] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN serve

[jira] [Commented] (KAFKA-6331) Transient failure in kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298357#comment-16298357
 ] 

ASF GitHub Bot commented on KAFKA-6331:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/4306


> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testLogStartOffsetCheckpointkafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs
> --
>
> Key: KAFKA-6331
> URL: https://issues.apache.org/jira/browse/KAFKA-6331
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Reporter: Guozhang Wang
>Assignee: Dong Lin
>
> Saw this error once on Jenkins: 
> https://builds.apache.org/job/kafka-pr-jdk9-scala2.12/3025/testReport/junit/kafka.api/AdminClientIntegrationTest/testAlterReplicaLogDirs/
> {code}
> Stacktrace
> java.lang.AssertionError: timed out waiting for message produce
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:347)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:861)
>   at 
> kafka.api.AdminClientIntegrationTest.testAlterReplicaLogDirs(AdminClientIntegrationTest.scala:357)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.lang.Thread.run(Thread.java:844)
> Standard Output
> [2017-12-07 19:22:56,297] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:22:59,447] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:22:59,453] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:01,335] ERROR Error while creating ephemeral at 
> /controller, node already exists and owner '99134641238966279' does not match 
> current session '99134641238966277' 
> (kafka.zk.KafkaZkClient$CheckedEphemeral:71)
> [2017-12-07 19:23:04,695] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:04,760] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:06,764] ERROR Error while creating ephemeral at 
> /controller, node already exists and owner '99134641586700293' does not match 
> current session '99134641586700295' 
> (kafka.zk.KafkaZkClient$CheckedEphemeral:71)
> [2017-12-07 19:23:09,379] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:09,387] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:11,533] ERROR ZKShutdownHandler is not registered, so 
> ZooKeeper server won't take any action on ERROR or SHUTDOWN server state 
> changes (org.apache.zookeeper.server.ZooKeeperServer:472)
> [2017-12-07 19:23:11,539] ERROR ZKShutdownHandler is not registe

[jira] [Commented] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed

2017-12-20 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298267#comment-16298267
 ] 

Ismael Juma commented on KAFKA-6188:


[~chubao], would you be able to test Kafka trunk? 
https://github.com/apache/kafka/commit/a5cd34d7962ff5da9b99d0229ef5a9a5fcb3f318 
fixes a case where we would try to delete files that were still open (which 
only fails in some file systems).
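
As a small illustration of the class of problem being fixed (hypothetical names, not Kafka's log-management code): on some file systems, deleting a file that still has an open handle fails, so the handle has to be closed first.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseBeforeDeleteSketch {
    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempFile("00000000000000000002", ".log");
        RandomAccessFile handle = new RandomAccessFile(segment.toFile(), "rw");

        // Problematic order on such file systems: deleting while the handle
        // is still open can fail or be refused.
        // segment.toFile().delete();

        // Safe order: close every handle first, then delete.
        handle.close();
        Files.deleteIfExists(segment);
    }
}
{code}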

> Broker fails with FATAL Shutdown - log dirs have failed
> ---
>
> Key: KAFKA-6188
> URL: https://issues.apache.org/jira/browse/KAFKA-6188
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, log
>Affects Versions: 1.0.0
> Environment: Windows 10
>Reporter: Valentina Baljak
>Priority: Blocker
>
> Just started with version 1.0.0 after 4-5 months of using 0.10.2.1. The 
> test environment is very simple, with only one producer and one consumer. 
> Initially everything started fine; standalone tests worked as expected. 
> However, running my code, the Kafka clients fail after approximately 10 minutes. 
> Kafka won't start after that and fails with the same error. 
> Deleting the logs helps it start again, but the same problem recurs.
> Here is the error traceback:
> [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 30 
> ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of 
> 9223372036854775807 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
> (kafka.network.Acceptor)
> [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor 
> threads (kafka.network.SocketServer)
> [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting 
> (kafka.server.ReplicaManager$LogDirFailureHandler)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving 
> replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions  are 
> offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed 
> fetcher for partitions  (kafka.server.ReplicaFetcherManager)
> [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped 
> fetcher for partitions  because they are in the failed log dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager)
> [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11, Gradle and other minor updates

2017-12-20 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6390:
---
Fix Version/s: 1.1.0

> Update ZooKeeper to 3.4.11, Gradle and other minor updates
> --
>
> Key: KAFKA-6390
> URL: https://issues.apache.org/jira/browse/KAFKA-6390
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Ismael Juma
> Fix For: 1.1.0
>
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298252#comment-16298252
 ] 

ASF GitHub Bot commented on KAFKA-6390:
---

GitHub user ijuma opened a pull request:

https://github.com/apache/kafka/pull/4345

KAFKA-6390: Update ZooKeeper to 3.4.11, Gradle and other minor updates

Updates:
- Gradle, gradle plugins and maven artifact updated
- Bug fix updates for ZooKeeper, Jackson, EasyMock and Snappy

Not updated:
- RocksDB as it often causes issues, so better done separately
- args4j as our test coverage is weak and the update was a
feature release

Release notes for ZooKeeper 3.4.11:
https://zookeeper.apache.org/doc/r3.4.11/releasenotes.html

Notable fix is improved handling of UnknownHostException:
https://issues.apache.org/jira/browse/ZOOKEEPER-2614

Manually tested that IntelliJ import and build still works.
Relying on existing test suite otherwise.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation 
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ijuma/kafka 
kafka-6390-zk-3.4.11-and-other-updates

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4345


commit 5df853f9ea5425e08794507d1d104d050b56dde2
Author: Ismael Juma 
Date:   2017-12-20T10:27:57Z

KAFKA-6390: Update ZooKeeper to 3.4.11, Gradle and other minor updates

Updates:
- Gradle, gradle plugins and maven artifact updated
- Bug fix updates for ZooKeeper, Jackson, EasyMock and Snappy

Release notes for ZooKeeper 3.4.11:
https://zookeeper.apache.org/doc/r3.4.11/releasenotes.html

Notable fix is improved handling of UnknownHostException:
https://issues.apache.org/jira/browse/ZOOKEEPER-2614




> Update ZooKeeper to 3.4.11 and other minor updates
> --
>
> Key: KAFKA-6390
> URL: https://issues.apache.org/jira/browse/KAFKA-6390
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11, Gradle and other minor updates

2017-12-20 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6390:
---
Summary: Update ZooKeeper to 3.4.11, Gradle and other minor updates  (was: 
Update ZooKeeper to 3.4.11 and other minor updates)

> Update ZooKeeper to 3.4.11, Gradle and other minor updates
> --
>
> Key: KAFKA-6390
> URL: https://issues.apache.org/jira/browse/KAFKA-6390
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates

2017-12-20 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6390:
---
Description: https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a 
helpful fix.

> Update ZooKeeper to 3.4.11 and other minor updates
> --
>
> Key: KAFKA-6390
> URL: https://issues.apache.org/jira/browse/KAFKA-6390
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2614 is a helpful fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6390) Update ZooKeeper to 3.4.11 and other minor updates

2017-12-20 Thread Ismael Juma (JIRA)
Ismael Juma created KAFKA-6390:
--

 Summary: Update ZooKeeper to 3.4.11 and other minor updates
 Key: KAFKA-6390
 URL: https://issues.apache.org/jira/browse/KAFKA-6390
 Project: Kafka
  Issue Type: Bug
Reporter: Ismael Juma
Assignee: Ismael Juma






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed

2017-12-20 Thread David Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298188#comment-16298188
 ] 

David Cheung edited comment on KAFKA-6188 at 12/20/17 9:54 AM:
---

Hi, I am facing exactly the same problem here. My stack: Kafka (4.0 tag from 
https://hub.docker.com/r/confluentinc/cp-enterprise-kafka/) running on Docker 
Swarm on Amazon EC2 instances. The storage I use is Amazon EFS. In my case, 
some log files cannot be deleted, which triggers this bug:
{code}
Caused by: java.nio.file.FileSystemException: 
/var/lib/kafka/data/ksql_transient_8376289768731246768_1513675960541-KSTREAM-REDUCE-STATE-STORE-03-changelog-1.a9edc755278d425e9227bb03eb0cd55f-delete/.nfs937861751206a94a0fa2:
 Device or resource busy
...
...
[2017-12-19 10:56:37,681] INFO Stopping serving logs in dir /var/lib/kafka/data 
(kafka.log.LogManager)
[2017-12-19 10:56:37,682] FATAL Shutdown broker because all log dirs in 
/var/lib/kafka/data have failed (kafka.log.LogManager)
{code}


was (Author: chubao):
Hi, I am facing exactly the same problem here. My stack: Kafka (4.0 tag from 
https://hub.docker.com/r/confluentinc/cp-enterprise-kafka/) running on Docker 
Swarm on Amazon EC2 instances. The storage I use is Amazon EFS. In my case, 
some log files cannot be deleted, which triggers this bug:
{code}
Caused by: java.nio.file.FileSystemException: 
/var/lib/kafka/data/ksql_transient_8376289768731246768_1513675960541-KSTREAM-REDUCE-STATE-STORE-03-changelog-1.a9edc755278d425e9227bb03eb0cd55f-delete/.nfs937861751206a94a0fa2:
 Device or resource busy
{code}

> Broker fails with FATAL Shutdown - log dirs have failed
> ---
>
> Key: KAFKA-6188
> URL: https://issues.apache.org/jira/browse/KAFKA-6188
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, log
>Affects Versions: 1.0.0
> Environment: Windows 10
>Reporter: Valentina Baljak
>Priority: Blocker
>
> Just started with version 1.0.0 after 4-5 months of using 0.10.2.1. The 
> test environment is very simple, with only one producer and one consumer. 
> Initially everything started fine, and standalone tests worked as expected. 
> However, running my code, Kafka clients fail after approximately 10 minutes. 
> Kafka won't start after that; it fails with the same error. 
> Deleting the logs lets it start again, but the same problem recurs.
> Here is the error traceback:
> [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 30 
> ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of 
> 9223372036854775807 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
> (kafka.network.Acceptor)
> [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor 
> threads (kafka.network.SocketServer)
> [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting 
> (kafka.server.ReplicaManager$LogDirFailureHandler)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving 
> replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions  are 
> offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed 
> fetcher for partitions  (kafka.server.ReplicaFetcherManager)
> [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped 
> fetcher for partitions  because they are in the failed log dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager)
> [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed

2017-12-20 Thread David Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298188#comment-16298188
 ] 

David Cheung commented on KAFKA-6188:
-

Hi, I am facing exactly the same problem here. My stack: Kafka (4.0 tag from 
https://hub.docker.com/r/confluentinc/cp-enterprise-kafka/) running on Docker 
Swarm on Amazon EC2 instances. The storage I use is Amazon EFS. In my case, 
some log files cannot be deleted, which triggers this bug:
{code}
Caused by: java.nio.file.FileSystemException: 
/var/lib/kafka/data/ksql_transient_8376289768731246768_1513675960541-KSTREAM-REDUCE-STATE-STORE-03-changelog-1.a9edc755278d425e9227bb03eb0cd55f-delete/.nfs937861751206a94a0fa2:
 Device or resource busy
{code}
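
To illustrate the failure path (a conceptual sketch of my own, not Kafka's 
actual LogManager/LogDirFailureHandler code): a segment file the filesystem 
refuses to delete surfaces as an IOException, the containing log dir is marked 
failed, and once no live log dir remains the broker shuts down, which matches 
the FATAL "all log dirs ... have failed" line in these logs.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Conceptual sketch only. Class and method names are mine, not Kafka's.
public class LogDirFailureSketch {
    private final Set<Path> liveLogDirs = new HashSet<>();

    public LogDirFailureSketch(Set<Path> logDirs) {
        liveLogDirs.addAll(logDirs);
    }

    void deleteSegment(Path logDir, Path segment) {
        try {
            // EFS/NFS or Windows may refuse: "Device or resource busy"
            Files.delete(segment);
        } catch (IOException e) {
            // The log dir is now considered failed and is taken out of service.
            liveLogDirs.remove(logDir);
            if (liveLogDirs.isEmpty()) {
                // Mirrors: FATAL Shutdown broker because all log dirs have failed
                throw new IllegalStateException("all log dirs have failed", e);
            }
        }
    }
}
{code}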

> Broker fails with FATAL Shutdown - log dirs have failed
> ---
>
> Key: KAFKA-6188
> URL: https://issues.apache.org/jira/browse/KAFKA-6188
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, log
>Affects Versions: 1.0.0
> Environment: Windows 10
>Reporter: Valentina Baljak
>Priority: Blocker
>
> Just started with version 1.0.0 after 4-5 months of using 0.10.2.1. The 
> test environment is very simple, with only one producer and one consumer. 
> Initially everything started fine, and standalone tests worked as expected. 
> However, running my code, Kafka clients fail after approximately 10 minutes. 
> Kafka won't start after that; it fails with the same error. 
> Deleting the logs lets it start again, but the same problem recurs.
> Here is the error traceback:
> [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 30 
> ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of 
> 9223372036854775807 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
> (kafka.network.Acceptor)
> [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor 
> threads (kafka.network.SocketServer)
> [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting 
> (kafka.server.ReplicaManager$LogDirFailureHandler)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving 
> replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions  are 
> offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed 
> fetcher for partitions  (kafka.server.ReplicaFetcherManager)
> [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped 
> fetcher for partitions  because they are in the failed log dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager)
> [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6389) Expose transaction metrics via JMX

2017-12-20 Thread Florent Ramière (JIRA)
Florent Ramière created KAFKA-6389:
--

 Summary: Expose transaction metrics via JMX
 Key: KAFKA-6389
 URL: https://issues.apache.org/jira/browse/KAFKA-6389
 Project: Kafka
  Issue Type: Improvement
  Components: metrics
Affects Versions: 1.0.0
Reporter: Florent Ramière


Expose various metrics from 
https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka
via JMX, especially (a client-side polling sketch follows the list):
* number of transactions
* number of current transactions
* transaction timeouts
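
The sketch below only shows how one could poll a broker's JMX endpoint for 
whatever transaction metrics end up being registered; the object-name pattern 
is hypothetical (this ticket is precisely about deciding which metrics to 
expose), and the JMX port is likewise an assumption.
{code:java}
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TxnMetricsPoller {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX enabled, e.g. JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Hypothetical domain pattern; substitute the names the fix registers.
            Set<ObjectName> names = conn.queryNames(
                    new ObjectName("kafka.coordinator.transaction:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        }
    }
}
{code}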



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298169#comment-16298169
 ] 

ASF GitHub Bot commented on KAFKA-6323:
---

Github user fredfp closed the pull request at:

https://github.com/apache/kafka/pull/4304


> punctuate with WALL_CLOCK_TIME triggered immediately
> 
>
> Key: KAFKA-6323
> URL: https://issues.apache.org/jira/browse/KAFKA-6323
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Frederic Arno
>Assignee: Frederic Arno
> Fix For: 1.1.0, 1.0.1
>
>
> While working on a custom Processor that schedules a punctuation 
> using WALL_CLOCK_TIME, I've noticed that whatever punctuation interval I 
> set, a call to my Punctuator is always triggered immediately.
> Having a quick look at kafka-streams' code, I could find that all 
> PunctuationSchedule's timestamps are matched against the current time in 
> order to decide whether or not to trigger the punctuator 
> (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). 
> However, I've only seen code that initializes PunctuationSchedule's timestamp 
> to 0, which I guess is what is causing an immediate punctuation.
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's 
> timestamp be initialized to current time + interval?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6323) punctuate with WALL_CLOCK_TIME triggered immediately

2017-12-20 Thread Frederic Arno (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298166#comment-16298166
 ] 

Frederic Arno commented on KAFKA-6323:
--

I'm fine with most of that; I only have a doubt about not aligning wall-clock 
punctuations on {{now + N*interval}}, which could effectively let punctuation 
calls drift away. Do you have use cases where spacing punctuations by at least 
the interval is critical and requires that behavior?

I've pushed updated code, in which I do not allow punctuation time drift (this 
makes the behavior more coherent between stream-time and wall-clock-time 
punctuation).

By default, the new code aligns punctuations as discussed above.
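
To make that alignment concrete, here is my own illustration (not the actual 
PunctuationQueue code) of drift-free scheduling: each firing time is derived 
from the original schedule rather than from when the previous punctuation 
actually ran.
{code:java}
// Illustration only; names are mine. Firing times stay on the
// firstTime + N*interval grid no matter how late the previous call ran.
public class PunctuationAlignment {
    static long nextPunctuationTime(long firstTime, long interval, long now) {
        if (now < firstTime) {
            return firstTime;  // the schedule has not started yet
        }
        long elapsedIntervals = (now - firstTime) / interval;
        return firstTime + (elapsedIntervals + 1) * interval;
    }
}
{code}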

I've also added an overload enabling users to choose the first punctuation 
time; that time then becomes the reference on which further 
punctuations are aligned.
{code:java}
public Cancellable schedule(final long startTime, final long interval,
                            final PunctuationType type, final Punctuator punctuator)
{code}

In my use case, I use wall-clock-time punctuation to punctuate every day at 
2am. I would use the new API the following way, allowing me to call 
{{context.schedule()}} once instead of twice as I do currently:
{code:java}
context.schedule(timeUntil2Am, 24 * 60 * 60 * 1000, WALL_CLOCK_TIME,
        (callTime) -> doStuffRightAfter2am(callTime))
{code}
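
For completeness (my addition, not part of the patch): timeUntil2Am above is 
just the wall-clock delta to the next 2am. A minimal way to compute it, 
assuming the JVM default time zone:
{code:java}
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZonedDateTime;

public class Next2Am {
    static long millisUntilNext2Am() {
        ZonedDateTime now = ZonedDateTime.now();
        ZonedDateTime next = now.with(LocalTime.of(2, 0));
        if (!next.isAfter(now)) {
            next = next.plusDays(1);  // 2am already passed today
        }
        return Duration.between(now, next).toMillis();
    }
}
{code}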


> punctuate with WALL_CLOCK_TIME triggered immediately
> 
>
> Key: KAFKA-6323
> URL: https://issues.apache.org/jira/browse/KAFKA-6323
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Frederic Arno
>Assignee: Frederic Arno
> Fix For: 1.1.0, 1.0.1
>
>
> While working on a custom Processor that schedules a punctuation 
> using WALL_CLOCK_TIME, I've noticed that whatever punctuation interval I 
> set, a call to my Punctuator is always triggered immediately.
> Having a quick look at kafka-streams' code, I could find that all 
> PunctuationSchedule's timestamps are matched against the current time in 
> order to decide whether or not to trigger the punctuator 
> (org.apache.kafka.streams.processor.internals.PunctuationQueue#mayPunctuate). 
> However, I've only seen code that initializes PunctuationSchedule's timestamp 
> to 0, which I guess is what is causing an immediate punctuation.
> At least when using WALL_CLOCK_TIME, shouldn't the PunctuationSchedule's 
> timestamp be initialized to current time + interval?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

