[jira] [Commented] (KAFKA-5504) Kafka controller is not getting elected

2017-11-14 Thread Andrey (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251148#comment-16251148
 ] 

Andrey commented on KAFKA-5504:
---

zkCli shows:
- broker 1 is the controller
- several seconds later broker 2 is the controller
- several seconds later broker 1 is the controller or broker 3.

{code}
[zk: host1:2181(CONNECTED) 18] get /kafka-prod/controller
{"version":1,"brokerid":1,"timestamp":"1510651771022"}
{code}

{code}
[zk: host2:2181(CONNECTED) 19] get /kafka-prod/controller
{"version":1,"brokerid":2,"timestamp":"1510651882077"}
{code}

Looks like some kind of double locking error on zk resources. 
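
A minimal sketch (ZooKeeper Java client; ensemble address and /kafka-prod chroot taken 
from the zkCli session above) that prints every change to the controller znode with a 
timestamp, which makes the flapping interval easy to measure:

{code}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class ControllerWatch {
    private static ZooKeeper zk;

    public static void main(String[] args) throws Exception {
        // same ensemble and chroot as in the zkCli session above; for brevity
        // this skips waiting for the initial connected event
        zk = new ZooKeeper("host1:2181/kafka-prod", 30000, e -> {});
        readAndRearm();
        new CountDownLatch(1).await(); // block forever; watches fire on a ZK thread
    }

    // read /controller, print it with a timestamp, and re-register the watch
    private static void readAndRearm() throws Exception {
        byte[] data = zk.getData("/controller", event -> {
            try {
                readAndRearm();
            } catch (Exception ignored) {
                // sketch only: ignores the window where the znode is absent
            }
        }, null);
        System.out.println(System.currentTimeMillis() + " " + new String(data));
    }
}
{code}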

> Kafka controller is not getting elected
> ---
>
> Key: KAFKA-5504
> URL: https://issues.apache.org/jira/browse/KAFKA-5504
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.9.0.1
>Reporter: Ashish Kumar
>
> I have a Kafka cluster of 20 nodes and had been facing an under-replicated 
> topics issue for the last few days, so I decided to restart the broker that 
> was acting as controller. After the restart, all brokers show the logs below 
> (it seems the controller is never finalized and leader election keeps 
> happening continuously):
> [2017-06-23 02:59:50,388] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:50,396] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:50,410] INFO Rolled new log segment for 
> 'dpe_feedback_rating_history-4' in 0 ms. (kafka.log.Log)
> [2017-06-23 02:59:51,585] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,590] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,609] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:52,792] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,799] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,808] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:54,122] INFO New leader is 3 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:55,504] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,512] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,520] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:56,695] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,701] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,709] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:57,949] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,955] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,965] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:59,378] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,384] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,395] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> .
> .
> .
> Tried deleting controller znode (/controller) but no luck. Please let me know 
> if any fix is possible here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (KAFKA-5504) Kafka controller is not getting elected

2017-11-14 Thread Andrey (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey updated KAFKA-5504:
--
Comment: was deleted

(was: zkCli shows:
- broker 1 is the controller
- several seconds later broker 2 is the controller
- several seconds later broker 1 is the controller or broker 3.

{code}
[zk: host1:2181(CONNECTED) 18] get /kafka-prod/controller
{"version":1,"brokerid":1,"timestamp":"1510651771022"}
{code}

{code}
[zk: host2:2181(CONNECTED) 19] get /kafka-prod/controller
{"version":1,"brokerid":2,"timestamp":"1510651882077"}
{code}

Looks like some kind of double locking error on zk resources. )

> Kafka controller is not getting elected
> ---
>
> Key: KAFKA-5504
> URL: https://issues.apache.org/jira/browse/KAFKA-5504
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.9.0.1
>Reporter: Ashish Kumar
>
> I have a Kafka cluster of 20 nodes and had been facing an under-replicated 
> topics issue for the last few days, so I decided to restart the broker that 
> was acting as controller. After the restart, all brokers show the logs below 
> (it seems the controller is never finalized and leader election keeps 
> happening continuously):
> [2017-06-23 02:59:50,388] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:50,396] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:50,410] INFO Rolled new log segment for 
> 'dpe_feedback_rating_history-4' in 0 ms. (kafka.log.Log)
> [2017-06-23 02:59:51,585] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,590] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,609] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:52,792] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,799] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,808] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:54,122] INFO New leader is 3 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:55,504] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,512] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,520] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:56,695] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,701] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,709] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:57,949] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,955] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,965] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:59,378] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,384] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,395] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> .
> .
> .
> Tried deleting controller znode (/controller) but no luck. Please let me know 
> if any fix is possible here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5504) Kafka controller is not getting elected

2017-11-14 Thread Andrey (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251161#comment-16251161
 ] 

Andrey commented on KAFKA-5504:
---

Looks like a configuration issue on our side. Checked *controller.log*:

{code}
[2017-11-14 09:50:37,093] ERROR [Controller 1]: Error while electing or 
becoming controller on broker 1 (kafka.controller.KafkaController)
kafka.admin.AdminOperationException: org.I0Itec.zkclient.exception.ZkException: 
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth 
for /admin/reassign_partitions
at 
kafka.utils.ZkUtils.updatePartitionReassignmentData(ZkUtils.scala:803)
at 
kafka.controller.KafkaController.removePartitionFromReassignedPartitions(KafkaController.scala:933)
at 
kafka.controller.KafkaController$$anonfun$initializePartitionReassignment$1.apply(KafkaController.scala:715)
{code}
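
A NoAuth on /admin/reassign_partitions usually means the znode carries ACLs (for 
example, created while ZK security was enabled) that the broker's current ZooKeeper 
identity no longer satisfies. A minimal sketch to dump the ACLs on the offending znode 
(ZooKeeper Java client; ensemble address and /kafka-prod chroot assumed from the 
earlier zkCli session):

{code}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Stat;

public class CheckReassignAcl {
    public static void main(String[] args) throws Exception {
        // connect with the same ensemble/chroot the brokers use
        ZooKeeper zk = new ZooKeeper("host1:2181/kafka-prod", 30000, e -> {});
        List<ACL> acls = zk.getACL("/admin/reassign_partitions", new Stat());
        for (ACL acl : acls) {
            // a scheme other than "world"/"anyone" explains the NoAuth
            System.out.printf("scheme=%s id=%s perms=%d%n",
                    acl.getId().getScheme(), acl.getId().getId(), acl.getPerms());
        }
        zk.close();
    }
}
{code}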

> Kafka controller is not getting elected
> ---
>
> Key: KAFKA-5504
> URL: https://issues.apache.org/jira/browse/KAFKA-5504
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.9.0.1
>Reporter: Ashish Kumar
>
> I am having a kafka cluster of 20 nodes and I was facing the issue of 
> under-replicated topics issue for last few days so decided to restart the 
> broker which was working as a controller but after restart getting below logs 
> in all the brokers (It seems controller is not finalized and leader election 
> is happening continuously):
> [2017-06-23 02:59:50,388] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:50,396] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:50,410] INFO Rolled new log segment for 
> 'dpe_feedback_rating_history-4' in 0 ms. (kafka.log.Log)
> [2017-06-23 02:59:51,585] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,590] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,609] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:52,792] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,799] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,808] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:54,122] INFO New leader is 3 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:55,504] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,512] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,520] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:56,695] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,701] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,709] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:57,949] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,955] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,965] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:59,378] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,384] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,395] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> .
> .
> .
> Tried deleting controller znode (/controller) but no luck. Please let me know 
> if any fix is possible here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6194) Server crash while deleting segments

2017-11-14 Thread Ben Corlett (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Corlett updated KAFKA-6194:
---
Attachment: server.log.2017-11-14-03.gz

> Server crash while deleting segments
> 
>
> Key: KAFKA-6194
> URL: https://issues.apache.org/jira/browse/KAFKA-6194
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
> Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 
> 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ben Corlett
>  Labels: regression
> Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock 
> from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most 
> likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running 
> one server on a patched version of 1.0 with the pull request from that ticket.
> However, today we have had two different servers fall over for non-heap 
> related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for 
> xx-49 in dir /mnt/secure/kafka/datalog 
> (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving 
> replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 
> 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting 
> segments for xx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 
> .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> .
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition 
> xxx-83 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an 
> offline log directory for partition xxx-83
> [2017-11-09 15:32:05,341] ER

[jira] [Commented] (KAFKA-6194) Server crash while deleting segments

2017-11-14 Thread Ben Corlett (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251371#comment-16251371
 ] 

Ben Corlett commented on KAFKA-6194:


In the past, topics have been deleted and recreated with the same name. 
Unfortunately this resulted in "lingering" topics that had to be manually 
removed from zookeeper. These days we version the topics and create new ones 
with new names. It might still happen, but we do this very rarely.

In the last 24 hours we have had another 15 failures. I've sanitised (renamed 
the topics) the log file from one of them and attached it to the issue.

{code}
[2017-11-14 03:33:30,238] ERROR Uncaught exception in scheduled task 
'delete-file' (kafka.utils.KafkaScheduler)
kafka.common.KafkaStorageException: Delete of log 
.log.deleted failed.
at kafka.log.LogSegment.delete(LogSegment.scala:505)
at kafka.log.Log.deleteSeg$1(Log.scala:1490)
at kafka.log.Log.$anonfun$asyncDeleteSegment$2(Log.scala:1492)
at 
kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}

I had a brief look through the code and I can see what you mean about the 
exception only happening if the delete fails and the file still exists.

As you can see from the logs, we have very aggressive 
rotation/deletion/compaction of 'topic1'. Probably way more than actually 
needed. But that's the devs for you.

I wonder if we have a sequence of two deletes followed by a compaction 
finishing and a new .deleted file appearing, like so:

1. delete the file (succeeds)
2. delete the file (fails)
3. compaction finishes and a new .deleted file is created
4. check file exists... it does ... throw exception.
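
In toy form, the suspected interleaving would look like this (path and offset are 
hypothetical, standing in for the elided segment name):

{code}
import java.io.File;

public class DeleteRaceSketch {
    public static void main(String[] args) {
        // hypothetical segment path, standing in for <offset>.log.deleted
        File seg = new File("/mnt/secure/kafka/datalog/topic1-0/0.log.deleted");

        boolean deleted = seg.delete(); // step 2: returns false, file already gone
        // step 3 happens here on another thread: compaction finishes and
        // creates a brand new *.log.deleted file under the same name
        if (!deleted && seg.exists()) { // step 4: sees the *new* file
            throw new RuntimeException("Delete of log " + seg.getName() + " failed.");
        }
    }
}
{code}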

Another thing. It always seems to be for .log.deleted on 
all the different servers.

Maybe we could enable additional logging.

> Server crash while deleting segments
> 
>
> Key: KAFKA-6194
> URL: https://issues.apache.org/jira/browse/KAFKA-6194
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
> Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 
> 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ben Corlett
>  Labels: regression
> Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock 
> from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most 
> likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running 
> one server on a patched version of 1.0 with the pull request from that ticket.
> However, today we have had two different servers fall over for non-heap 
> related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for 
> xx-49 in dir /mnt/secure/kafka/datalog 
> (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180

[jira] [Commented] (KAFKA-6194) Server crash while deleting segments

2017-11-14 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251382#comment-16251382
 ] 

Ismael Juma commented on KAFKA-6194:


[~corlettb], thanks for the logs. You previously stated:

{code}
min.cleanable.dirty.ratio   0.05
cleanup.policy  compact
retention.ms    1000
delete.retention.ms 60
segment.bytes   500
{code}

Is it right that cleanup.policy is compact? If that is really the case, we 
would not apply retention by size or by time.
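
For what it's worth, if both compaction and time/size-based retention were intended, 
the topic would need {{cleanup.policy=compact,delete}} (supported since 0.10.1). A 
hedged sketch with the Java AdminClient (topic name and bootstrap address hypothetical; 
note that {{alterConfigs}} replaces the topic's whole override set, so any other 
overrides would have to be re-sent as well):

{code}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetCleanupPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "topic1");
            Config update = new Config(Collections.singletonList(
                    new ConfigEntry("cleanup.policy", "compact,delete")));
            // blocks until the config change is applied or fails
            admin.alterConfigs(Collections.singletonMap(topic, update)).all().get();
        }
    }
}
{code}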

> Server crash while deleting segments
> 
>
> Key: KAFKA-6194
> URL: https://issues.apache.org/jira/browse/KAFKA-6194
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
> Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 
> 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ben Corlett
>  Labels: regression
> Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock 
> from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most 
> likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running 
> one server on a patched version of 1.0 with the pull request from that ticket.
> However, today we have had two different servers fall over for non-heap 
> related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for 
> xx-49 in dir /mnt/secure/kafka/datalog 
> (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving 
> replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 
> 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting 
> segments for xx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 
> .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6

[jira] [Commented] (KAFKA-6194) Server crash while deleting segments

2017-11-14 Thread Ben Corlett (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251400#comment-16251400
 ] 

Ben Corlett commented on KAFKA-6194:


Topic1 from that log has these settings:

{code}
min.cleanable.dirty.ratio   0.05
cleanup.policy  compact
retention.ms    1000
delete.retention.ms 60
segment.bytes   500
{code}

Yes. These settings were set by the devs; presumably they were looking for very 
aggressive compaction.

> Server crash while deleting segments
> 
>
> Key: KAFKA-6194
> URL: https://issues.apache.org/jira/browse/KAFKA-6194
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
> Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 
> 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ben Corlett
>  Labels: regression
> Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock 
> from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most 
> likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running 
> one server on a patched version of 1.0 with the pull request from that ticket.
> However, today we have had two different servers fall over for non-heap 
> related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for 
> xx-49 in dir /mnt/secure/kafka/datalog 
> (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving 
> replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 
> 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting 
> segments for xx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 
> .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.deleteSeg$1(Log.scala:1596)
> at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)

[jira] [Commented] (KAFKA-4281) Should be able to forward aggregation values immediately

2017-11-14 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251583#comment-16251583
 ] 

Damian Guy commented on KAFKA-4281:
---

[~mjsax] - yep agreed

> Should be able to forward aggregation values immediately
> 
>
> Key: KAFKA-4281
> URL: https://issues.apache.org/jira/browse/KAFKA-4281
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Greg Fodor
>Assignee: Greg Fodor
>
> KIP-63 introduced changes to the behavior of aggregations such that the 
> result of aggregations will not appear to subsequent processors until a state 
> store flush occurs. This is problematic for latency sensitive aggregations 
> since flushes occur generally at commit.interval.ms, which is usually a few 
> seconds. Combined with several aggregations, this can result in several 
> seconds of latency through a topology for steps dependent upon aggregations.
> Two potential solutions:
> - Allow finer control over the state store flushing intervals
> - Allow users to change the behavior so that certain aggregations will 
> immediately forward records to the next step (as was the case pre-KIP-63)
> A PR is attached that takes the second approach. To add this unfortunately a 
> large number of files needed to be touched, and this effectively doubles the 
> number of method signatures around grouping on KTable and KStream. I tried an 
> alternative approach that let the user opt-in to immediate forwarding via an 
> additional builder method on KGroupedStream/Table but this didn't work as 
> expected because in order for the latency to go away, the KTableImpl itself 
> must also mark its source as forward immediate (otherwise we will still see 
> latency due to the materialization of the KTableSource still relying upon 
> state store flushes to propagate.)
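
For reference, the knob that presumably covers this request today came out of KIP-63 
itself: the record cache size, which can be set to zero to restore per-record 
forwarding. A minimal sketch (application id and broker address hypothetical):

{code}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class NoCachingConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
        // A cache size of 0 disables record caching, so aggregation results
        // are forwarded downstream per record instead of waiting for a
        // cache flush at commit time.
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        return props;
    }
}
{code}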



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4730) Streams does not have an in-memory windowed store

2017-11-14 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251586#comment-16251586
 ] 

Damian Guy commented on KAFKA-4730:
---

[~mjsax] no we still don't have an in-memory window store

> Streams does not have an in-memory windowed store
> -
>
> Key: KAFKA-4730
> URL: https://issues.apache.org/jira/browse/KAFKA-4730
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Affects Versions: 0.10.2.0
>Reporter: Eno Thereska
>Assignee: Nikki Thean
> Fix For: 1.1.0
>
>
> Streams has windowed persistent stores (e.g., see PersistentKeyValueFactory 
> interface with "windowed" method), however it does not allow for windowed 
> in-memory stores (e.g., see InMemoryKeyValueFactory interface). 
> In addition to the interface not allowing it, streams does not actually have 
> an implementation of an in-memory windowed store.
> The implications are that operations that require windowing cannot use 
> in-memory stores. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4729) Stores for kstream-kstream join cannot be in-memory

2017-11-14 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251590#comment-16251590
 ] 

Damian Guy commented on KAFKA-4729:
---

@mjsax no it is still RocksDB only

> Stores for kstream-kstream join cannot be in-memory
> ---
>
> Key: KAFKA-4729
> URL: https://issues.apache.org/jira/browse/KAFKA-4729
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.2.0
>Reporter: Eno Thereska
> Fix For: 1.1.0
>
>
> Whereas we can specify in the DSL that stores used for aggregates can be 
> RocksDB-based or in-memory, we cannot do that for stores used for 
> KStream-KStream joins. E.g., the join() method in KStreamImpl.java creates 
> two state stores and the user does not have the option of having them be 
> in-memory:
> StateStoreSupplier thisWindow =
> createWindowedStateStore(windows, keySerde, lhsValueSerde, 
> joinThisName + "-store");
> StateStoreSupplier otherWindow =
> createWindowedStateStore(windows, keySerde, otherValueSerde, 
> joinOtherName + "-store");
> Part of the problem is that for joins, stores are not exposed to the user. We 
> might want to rethink that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5254) rethink KTable API overloads

2017-11-14 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251595#comment-16251595
 ] 

Damian Guy commented on KAFKA-5254:
---

yep

> rethink KTable API overloads
> 
>
> Key: KAFKA-5254
> URL: https://issues.apache.org/jira/browse/KAFKA-5254
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.11.0.0
>Reporter: Eno Thereska
>  Labels: needs-kip
> Fix For: 1.0.0
>
>
> Currently all the APIs that produce KTables have several overloads (e.g., to 
> take queryableName, state store supplier, serdes, etc). It might be possible 
> to use builder patterns instead of the overloads. This will require a KIP to 
> discuss. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4730) Streams does not have an in-memory windowed store

2017-11-14 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251719#comment-16251719
 ] 

Guozhang Wang commented on KAFKA-4730:
--

I do not think so. Plus, we do not have an in-memory session store either.

> Streams does not have an in-memory windowed store
> -
>
> Key: KAFKA-4730
> URL: https://issues.apache.org/jira/browse/KAFKA-4730
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Affects Versions: 0.10.2.0
>Reporter: Eno Thereska
>Assignee: Nikki Thean
> Fix For: 1.1.0
>
>
> Streams has windowed persistent stores (e.g., see PersistentKeyValueFactory 
> interface with "windowed" method), however it does not allow for windowed 
> in-memory stores (e.g., see InMemoryKeyValueFactory interface). 
> In addition to the interface not allowing it, streams does not actually have 
> an implementation of an in-memory windowed store.
> The implications are that operations that require windowing cannot use 
> in-memory stores. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6207) Include start of record when RecordIsTooLarge

2017-11-14 Thread Tadhg Pearson (JIRA)
Tadhg Pearson created KAFKA-6207:


 Summary: Include start of record when RecordIsTooLarge
 Key: KAFKA-6207
 URL: https://issues.apache.org/jira/browse/KAFKA-6207
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.10.1.1
Reporter: Tadhg Pearson
Priority: Minor


When a message is too large to be sent (at 
org.apache.kafka.clients.producer.KafkaProducer#doSend), the 
RecordTooLargeException should carry the start of the record (for example, the 
first 1KB) so that the calling application can debug which message caused the 
error. 

For example: one common use case of Kafka is logging. The 
RecordTooLargeException is thrown due to a large log message being sent by the 
application. How do you know which statement in your application logged this 
large message? If your application has thousands of logging statements, it will 
be very tough to find which one is the cause today, but if you include the 
start of the message, this could prove a very strong hint as to the cause!
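
Until something like this exists, a possible caller-side workaround (helper name 
hypothetical) is to capture a prefix of the value before sending and log it when the 
failure surfaces; depending on the client version the error arrives via the callback 
or synchronously, so the sketch guards both paths:

{code}
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;

public final class LoggingSender {
    public static void send(Producer<String, String> producer,
                            ProducerRecord<String, String> record) {
        // keep a 1KB prefix so the offending log statement can be identified
        String value = record.value();
        String prefix = value.substring(0, Math.min(1024, value.length()));
        Callback cb = (metadata, exception) -> {
            if (exception instanceof RecordTooLargeException) {
                System.err.println("Record too large; first 1KB: " + prefix);
            }
        };
        try {
            producer.send(record, cb);
        } catch (RecordTooLargeException e) { // defensive: synchronous path
            System.err.println("Record too large; first 1KB: " + prefix);
        }
    }
}
{code}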





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6207) Include start of record when RecordIsTooLarge

2017-11-14 Thread Tadhg Pearson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tadhg Pearson updated KAFKA-6207:
-
Description: 
When a message is too large to be sent (at 
org.apache.kafka.clients.producer.KafkaProducer#doSend), the 
RecordTooLargeException should carry the start of the record (for example, the 
first 1KB) so that the calling application can debug which message caused the 
error. 

For example: one common use case of Kafka is logging. The 
RecordTooLargeException is thrown due to a large log message being sent by the 
application. How do you know which statement in your application logged this 
large message? If your application has thousands of logging statements, it will 
be very tough to find which one is the cause today, but if you include the 
start of the message, this could prove a very strong hint as to the cause!



  was:
When a message is too large to be send (at 
org.apache.kafka.clients.producer.KafkaProducer#doSend), the 
RecordTooLargeException should carry the start of the record (for example, the 
first 1KB) so that the calling application can debug which message caused the 
error. 

For example: one common use case of Kafka is logging. The 
RecordTooLargeException is thrown due to a large log message being sent by the 
application. How do you know which statement in your application logged this 
large message? If your exception has thousands of logging statements, it will 
be very tough to find which one is the cause today but you include the 
start of the message, this could prove a very strong hint as to the cause!




> Include start of record when RecordIsTooLarge
> -
>
> Key: KAFKA-6207
> URL: https://issues.apache.org/jira/browse/KAFKA-6207
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.10.1.1
>Reporter: Tadhg Pearson
>Priority: Minor
>
> When a message is too large to be sent (at 
> org.apache.kafka.clients.producer.KafkaProducer#doSend), the 
> RecordTooLargeException should carry the start of the record (for example, 
> the first 1KB) so that the calling application can debug which message caused 
> the error. 
> For example: one common use case of Kafka is logging. The 
> RecordTooLargeException is thrown due to a large log message being sent by 
> the application. How do you know which statement in your application logged 
> this large message? If your application has thousands of logging statements, 
> it will be very tough to find which one is the cause today, but if you 
> include the start of the message, this could prove a very strong hint as to 
> the cause!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-4281) Should be able to forward aggregation values immediately

2017-11-14 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251742#comment-16251742
 ] 

Matthias J. Sax edited comment on KAFKA-4281 at 11/14/17 5:08 PM:
--

[~gfodor] Can you confirm that your feature request is covered? If yes, would 
you mind closing your PR and closing this ticket? Thx.


was (Author: mjsax):
[~gfodor] Can you confirm that your feature request is covered? If yes, please 
close your PR and resolve this.

> Should be able to forward aggregation values immediately
> 
>
> Key: KAFKA-4281
> URL: https://issues.apache.org/jira/browse/KAFKA-4281
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Greg Fodor
>Assignee: Greg Fodor
>
> KIP-63 introduced changes to the behavior of aggregations such that the 
> result of aggregations will not appear to subsequent processors until a state 
> store flush occurs. This is problematic for latency sensitive aggregations 
> since flushes occur generally at commit.interval.ms, which is usually a few 
> seconds. Combined with several aggregations, this can result in several 
> seconds of latency through a topology for steps dependent upon aggregations.
> Two potential solutions:
> - Allow finer control over the state store flushing intervals
> - Allow users to change the behavior so that certain aggregations will 
> immediately forward records to the next step (as was the case pre-KIP-63)
> A PR is attached that takes the second approach. To add this unfortunately a 
> large number of files needed to be touched, and this effectively doubles the 
> number of method signatures around grouping on KTable and KStream. I tried an 
> alternative approach that let the user opt-in to immediate forwarding via an 
> additional builder method on KGroupedStream/Table but this didn't work as 
> expected because in order for the latency to go away, the KTableImpl itself 
> must also mark its source as forward immediate (otherwise we will still see 
> latency due to the materialization of the KTableSource still relying upon 
> state store flushes to propagate.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4281) Should be able to forward aggregation values immediately

2017-11-14 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251742#comment-16251742
 ] 

Matthias J. Sax commented on KAFKA-4281:


[~gfodor] Can you confirm that your feature request is covered? If yes, please 
close your PR and resolve this.

> Should be able to forward aggregation values immediately
> 
>
> Key: KAFKA-4281
> URL: https://issues.apache.org/jira/browse/KAFKA-4281
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Greg Fodor
>Assignee: Greg Fodor
>
> KIP-63 introduced changes to the behavior of aggregations such that the 
> result of aggregations will not appear to subsequent processors until a state 
> store flush occurs. This is problematic for latency sensitive aggregations 
> since flushes occur generally at commit.interval.ms, which is usually a few 
> seconds. Combined with several aggregations, this can result in several 
> seconds of latency through a topology for steps dependent upon aggregations.
> Two potential solutions:
> - Allow finer control over the state store flushing intervals
> - Allow users to change the behavior so that certain aggregations will 
> immediately forward records to the next step (as was the case pre-KIP-63)
> A PR is attached that takes the second approach. To add this unfortunately a 
> large number of files needed to be touched, and this effectively doubles the 
> number of method signatures around grouping on KTable and KStream. I tried an 
> alternative approach that let the user opt-in to immediate forwarding via an 
> additional builder method on KGroupedStream/Table but this didn't work as 
> expected because in order for the latency to go away, the KTableImpl itself 
> must also mark its source as forward immediate (otherwise we will still see 
> latency due to the materialization of the KTableSource still relying upon 
> state store flushes to propagate.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4281) Should be able to forward aggregation values immediately

2017-11-14 Thread Greg Fodor (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251792#comment-16251792
 ] 

Greg Fodor commented on KAFKA-4281:
---

Sadly I'm no longer on the project using Kafka Streams but happy to close this.

> Should be able to forward aggregation values immediately
> 
>
> Key: KAFKA-4281
> URL: https://issues.apache.org/jira/browse/KAFKA-4281
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Greg Fodor
>Assignee: Greg Fodor
>
> KIP-63 introduced changes to the behavior of aggregations such that the 
> result of aggregations will not appear to subsequent processors until a state 
> store flush occurs. This is problematic for latency sensitive aggregations 
> since flushes occur generally at commit.interval.ms, which is usually a few 
> seconds. Combined with several aggregations, this can result in several 
> seconds of latency through a topology for steps dependent upon aggregations.
> Two potential solutions:
> - Allow finer control over the state store flushing intervals
> - Allow users to change the behavior so that certain aggregations will 
> immediately forward records to the next step (as was the case pre-KIP-63)
> A PR is attached that takes the second approach. To add this unfortunately a 
> large number of files needed to be touched, and this effectively doubles the 
> number of method signatures around grouping on KTable and KStream. I tried an 
> alternative approach that let the user opt-in to immediate forwarding via an 
> additional builder method on KGroupedStream/Table but this didn't work as 
> expected because in order for the latency to go away, the KTableImpl itself 
> must also mark its source as forward immediate (otherwise we will still see 
> latency due to the materialization of the KTableSource still relying upon 
> state store flushes to propagate.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6058) Add "describe consumer groups" and "list consumer groups" to KafkaAdminClient

2017-11-14 Thread Jorge Quilcate (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Quilcate updated KAFKA-6058:
--
Summary: Add "describe consumer groups" and "list consumer groups" to 
KafkaAdminClient  (was: Add "describe consumer group" to KafkaAdminClient)

> Add "describe consumer groups" and "list consumer groups" to KafkaAdminClient
> -
>
> Key: KAFKA-6058
> URL: https://issues.apache.org/jira/browse/KAFKA-6058
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias J. Sax
>Assignee: Jorge Quilcate
>  Labels: needs-kip
>
> {{KafkaAdminClient}} does not allow getting information about consumer groups. 
> This feature is supported by the old {{kafka.admin.AdminClient}}, though.
> We should add {{KafkaAdminClient#describeConsumerGroup()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6058) Add "describe consumer groups" and "list consumer groups" to KafkaAdminClient

2017-11-14 Thread Jorge Quilcate (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Quilcate updated KAFKA-6058:
--
Description: 
{{KafkaAdminClient}} does not allow getting information about consumer groups. 
This feature is supported by the old {{kafka.admin.AdminClient}}, though.

We should add {{KafkaAdminClient#describeConsumerGroups()}} and 
{{KafkaAdminClient#listConsumerGroup()}}.

  was:
{{KafkaAdminClient}} does not allow to get information about consumer groups. 
This feature is supported by old {{kafka.admin.AdminClient}} though.

We should add {{KafkaAdminClient#describeConsumerGroup()}}.


> Add "describe consumer groups" and "list consumer groups" to KafkaAdminClient
> -
>
> Key: KAFKA-6058
> URL: https://issues.apache.org/jira/browse/KAFKA-6058
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias J. Sax
>Assignee: Jorge Quilcate
>  Labels: needs-kip
>
> {{KafkaAdminClient}} does not allow getting information about consumer groups. 
> This feature is supported by the old {{kafka.admin.AdminClient}}, though.
> We should add {{KafkaAdminClient#describeConsumerGroups()}} and 
> {{KafkaAdminClient#listConsumerGroup()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6208) Reduce startup time for Kafka Connect workers

2017-11-14 Thread Randall Hauch (JIRA)
Randall Hauch created KAFKA-6208:


 Summary: Reduce startup time for Kafka Connect workers
 Key: KAFKA-6208
 URL: https://issues.apache.org/jira/browse/KAFKA-6208
 Project: Kafka
  Issue Type: Improvement
  Components: KafkaConnect
Affects Versions: 1.0.0
Reporter: Randall Hauch


Kafka Connect startup times are excessive with a handful of connectors on the 
plugin path or classpath. We should not be scanning three times (once for 
connectors, once for SMTs, and once for converters), and hopefully we can avoid 
scanning directories that are clearly not plugin directories. 

We should also consider using Java's Service Loader to quickly identify 
connectors. The latter would require a KIP and would require time for 
connectors to migrate, but we could be smarter about only scanning plugin 
directories that need to be scanned.
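
A rough sketch of the Service Loader idea (the descriptor path in the comment is the 
standard ServiceLoader convention, not something connectors ship today, which is 
exactly why a KIP and a migration period would be needed):

{code}
import java.util.ServiceLoader;
import org.apache.kafka.connect.connector.Connector;

public class ServiceLoaderScan {
    public static void main(String[] args) {
        // Illustrative only: this would require each connector jar to ship a
        // META-INF/services/org.apache.kafka.connect.connector.Connector
        // descriptor listing its Connector implementations.
        ServiceLoader<Connector> loader = ServiceLoader.load(Connector.class);
        for (Connector connector : loader) {
            System.out.println("found: " + connector.getClass().getName());
        }
    }
}
{code}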



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-4857) Replace StreamsKafkaClient with AdminClient in Kafka Streams

2017-11-14 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-4857:
--

Assignee: Matthias J. Sax  (was: Guozhang Wang)

> Replace StreamsKafkaClient with AdminClient in Kafka Streams
> 
>
> Key: KAFKA-4857
> URL: https://issues.apache.org/jira/browse/KAFKA-4857
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> Streams uses {{KafkaClientSupplier}} to get 
> consumer/restore-consumer/producer clients. Streams also uses one more client 
> for admin purpose namely {{StreamsKafkaClient}} that is instantiated 
> "manually".
> With the newly upcoming {{AdminClient}} from KIP-117, we can simplify (or 
> even replace) {{StreamsKafkaClient}} with the new {{AdminClient}}. We 
> furthermore want to unify how the client is generated and extend 
> {{KafkaClientSupplier}} with a method that returns this client.
> NOTE: The public facing changes are summarized in a separate ticket 
> KAFKA-6170, and this ticket is only for the internal swap, with the accepted 
> criterion being to completely replace StreamsKafkaClient with the newly 
> introduced KafkaAdminClient.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5967) Ineffective check of negative value in CompositeReadOnlyKeyValueStore#approximateNumEntries()

2017-11-14 Thread Randall Hauch (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-5967:
-
Component/s: streams

> Ineffective check of negative value in 
> CompositeReadOnlyKeyValueStore#approximateNumEntries()
> -
>
> Key: KAFKA-5967
> URL: https://issues.apache.org/jira/browse/KAFKA-5967
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.1
>Reporter: Ted Yu
>Assignee: siva santhalingam
>Priority: Minor
>  Labels: beginner, newbie
> Fix For: 1.0.0, 0.11.0.2
>
>
> {code}
> long total = 0;
> for (ReadOnlyKeyValueStore store : stores) {
> total += store.approximateNumEntries();
> }
> return total < 0 ? Long.MAX_VALUE : total;
> {code}
> The check for negative value seems to account for wrapping.
> However, wrapping can happen within the for loop. So the check should be 
> performed inside the loop.
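
A sketch of the fix the description asks for, with the saturation check moved 
inside the loop:

{code}
long total = 0;
for (ReadOnlyKeyValueStore store : stores) {
    total += store.approximateNumEntries();
    if (total < 0) {
        // the sum wrapped past Long.MAX_VALUE inside the loop; saturate now
        return Long.MAX_VALUE;
    }
}
return total;
{code}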



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-6167) Timestamp on streams directory contains a colon, which is an illegal character

2017-11-14 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-6167:
--

Assignee: Matthias J. Sax

> Timestamp on streams directory contains a colon, which is an illegal character
> --
>
> Key: KAFKA-6167
> URL: https://issues.apache.org/jira/browse/KAFKA-6167
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
> Environment: AK 1.0.0
> Kubernetes
> CoreOS
> JDK 1.8
> Windows
>Reporter: Justin Manchester
>Assignee: Matthias J. Sax
>Priority: Blocker
>  Labels: user-experience
> Fix For: 1.0.1
>
>
> Problem:
> Development is on Windows, which is not fully supported; however, this is 
> still a bug that should be corrected.
> It looks like a timestamp was added to the streams directory using a colon 
> as a separator. I believe this is an illegal character and potentially the 
> cause for the exception below.
> Error Stack:
> 2017-11-02 16:06:41 ERROR 
> [StreamDeduplicatorAcceptanceTest1-a3ae0ac6-a024-4006-bcb1-01ff0f433f6e-StreamThread-1]
>  org.apache.kafka.streams.processor.internals.AssignedTasks:301 - 
> stream-thread 
> [StreamDeduplicatorAcceptanceTest1-a3ae0ac6-a024-4006-bcb1-01ff0f433f6e-StreamThread-1]
>  Failed to process stream task 0_0 due to the following error: 
> org.apache.kafka.streams.errors.StreamsException: Exception caught in 
> process. taskId=0_0, processor=KSTREAM-SOURCE-00, topic=input-a_1, 
> partition=0, offset=0 
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:232)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.process(AssignedTasks.java:403)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:317)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:942)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:822)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:774)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:744)
>  [kafka-streams-1.0.0.jar:?] 
> Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error 
> opening store KSTREAM-JOINTHIS-04-store:150962400 at location 
> C:\Users\ADRIAN~1.MCC\AppData\Local\Temp\kafka3548813472740086814\StreamDeduplicatorAcceptanceTest1\0_0\KSTREAM-JOINTHIS-04-store\KSTREAM-JOINTHIS-04-store:150962400
>  
> at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:204)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:174)
>  ~[kafka-streams-1.0.0.jar:?] 
> at org.apache.kafka.streams.state.internals.Segment.openDB(Segment.java:40) 
> ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.Segments.getOrCreateSegment(Segments.java:89)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:81)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBWindowStore$RocksDBWindowBytesStore.put(RocksDBWindowStore.java:43)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBWindowStore$RocksDBWindowBytesStore.put(RocksDBWindowStore.java:34)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.ChangeLoggingWindowBytesStore.put(ChangeLoggingWindowBytesStore.java:67)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.ChangeLoggingWindowBytesStore.put(ChangeLoggingWindowBytesStore.java:33)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.MeteredWindowStore.put(MeteredWindowStore.java:96)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.MeteredWindowStore.put(MeteredWindowStore.java:89)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:64)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:208)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.pr

[jira] [Created] (KAFKA-6209) Lag is inconsistent when manually committing offset for transactional messages

2017-11-14 Thread Fabien Chaillou (JIRA)
Fabien Chaillou created KAFKA-6209:
--

 Summary: Lag is inconsistent when manually committing offset for 
transactional messages
 Key: KAFKA-6209
 URL: https://issues.apache.org/jira/browse/KAFKA-6209
 Project: Kafka
  Issue Type: Improvement
  Components: consumer
Affects Versions: 0.11.0.1
Reporter: Fabien Chaillou


Hi,

I'm using kafka consumers paired with kafka producers to get a transactional 
consume -> produce -> commit flow.
I checked with the kafka-consumer-groups.sh tool to make sure my consumers were 
consuming all the messages, and it appears that they always have a lag of 1.

After more digging, I discovered that Kafka uses control records for 
transaction handling, and those records are not returned to the consumer. So if 
the last record published in the topic is part of a transaction, Kafka will 
add a control record after it; the client code can only commit offsets up to 
that control record, so the lag is computed as 1.
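
To make the arithmetic concrete (the offsets below are made up for 
illustration):
{noformat}
Offsets in the topic-partition:
  0..8  application records (the last batch is part of a transaction)
  9     commit marker (control record, never delivered to the consumer)

Highest offset the client can commit: 9 (up to the control record)
Log end offset (LEO):                 10
Reported lag = LEO - committed offset = 10 - 9 = 1
{noformat}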

It is not really a bug per se, but it is inconsistent behavior, and I had to dig 
into the exactly-once KIP (KIP-98) and the consumer code to figure this out.

I think the issue should at least be documented somewhere, as I'm honestly not 
sure of the proper fix. 

Thanks for your feedback.
Fabien




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6209) Consumer lag is inconsistent when manually committing offset for transactional messages

2017-11-14 Thread Fabien Chaillou (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabien Chaillou updated KAFKA-6209:
---
Summary: Consumer lag is inconsistent when manually committing offset for 
transactional messages  (was: Lag is inconsistent when manually committing 
offset for transactional messages)

> Consumer lag is inconsistent when manually committing offset for 
> transactional messages
> 
>
> Key: KAFKA-6209
> URL: https://issues.apache.org/jira/browse/KAFKA-6209
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Affects Versions: 0.11.0.1
>Reporter: Fabien Chaillou
>
> Hi,
> I'm using kafka consumers paired with kafka producers to get a transactional 
> consume -> produce -> commit flow.
> I checked with the kafka-consumer-groups.sh tool to make sure my consumers 
> were consuming all the messages, and it appears that they always have a lag 
> of 1.
> After more digging, I discovered that Kafka uses control records for 
> transaction handling, and those records are not returned to the consumer. So 
> if the last record published in the topic is part of a transaction, Kafka 
> will add a control record after it; the client code can only commit offsets 
> up to that control record, so the lag is computed as 1.
> It is not really a bug per se, but it is inconsistent behavior, and I had to 
> dig into the exactly-once KIP (KIP-98) and the consumer code to figure this 
> out.
> I think the issue should at least be documented somewhere, as I'm honestly 
> not sure of the proper fix. 
> Thanks for your feedback.
> Fabien



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6209) Lag is inconsistent when manually committing offset for transactional messages

2017-11-14 Thread Fabien Chaillou (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabien Chaillou updated KAFKA-6209:
---
Description: 
Hi,

I'm using kafka consumers paired with kafka producers to get a transactional 
consume -> produce -> commit flow.
I checked with the kafka-consumer-groups.sh tool to make sure my consumers were 
consuming all the messages, and it appears that they always have a lag of 1.

After more digging, I discovered that Kafka uses control records for 
transaction handling, and those records are not returned to the consumer. So if 
the last record published in the topic is part of a transaction, Kafka will 
add a control record after it; the client code can only commit offsets up to 
that control record, so the lag is computed as 1.

It is not really a bug per se, but it is inconsistent behavior, and I had to dig 
into the exactly-once KIP (KIP-98) and the consumer code to figure this out.

I think the issue should at least be documented somewhere, as I'm honestly not 
sure of the proper fix. 

Thanks for your feedback.
Fabien


  was:
Hi,

I'm using kafka consumers paired with kafka producers to get a transactional 
consume -> produce -> commit flow.
I checked with the kafka-consumer-groups.sh tool to make sure my consumers were 
consuming all the messages, and it appears that they always have a lag of 1.

After more digging, I discovered that Kafka uses control records for 
transaction handling, and those records are not returned to the consumer. So if 
the last record published in the topic is part of a transaction, Kafka will 
add a control record after it; the client code can only commit offsets up to 
that control record, so the lag is computed as 1.

It is not really a bug per se but an inconsistent behavior and I add to dig 
into the exactly-once KIP (KIP-98) and the consumer code to figure this out.

I think the issue should at least be documented somewhere, as I'm honestly not 
sure of the proper fix. 

Thanks for your feedback.
Fabien



> Lag is inconsistent when manually committing offset for transactional 
> messages
> ---
>
> Key: KAFKA-6209
> URL: https://issues.apache.org/jira/browse/KAFKA-6209
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Affects Versions: 0.11.0.1
>Reporter: Fabien Chaillou
>
> Hi,
> I'm using kafka consumers paired with kafka producers to get a transactional 
> consume -> produce -> commit flow.
> I checked with the kafka-consumer-groups.sh tool to make sure my consumers 
> were consuming all the messages, and it appears that they always have a lag 
> of 1.
> After more digging, I discovered that Kafka uses control records for 
> transaction handling, and those records are not returned to the consumer. So 
> if the last record published in the topic is part of a transaction, Kafka 
> will add a control record after it; the client code can only commit offsets 
> up to that control record, so the lag is computed as 1.
> It is not really a bug per se, but it is inconsistent behavior, and I had to 
> dig into the exactly-once KIP (KIP-98) and the consumer code to figure this 
> out.
> I think the issue should at least be documented somewhere, as I'm honestly 
> not sure of the proper fix. 
> Thanks for your feedback.
> Fabien



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6167) Timestamp on streams directory contains a colon, which is an illegal character

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252189#comment-16252189
 ] 

ASF GitHub Bot commented on KAFKA-6167:
---

GitHub user mjsax opened a pull request:

https://github.com/apache/kafka/pull/4210

KAFKA-6167: Timestamp on streams directory contains a colon, which is an 
illegal character

 - changed the segment delimiter to "."
 - added an upgrade path
 - added tests for the old and new upgrade paths
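
For illustration, a minimal sketch of the delimiter change (the method below is 
simplified, not the actual Streams code):

{code}
// Segment directories used to embed the segment id with a ':',
// e.g. "my-store:1509624000000", which is illegal in Windows paths;
// the fix switches the delimiter to '.', e.g. "my-store.1509624000000".
static String segmentName(final String storeName, final long segmentId) {
    return storeName + "." + segmentId;
}
{code}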


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/kafka kafka-6167-windows-issue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4210


commit 83b46536787c7efb5e4aa93214fd4da668124aa1
Author: Matthias J. Sax 
Date:   2017-11-14T21:12:50Z

KAFKA-6167: Timestamp on streams directory contains a colon, which is an 
illegal character




> Timestamp on streams directory contains a colon, which is an illegal character
> --
>
> Key: KAFKA-6167
> URL: https://issues.apache.org/jira/browse/KAFKA-6167
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.0
> Environment: AK 1.0.0
> Kubernetes
> CoreOS
> JDK 1.8
> Windows
>Reporter: Justin Manchester
>Assignee: Matthias J. Sax
>Priority: Blocker
>  Labels: user-experience
> Fix For: 1.0.1
>
>
> Problem:
> This occurs during development on Windows, which is not fully supported; 
> however, it is still a bug that should be corrected.
> It looks like a timestamp was added to the streams directory using a colon as 
> the separator. I believe the colon is an illegal character in Windows paths 
> and potentially the cause of the exception below.
> Error Stack:
> 2017-11-02 16:06:41 ERROR 
> [StreamDeduplicatorAcceptanceTest1-a3ae0ac6-a024-4006-bcb1-01ff0f433f6e-StreamThread-1]
>  org.apache.kafka.streams.processor.internals.AssignedTasks:301 - 
> stream-thread 
> [StreamDeduplicatorAcceptanceTest1-a3ae0ac6-a024-4006-bcb1-01ff0f433f6e-StreamThread-1]
>  Failed to process stream task 0_0 due to the following error: 
> org.apache.kafka.streams.errors.StreamsException: Exception caught in 
> process. taskId=0_0, processor=KSTREAM-SOURCE-00, topic=input-a_1, 
> partition=0, offset=0 
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:232)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.process(AssignedTasks.java:403)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:317)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:942)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:822)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:774)
>  [kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:744)
>  [kafka-streams-1.0.0.jar:?] 
> Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error 
> opening store KSTREAM-JOINTHIS-04-store:150962400 at location 
> C:\Users\ADRIAN~1.MCC\AppData\Local\Temp\kafka3548813472740086814\StreamDeduplicatorAcceptanceTest1\0_0\KSTREAM-JOINTHIS-04-store\KSTREAM-JOINTHIS-04-store:150962400
>  
> at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:204)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:174)
>  ~[kafka-streams-1.0.0.jar:?] 
> at org.apache.kafka.streams.state.internals.Segment.openDB(Segment.java:40) 
> ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.Segments.getOrCreateSegment(Segments.java:89)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:81)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBWindowStore$RocksDBWindowBytesStore.put(RocksDBWindowStore.java:43)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.RocksDBWindowStore$RocksDBWindowBytesStore.put(RocksDBWindowStore.java:34)
>  ~[kafka-streams-1.0.0.jar:?] 
> at 
> org.apache.kafka.streams.state.internals.ChangeLoggingWindowBytesStore.put(ChangeLoggingWindowBytesStore.

[jira] [Commented] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-11-14 Thread Prasanna Gautam (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252192#comment-16252192
 ] 

Prasanna Gautam commented on KAFKA-5473:


[~junrao] [~ijuma] I have updated the PR without the config. I don't quite 
follow what you mean by adding it in kafkaController.newSession(), because I 
can't find that method in KafkaController anymore. I'm currently running the 
startup function if the state callback stops returning 
SessionEstablishmentError and if it's not already in a startingUp state.  Did 
you mean a different way to run it?
Also, I have been getting ducktape errors on TravisCI for the tests. I'd assume 
this requires some ducktape tests too?

> handle ZK session expiration properly when a new session can't be established
> -
>
> Key: KAFKA-5473
> URL: https://issues.apache.org/jira/browse/KAFKA-5473
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>Assignee: Prasanna Gautam
> Fix For: 1.1.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic for 
> handling ZK session expiration a bit. If a new ZK session can't be 
> established after session expiration, we just log an error and continue. 
> However, this can leave the broker in a bad state: it's up, but not 
> registered from the controller's perspective. Replicas on this broker may 
> never be in sync.
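> A self-contained sketch of the "retry until a new session exists" idea (every 
> name below is hypothetical; the real fix belongs in the broker's ZooKeeper 
> session handling, not in a class like this):
> {code}
> import java.util.concurrent.TimeUnit;
> 
> public class SessionRetrySketch {
>     private volatile boolean shuttingDown = false;
> 
>     void onSessionExpired() throws InterruptedException {
>         while (!shuttingDown) {
>             try {
>                 reinitializeSession(); // hypothetical: open a new ZK session
>                 registerBroker();      // hypothetical: re-create the broker znode
>                 return;                // broker is visible to the controller again
>             } catch (Exception e) {
>                 // keep retrying instead of logging once and staying unregistered
>                 TimeUnit.SECONDS.sleep(1);
>             }
>         }
>     }
> 
>     void reinitializeSession() { /* placeholder */ }
>     void registerBroker()      { /* placeholder */ }
> }
> {code}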



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6170) Add the AdminClient in Streams' KafkaClientSupplier

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252236#comment-16252236
 ] 

ASF GitHub Bot commented on KAFKA-6170:
---

GitHub user guozhangwang opened a pull request:

https://github.com/apache/kafka/pull/4211

[WIP] KAFKA-6170: Add AdminClient to Streams

1. Add the AdminClient into Kafka Streams; it is shared among all the 
threads.
2. Refactored the mutual dependency between StreamPartitionAssignor / 
StreamThread into TaskManager, as discussed in 
https://github.com/apache/kafka/pull/3624#discussion_r132614639.


### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation 
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guozhangwang/kafka K6170-admin-client

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4211


commit 6c5b20ea34323a101118286e9282568f428b8e34
Author: Guozhang Wang 
Date:   2017-11-07T00:14:53Z

add AdminClient

commit fc908e06d80816db1e28e0f1d05e1d10fa1d0379
Author: Guozhang Wang 
Date:   2017-11-13T22:13:37Z

Merge branch 'trunk' of https://git-wip-us.apache.org/repos/asf/kafka into 
K6170-admin-client

commit d1be566efe65c71c068a6e948c59f7bd980d6bd8
Author: Guozhang Wang 
Date:   2017-11-14T21:41:20Z

refactor thread / assignor dependency

commit d1a778fff0cbaeb8ea00421d89fcd50552b93eba
Author: Guozhang Wang 
Date:   2017-11-14T21:44:09Z

revert TaskManager APIs




> Add the AdminClient in Streams' KafkaClientSupplier
> ---
>
> Key: KAFKA-6170
> URL: https://issues.apache.org/jira/browse/KAFKA-6170
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Guozhang Wang
>
> We will add the Java AdminClient to Kafka Streams in order to replace the 
> internal StreamsKafkaClient. More details can be found in KIP-220 
> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-220%3A+Add+AdminClient+into+Kafka+Streams%27+ClientSupplier)
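> As a rough sketch of where this lands (the {{getAdminClient}} name and 
> signature are assumptions taken from KIP-220, not the merged code), a custom 
> {{KafkaClientSupplier}} would supply the admin client alongside the usual 
> clients:
> {code}
> import java.util.Map;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.Producer;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
> import org.apache.kafka.common.serialization.ByteArraySerializer;
> import org.apache.kafka.streams.KafkaClientSupplier;
> 
> public class SketchClientSupplier implements KafkaClientSupplier {
> 
>     // assumed new method from KIP-220 (no @Override until it is merged)
>     public AdminClient getAdminClient(final Map<String, Object> config) {
>         return AdminClient.create(config);
>     }
> 
>     @Override
>     public Producer<byte[], byte[]> getProducer(final Map<String, Object> config) {
>         return new KafkaProducer<>(config, new ByteArraySerializer(), new ByteArraySerializer());
>     }
> 
>     @Override
>     public Consumer<byte[], byte[]> getConsumer(final Map<String, Object> config) {
>         return new KafkaConsumer<>(config, new ByteArrayDeserializer(), new ByteArrayDeserializer());
>     }
> 
>     @Override
>     public Consumer<byte[], byte[]> getRestoreConsumer(final Map<String, Object> config) {
>         return getConsumer(config);
>     }
> }
> {code}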



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6168) Connect Schema comparison is slow for large schemas

2017-11-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252532#comment-16252532
 ] 

Loïc Monney commented on KAFKA-6168:


Hello [~rhauch],
First, thank you very much for creating this issue and helping us :)
This morning I created an issue in the schema-registry 
(https://github.com/confluentinc/schema-registry/issues/665) that illustrates 
the problems you mentioned above with the maps and the heavy use of 
`hashCode` and `equals`.

> Connect Schema comparison is slow for large schemas
> ---
>
> Key: KAFKA-6168
> URL: https://issues.apache.org/jira/browse/KAFKA-6168
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 1.0.0
>Reporter: Randall Hauch
>Assignee: Ted Yu
>Priority: Critical
> Attachments: 6168.v1.txt
>
>
> The {{ConnectSchema}} implementation computes the hash code every time it's 
> needed, and {{equals(Object)}} is a deep equality check. This extra work can 
> be expensive for large schemas, especially in code like the {{AvroConverter}} 
> (or rather {{AvroData}} in the converter) that uses instances as keys in a 
> hash map, which then requires significant use of {{hashCode}} and {{equals}}.
> The {{ConnectSchema}} is an immutable object and should at a minimum 
> precompute the hash code. Also, the fields compared in {{equals(...)}} should 
> be ordered cheapest-first (currently the {{name}} field is one of the _last_ 
> fields to be checked). Finally, it might be worth considering having each 
> instance precompute and cache a string or byte[] representation of all fields 
> that can be used for faster equality checking.
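> A minimal sketch of the precompute-and-cache idea on a simplified immutable 
> class (illustrative only, not the actual {{ConnectSchema}} code):
> {code}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> import java.util.Objects;
> 
> public final class CachedHashSchema {
>     private final String name;
>     private final List<String> fieldNames;
>     private final int hash; // computed once; safe because the object is immutable
> 
>     public CachedHashSchema(final String name, final List<String> fieldNames) {
>         this.name = name;
>         this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
>         this.hash = Objects.hash(name, this.fieldNames);
>     }
> 
>     @Override
>     public int hashCode() {
>         return hash; // O(1) instead of a deep recompute on every map lookup
>     }
> 
>     @Override
>     public boolean equals(final Object o) {
>         if (this == o) return true;
>         if (!(o instanceof CachedHashSchema)) return false;
>         final CachedHashSchema that = (CachedHashSchema) o;
>         // cheapest comparisons first; deep field comparison last
>         return hash == that.hash
>                 && name.equals(that.name)
>                 && fieldNames.equals(that.fieldNames);
>     }
> }
> {code}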



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4115) Grow default heap settings for distributed Connect from 256M to 1G

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252696#comment-16252696
 ] 

ASF GitHub Bot commented on KAFKA-4115:
---

GitHub user wicknicks opened a pull request:

https://github.com/apache/kafka/pull/4213

KAFKA-4115: Increasing the heap settings for connect-distributed script

Signed-off-by: Arjun Satish 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wicknicks/kafka KAFKA-4115

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4213


commit 51e6404ec090de6d6f8919e3abe78a955ac60d7a
Author: Arjun Satish 
Date:   2017-11-14T23:33:37Z

Increasing the heap settings for connect-distributed script

Signed-off-by: Arjun Satish 




> Grow default heap settings for distributed Connect from 256M to 1G
> --
>
> Key: KAFKA-4115
> URL: https://issues.apache.org/jira/browse/KAFKA-4115
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Shikhar Bhushan
>Assignee: Arjun Satish
>  Labels: newbie
>
> Currently, both {{connect-standalone.sh}} and {{connect-distributed.sh}} 
> start the Connect JVM with the default heap settings from 
> {{kafka-run-class.sh}} of {{-Xmx256M}}.
> At least for distributed connect, we should default to a much higher limit 
> like 1G. While the 'correct' sizing is workload dependent, with a system 
> where you can run arbitrary connector plugins which may perform buffering of 
> data, we should provide for more headroom.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4115) Grow default heap settings for distributed Connect from 256M to 1G

2017-11-14 Thread Randall Hauch (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252713#comment-16252713
 ] 

Randall Hauch commented on KAFKA-4115:
--

The {{kafka-run-class.sh}} script currently [sets the maximum heap 
size](https://github.com/apache/kafka/blob/trunk/bin/kafka-run-class.sh#L214-L216).
 By default the JDK will set the minimum size to the smaller of 1/4 of physical 
memory or 1GB. However, we know that Java ergonomics are not correct in Docker 
images prior to JDK 9 (and in some other cases), so IMO it is worthwhile to 
set the minimum heap size to 256M (as [~wicknicks] did in his PR).
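
For reference, the defaults can already be overridden per run via 
{{KAFKA_HEAP_OPTS}}; the values below are only illustrative:
{noformat}
# hypothetical invocation: 256M floor and 1G ceiling for distributed Connect
KAFKA_HEAP_OPTS="-Xms256M -Xmx1G" bin/connect-distributed.sh config/connect-distributed.properties
{noformat}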

> Grow default heap settings for distributed Connect from 256M to 1G
> --
>
> Key: KAFKA-4115
> URL: https://issues.apache.org/jira/browse/KAFKA-4115
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Shikhar Bhushan
>Assignee: Arjun Satish
>  Labels: newbie
>
> Currently, both {{connect-standalone.sh}} and {{connect-distributed.sh}} 
> start the Connect JVM with the default heap settings from 
> {{kafka-run-class.sh}} of {{-Xmx256M}}.
> At least for distributed connect, we should default to a much higher limit 
> like 1G. While the 'correct' sizing is workload dependent, with a system 
> where you can run arbitrary connector plugins which may perform buffering of 
> data, we should provide for more headroom.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5859) Avoid retaining AbstractRequest in RequestChannel.Request

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252935#comment-16252935
 ] 

ASF GitHub Bot commented on KAFKA-5859:
---

GitHub user seglo opened a pull request:

https://github.com/apache/kafka/pull/4216

KAFKA-5859: Avoid retaining AbstractRequest in RequestChannel.Response

This PR removes the need to keep a reference to the parsed 
`AbstractRequest` after it has been handled in `KafkaApis`.  A reference to 
`RequestAndSize`, which holds the parsed `AbstractRequest` in 
`RequestChannel.Request`, was kept in scope as a consequence of being passed 
into the `RequestChannel.Response` after being handled.  

The Jira ticket 
[KAFKA-5859](https://issues.apache.org/jira/browse/KAFKA-5859) suggests 
removing this reference as soon as it is no longer needed.  I considered several 
implementations and settled on creating a new type that contains all the 
information about the Request that is still required after it has been handled. 
 I think this approach requires the least invasive changes in the 
Request/Response lifecycle while retaining the immutability of 
`RequestChannel.Request`.

A new type called `RequestChannel.RequestSummary` now contains much of the 
information that was in `RequestChannel.Request` before.  The 
`RequestChannel.Request` now generates a `RequestChannel.RequestSummary` that 
is passed into the `RequestChannel.Response` after being handled in 
`KafkaApis`.  `RequestChannel.RequestSummary` contains information such as:

* A detailed and non-detailed description of the request
* Metrics associated with the request
* Helper methods to update various Request metrics
* A flag describing whether or not the original Request was a 
`FetchRequest` and, if so, whether it came from a follower; this information is 
required by the `updateRequestMetrics` helper method (see the sketch below).
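
To illustrate the shape of such a type, here is a Java sketch with invented 
names (the actual type is Scala code inside `RequestChannel`):

{code}
import java.util.Objects;

// Illustrative only; not the PR's actual code.
public final class RequestSummarySketch {
    private final String shortDescription;    // non-detailed, for request logging
    private final String detailedDescription; // detailed, used when debug logging is on
    private final boolean fetchFromFollower;  // special case needed by the metrics helpers

    public RequestSummarySketch(final String shortDescription,
                                final String detailedDescription,
                                final boolean fetchFromFollower) {
        this.shortDescription = Objects.requireNonNull(shortDescription);
        this.detailedDescription = Objects.requireNonNull(detailedDescription);
        this.fetchFromFollower = fetchFromFollower;
    }

    public String description(final boolean detailed) {
        return detailed ? detailedDescription : shortDescription;
    }

    public boolean isFetchFromFollower() {
        return fetchFromFollower;
    }
}
{code}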

This change does not alter any behaviour, so no additional tests were 
added.  I've verified that all unit and integration tests pass and no 
regressions were introduced.  I'm interested in seeing the before and after 
results of the Confluent Kafka system tests as described in step 11 of the 
[Contributing Code 
Changes](https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Code+Changes#ContributingCodeChanges-PullRequest)
 section.  I would like to request access to kick off this system test suite 
if you agree that it's relevant to this ticket.

This is my first contribution to this project.  I picked up this issue 
because it was marked with the newbie flag and it seemed like a good 
opportunity to learn more about the request and response lifecycle in the 
Kafka broker.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation 
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seglo/kafka to-request-summary-5859

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/4216.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4216


commit 9bd67ea7cf16077e20db8e1a87330176eb3772de
Author: seglo 
Date:   2017-11-12T02:42:55Z

Use RequestSummary in RequestChannel.Response




> Avoid retaining AbstractRequest in RequestChannel.Request
> -
>
> Key: KAFKA-5859
> URL: https://issues.apache.org/jira/browse/KAFKA-5859
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Sean Glover
>Priority: Minor
>  Labels: newbie
>
> We currently store AbstractRequest in RequestChannel.Request.bodyAndSize. 
> RequestChannel.Request is, in turn, stored in RequestChannel.Response. We 
> keep the latter until the response is sent to the client.
> However, after KafkaApis.handle, we no longer need AbstractRequest apart from 
> its string representation for logging. We could potentially replace 
> AbstractRequest with a String representation (if the relevant logging is 
> enabled). The String representation is generally small while some 
> AbstractRequest subclasses can be pretty large. The largest one is 
> ProduceRequest and we clear the underlying ByteBuffer explicitly in 
> KafkaApis.handleProduceRequest. We could potentially remove that special case 
> if AbstractRequest subclasses were not retained.
> This was originally suggested by [~hachikuji] in the following PR 
> https://github.com/apache/kafka/pull/3801#discussion_r137592277



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (KAFKA-5859) Avoid retaining AbstractRequest in RequestChannel.Request

2017-11-14 Thread Sean Glover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252936#comment-16252936
 ] 

Sean Glover commented on KAFKA-5859:


I've submitted a PR for this ticket.  See description for more details 
https://github.com/apache/kafka/pull/4216

> Avoid retaining AbstractRequest in RequestChannel.Request
> -
>
> Key: KAFKA-5859
> URL: https://issues.apache.org/jira/browse/KAFKA-5859
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Sean Glover
>Priority: Minor
>  Labels: newbie
>
> We currently store AbstractRequest in RequestChannel.Request.bodyAndSize. 
> RequestChannel.Request is, in turn, stored in RequestChannel.Response. We 
> keep the latter until the response is sent to the client.
> However, after KafkaApis.handle, we no longer need AbstractRequest apart from 
> its string representation for logging. We could potentially replace 
> AbstractRequest with a String representation (if the relevant logging is 
> enabled). The String representation is generally small while some 
> AbstractRequest subclasses can be pretty large. The largest one is 
> ProduceRequest and we clear the underlying ByteBuffer explicitly in 
> KafkaApis.handleProduceRequest. We could potentially remove that special case 
> if AbstractRequest subclasses were not retained.
> This was originally suggested by [~hachikuji] in the following PR 
> https://github.com/apache/kafka/pull/3801#discussion_r137592277



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6185) Selector memory leak with high likelihood of OOM in case of down conversion

2017-11-14 Thread Brett Rann (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253043#comment-16253043
 ] 

Brett Rann commented on KAFKA-6185:
---

Our memory problem persists.

I have a .hprof from -XX:+HeapDumpOnOutOfMemoryError and a manual dump.bin, 
which I could share privately. The brokers are still in an OOMing state if any 
further poking is required. We have jmx.rmi exposed if that's useful.

Steps I took to build and deploy:

{noformat}
 $ gradle

Gradle 4.3


Build time:   2017-10-30 15:43:29 UTC
Revision: c684c202534c4138b51033b52d871939b8d38d72

Groovy:   2.4.12
Ant:  Apache Ant(TM) version 1.9.6 compiled on June 29 2015
JVM:  1.8.0_152 (Oracle Corporation 25.152-b16)
OS:   Mac OS X 10.12.6 x86_64

java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
{noformat}

{noformat}
git clone g...@github.com:apache/kafka.git
git checkout 1.0
git branch -v
* 1.0   1a5a547bb KAFKA-6190: Use consumer.position() instead of 
record.offset() to advance in GlobalKTable restoration to avoid transactional 
control messages
trunk d04daf570 MINOR: Exclude Committer Checklist section from commit message
gradle
./gradlew clean releaseTarGz
{noformat}
I then renamed ./core/build/libs/kafka_2.11-1.0.1-SNAPSHOT.jar to 
kafka_2.11-1.0.0.jar, dropped it into the lib directory where the affected 
Kafka is installed, and restarted Kafka.

Three jmap -histo:live dumps, taken 15s apart:
https://gist.github.com/brettrann/9c68fc4db86946827c8b438c51052f85

jmap -heap
https://gist.github.com/brettrann/e7cb4bbdc79955fa8e11f00aee85e663

jstack -f
https://gist.github.com/brettrann/42b42d2619fbebb85ed07d2bccb17a60

and some pretty graphs

!https://www.evernote.com/l/Ah8RUsJOvl9LFJlW_UfbPqTuultqCW1ph-gB/image.png!

Close-ups of the kafka1 and kafka2 heaps

!https://www.evernote.com/l/Ah-XRMZl6LhDo6U9uIiR_kcvAM2EtxksOngB/image.png!
!https://www.evernote.com/l/Ah9z05JCkS5AzZXHK7fdFvJ9IRY2SsB0mRIB/image.png!

and

!https://www.evernote.com/l/Ah88uDCwOftFwoeZ5HeI_clN7yJuxOJ63uEB/image.png!
!https://www.evernote.com/l/Ah_7VEduE1RLX6hxLXfMaYp4eMwuZjHAoIwB/image.png!

> Selector memory leak with high likelihood of OOM in case of down conversion
> ---
>
> Key: KAFKA-6185
> URL: https://issues.apache.org/jira/browse/KAFKA-6185
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0
> Environment: Ubuntu 14.04.5 LTS
> 5 brokers: 1&2 on 1.0.0 3,4,5 on 0.11.0.1
> inter.broker.protocol.version=0.11.0.1
> log.message.format.version=0.11.0.1
> clients a mix of 0.9, 0.10, 0.11
>Reporter: Brett Rann
>Assignee: Rajini Sivaram
>Priority: Blocker
>  Labels: regression
> Fix For: 1.1.0, 1.0.1
>
> Attachments: Kafka_Internals___Datadog.png, 
> Kafka_Internals___Datadog.png
>
>
> We are testing 1.0.0 in a couple of environments.
> Both have about 5 brokers, with two 1.0.0 brokers and the rest 0.11.0.1 
> brokers.
> One is using on disk message format 0.9.0.1, the other 0.11.0.1
> we have 0.9, 0.10, and 0.11 clients connecting.
> The cluster on the 0.9.0.1 format has been running fine for a week.
> But the cluster on the 0.11.0.1 format is consistently having memory issues, 
> though only on the two upgraded brokers running 1.0.0.
> The first occurrence of the error comes along with this stack trace
> {noformat}
> {"timestamp":"2017-11-06 
> 14:22:32,402","level":"ERROR","logger":"kafka.server.KafkaApis","thread":"kafka-request-handler-7","message":"[KafkaApi-1]
>  Error when handling request 
> {replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=maxwell.users,partitions=[{partition=0,fetch_offset=227537,max_bytes=1100},{partition=4,fetch_offset=354468,max_bytes=1100},{partition=5,fetch_offset=266524,max_bytes=1100},{partition=8,fetch_offset=324562,max_bytes=1100},{partition=10,fetch_offset=292931,max_bytes=1100},{partition=12,fetch_offset=325718,max_bytes=1100},{partition=15,fetch_offset=229036,max_bytes=1100}]}]}"}
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at 
> org.apache.kafka.common.record.AbstractRecords.downConvert(AbstractRecords.java:101)
> at 
> org.apache.kafka.common.record.FileRecords.downConvert(FileRecords.java:253)
> at 
> kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1$$anonfun$apply$4.apply(KafkaApis.scala:520)
> at 
> kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1

[jira] [Commented] (KAFKA-5504) Kafka controller is not getting elected

2017-11-14 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253048#comment-16253048
 ] 

Umesh Chaudhary commented on KAFKA-5504:


Similar to KAFKA-4277

> Kafka controller is not getting elected
> ---
>
> Key: KAFKA-5504
> URL: https://issues.apache.org/jira/browse/KAFKA-5504
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.9.0.1
>Reporter: Ashish Kumar
>
> I have a kafka cluster of 20 nodes and had been facing an under-replicated 
> topics issue for the last few days, so I decided to restart the broker that 
> was acting as the controller. But after the restart I am getting the logs 
> below in all the brokers (it seems the controller is not finalized and leader 
> election is happening continuously):
> [2017-06-23 02:59:50,388] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:50,396] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:50,410] INFO Rolled new log segment for 
> 'dpe_feedback_rating_history-4' in 0 ms. (kafka.log.Log)
> [2017-06-23 02:59:51,585] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,590] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:51,609] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:52,792] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,799] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:52,808] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:54,122] INFO New leader is 3 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:55,504] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,512] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:55,520] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:56,695] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,701] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:56,709] INFO New leader is 11 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:57,949] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,955] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:57,965] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2017-06-23 02:59:59,378] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,384] INFO Result of znode creation is: NODEEXISTS 
> (kafka.utils.ZKCheckedEphemeral)
> [2017-06-23 02:59:59,395] INFO New leader is 12 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> .
> .
> .
> I tried deleting the controller znode (/controller), but no luck. Please let 
> me know if any fix is possible here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)