[jira] [Resolved] (KAFKA-5098) KafkaProducer.send() blocks and generates TimeoutException if topic name has illegal char

2018-07-17 Thread Joel Koshy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-5098.
---
Resolution: Fixed

> KafkaProducer.send() blocks and generates TimeoutException if topic name has 
> illegal char
> -
>
> Key: KAFKA-5098
> URL: https://issues.apache.org/jira/browse/KAFKA-5098
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.10.2.0
> Environment: Java client running against server using 
> wurstmeister/kafka Docker image.
>Reporter: Jeff Larsen
>Assignee: Ahmed Al-Mehdi
>Priority: Major
> Fix For: 2.1.0
>
>
> The server is running with auto create enabled. If we try to publish to a 
> topic with a forward slash in the name, the call blocks and we get a 
> TimeoutException in the Callback. I would expect it to return immediately 
> with an InvalidTopicException.
> There are other blocking issues that have been reported which may be related 
> to some degree, but this particular cause seems unrelated.
> Sample code:
> {code}
> import org.apache.kafka.clients.producer.*;
> import java.util.*;
> public class KafkaProducerUnexpectedBlockingAndTimeoutException {
>   public static void main(String[] args) {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "kafka.example.com:9092");
>     props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>     props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>     props.put("max.block.ms", 10 * 1000); // 10 seconds should illustrate our point
>     String separator = "/";
>     //String separator = "_";
>     try (Producer<String, String> producer = new KafkaProducer<>(props)) {
>       System.out.println("Calling KafkaProducer.send() at " + new Date());
>       producer.send(
>           new ProducerRecord<>("abc" + separator + "someStreamName",
>               "Not expecting a TimeoutException here"),
>           new Callback() {
>             @Override
>             public void onCompletion(RecordMetadata metadata, Exception e) {
>               if (e != null) {
>                 System.out.println(e.toString());
>               }
>             }
>           });
>       System.out.println("KafkaProducer.send() completed at " + new Date());
>     }
>   }
> }
> {code}
> Switching to the underscore separator in the above example works as expected.
> Mea culpa: We neglected to research allowed chars in a topic name, but the 
> TimeoutException we encountered did not help point us in the right direction.
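For illustration, a caller-side guard can fail fast before send() ever blocks. The sketch below is an assumption-laden helper (the class name is hypothetical); it encodes the documented legal topic character set (ASCII alphanumerics, '.', '_', '-', max length 249) rather than anything exported by the client API.

{code}
import java.util.regex.Pattern;

// Hypothetical client-side helper: reject topic names the broker would refuse,
// so the producer can fail immediately instead of waiting out max.block.ms.
public final class TopicNameValidator {
  // Assumption: legal topic characters are ASCII alphanumerics, '.', '_' and '-',
  // with a maximum length of 249; "." and ".." are reserved.
  private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]{1,249}");

  public static void validate(String topic) {
    if (topic == null || !LEGAL.matcher(topic).matches()
        || topic.equals(".") || topic.equals("..")) {
      throw new IllegalArgumentException("Illegal topic name: " + topic);
    }
  }
}
{code}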



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5011) Replica fetchers may need to down-convert messages during a selective message format upgrade

2017-04-11 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964994#comment-15964994
 ] 

Joel Koshy commented on KAFKA-5011:
---

Yes that is correct - it is rare. I think it's reasonable to close this as 
won't fix. Not sure if we need to mention it in docs given that it is extremely 
rare.

> Replica fetchers may need to down-convert messages during a selective message 
> format upgrade
> 
>
> Key: KAFKA-5011
> URL: https://issues.apache.org/jira/browse/KAFKA-5011
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Jiangjie Qin
> Fix For: 0.11.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KAFKA-5011) Replica fetchers may need to down-convert messages during a selective message format upgrade

2017-04-04 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-5011:
-

 Summary: Replica fetchers may need to down-convert messages during 
a selective message format upgrade
 Key: KAFKA-5011
 URL: https://issues.apache.org/jira/browse/KAFKA-5011
 Project: Kafka
  Issue Type: Bug
Reporter: Joel Koshy
 Fix For: 0.11.0.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KAFKA-4250) make ProducerRecord and ConsumerRecord extensible

2016-11-30 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-4250.
---
Resolution: Fixed

> make ProducerRecord and ConsumerRecord extensible
> -
>
> Key: KAFKA-4250
> URL: https://issues.apache.org/jira/browse/KAFKA-4250
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
> Fix For: 0.10.2.0
>
>
> KafkaProducer and KafkaConsumer implement interfaces that are designed to be 
> extensible (or at least allow it).
> ProducerRecord and ConsumerRecord, however, are final, making extending 
> producer/consumer very difficult.
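Before this change, the usual workaround was composition rather than inheritance. A minimal sketch (the class and field names are illustrative, not from the Kafka code base):

{code}
import org.apache.kafka.clients.producer.ProducerRecord;

// Wrap a ProducerRecord and carry extra application-level fields alongside it,
// since the pre-fix final class cannot be subclassed.
public final class AnnotatedRecord<K, V> {
  private final ProducerRecord<K, V> record;
  private final String traceId;   // example of an extra, application-level field

  public AnnotatedRecord(ProducerRecord<K, V> record, String traceId) {
    this.record = record;
    this.traceId = traceId;
  }

  public ProducerRecord<K, V> record() { return record; }
  public String traceId() { return traceId; }
}
{code}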



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (KAFKA-4250) make ProducerRecord and ConsumerRecord extensible

2016-11-30 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy reopened KAFKA-4250:
---

cc [~becket_qin] [~nsolis] [~radai]

> make ProducerRecord and ConsumerRecord extensible
> -
>
> Key: KAFKA-4250
> URL: https://issues.apache.org/jira/browse/KAFKA-4250
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
> Fix For: 0.10.1.1, 0.10.2.0
>
>
> KafkaProducer and KafkaConsumer implement interfaces that are designed to be 
> extensible (or at least allow it).
> ProducerRecord and ConsumerRecord, however, are final, making extending 
> producer/consumer very difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4250) make ProducerRecord and ConsumerRecord extensible

2016-11-30 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-4250:
--
Fix Version/s: 0.10.2.0

> make ProducerRecord and ConsumerRecord extensible
> -
>
> Key: KAFKA-4250
> URL: https://issues.apache.org/jira/browse/KAFKA-4250
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
> Fix For: 0.10.1.1, 0.10.2.0
>
>
> KafkaProducer and KafkaConsumer implement interfaces that are designed to be 
> extensible (or at least allow it).
> ProducerRecord and ConsumerRecord, however, are final, making extending 
> producer/consumer very difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-1911) Log deletion on stopping replicas should be async

2016-11-30 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-1911:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Log deletion on stopping replicas should be async
> -
>
> Key: KAFKA-1911
> URL: https://issues.apache.org/jira/browse/KAFKA-1911
> Project: Kafka
>  Issue Type: Bug
>  Components: log, replication
>Reporter: Joel Koshy
>Assignee: Mayuresh Gharat
>  Labels: newbie++, newbiee
>
> If a StopReplicaRequest sets delete=true then we do a file.delete on the file 
> message sets. I was under the impression that this is fast but it does not 
> seem to be the case.
> On a partition reassignment in our cluster the local time for stop replica 
> took nearly 30 seconds.
> {noformat}
> Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; 
> ClientId: ;DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 
> 53 from 
> client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
> {noformat}
> This ties up one API thread for the duration of the request.
> Specifically in our case, the queue times for other requests also went up and 
> producers to the partition that was just deleted on the old leader took a 
> while to refresh their metadata (see KAFKA-1303) and eventually ran out of 
> retries on some messages leading to data loss.
> I think the log deletion in this case should be fully asynchronous although 
> we need to handle the case when a broker may respond immediately to the 
> stop-replica-request but then go down after deleting only some of the log 
> segments.
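A minimal sketch of the "respond fast, delete later" idea described above, under the assumption that the log directory is first renamed out of the way and then removed on a background thread. This is an illustration only, not the Kafka implementation:

{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rename the log directory synchronously (cheap), then delete its contents on a
// background thread so the request-handler thread is not tied up for the duration.
public final class AsyncLogDeleter {
  private final ExecutorService deleteExecutor = Executors.newSingleThreadExecutor();

  public void scheduleDelete(File logDir) throws IOException {
    File renamed = new File(logDir.getParent(), logDir.getName() + ".deleted");
    Files.move(logDir.toPath(), renamed.toPath());   // fast rename, not a data copy
    deleteExecutor.submit(() -> deleteRecursively(renamed));
  }

  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) deleteRecursively(child);
    }
    f.delete();
  }
}
{code}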



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4409) ZK consumer shutdown/topic event deadlock

2016-11-14 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-4409:
-

 Summary: ZK consumer shutdown/topic event deadlock
 Key: KAFKA-4409
 URL: https://issues.apache.org/jira/browse/KAFKA-4409
 Project: Kafka
  Issue Type: Bug
Reporter: Joel Koshy


This only applies to the old zookeeper consumer. It is trivial enough to fix.

The consumer can deadlock on shutdown if a topic event fires during shutdown. 
The shutdown acquires the rebalance lock and then the topic-event-watcher lock. 
The topic event watcher acquires these in the reverse order. Shutdown should 
not need to acquire the topic-event-watcher’s lock - all it does is 
unsubscribe from topic events.

Stack trace:
{noformat}
"mirrormaker-thread-0":
at 
kafka.consumer.ZookeeperTopicEventWatcher.shutdown(ZookeeperTopicEventWatcher.scala:50)
- waiting to lock <0x00072a65d508> (a java.lang.Object)
at 
kafka.consumer.ZookeeperConsumerConnector.shutdown(ZookeeperConsumerConnector.scala:216)
- locked <0x0007103c69c0> (a java.lang.Object)
at 
kafka.tools.MirrorMaker$MirrorMakerOldConsumer.cleanup(MirrorMaker.scala:519)
at 
kafka.tools.MirrorMaker$MirrorMakerThread$$anonfun$run$3.apply$mcV$sp(MirrorMaker.scala:441)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:76)
at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:47)
at kafka.utils.Logging$class.swallow(Logging.scala:94)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:47)
at kafka.tools.MirrorMaker$MirrorMakerThread.run(MirrorMaker.scala:441)
"ZkClient-EventThread-58-":
at 
kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:639)
- waiting to lock <0x0007103c69c0> (a java.lang.Object)
at 
kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:982)
at 
kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.handleTopicEvent(ZookeeperConsumerConnector.scala:1048)
at 
kafka.consumer.ZookeeperTopicEventWatcher$ZkTopicEventListener.liftedTree1$1(ZookeeperTopicEventWatcher.scala:69)
at 
kafka.consumer.ZookeeperTopicEventWatcher$ZkTopicEventListener.handleChildChange(ZookeeperTopicEventWatcher.scala:65)
- locked <0x00072a65d508> (a java.lang.Object)
at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
Found one Java-level deadlock:
=
"mirrormaker-thread-0":
  waiting to lock monitor 0x7f1f38029748 (object 0x00072a65d508, a 
java.lang.Object),
  which is held by "ZkClient-EventThread-58-"
"ZkClient-EventThread-58-":
  waiting to lock monitor 0x7f1e900249a8 (object 0x0007103c69c0, a 
java.lang.Object),
  which is held by "mirrormaker-thread-0"
{noformat}
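For readers unfamiliar with the pattern, here is a minimal, self-contained Java illustration of the inconsistent lock ordering described above; it intentionally deadlocks when run and is not the Kafka consumer code itself:

{code}
// Thread A mimics shutdown, thread B mimics the topic-event watcher; each grabs
// the two monitors in opposite order, producing the deadlock in the stack trace.
public class LockOrderingDeadlockDemo {
  private static final Object rebalanceLock = new Object();
  private static final Object topicWatcherLock = new Object();

  public static void main(String[] args) {
    Thread shutdown = new Thread(() -> {
      synchronized (rebalanceLock) {            // shutdown: rebalance lock first
        sleep(100);
        synchronized (topicWatcherLock) { }     // ...then watcher lock -> deadlock
      }
    });
    Thread topicEvent = new Thread(() -> {
      synchronized (topicWatcherLock) {         // watcher: watcher lock first
        sleep(100);
        synchronized (rebalanceLock) { }        // ...then rebalance lock -> deadlock
      }
    });
    shutdown.start();
    topicEvent.start();
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}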



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4362) Consumer can fail after reassignment of the offsets topic partition

2016-11-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645583#comment-15645583
 ] 

Joel Koshy commented on KAFKA-4362:
---

In this specific issue, the coordinator is available, but has moved to another 
broker. The client isn't informed of this movement though since it gets an 
unknown error (-1). With the Java client such errors should be automatically 
handled - i.e., rediscover the coordinator.
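Until the client handles this automatically, an application-level workaround is to drop the consumer on an unexpected commit error and build a new one, which forces a fresh coordinator lookup. A sketch under that assumption (the factory is hypothetical and the caller must re-subscribe):

{code}
import java.util.function.Supplier;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.KafkaException;

// Workaround sketch, not the client-side fix discussed above: on an unexpected
// commit failure, close the consumer and create a new one so that the group
// coordinator is rediscovered on the next poll.
public final class CommitWithRecovery {
  public static KafkaConsumer<String, String> commitOrRecreate(
      KafkaConsumer<String, String> consumer,
      Supplier<KafkaConsumer<String, String>> createConsumer) {
    try {
      consumer.commitSync();
      return consumer;
    } catch (KafkaException e) {
      consumer.close();
      return createConsumer.get();  // caller re-subscribes/assigns as needed
    }
  }
}
{code}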

> Consumer can fail after reassignment of the offsets topic partition
> ---
>
> Key: KAFKA-4362
> URL: https://issues.apache.org/jira/browse/KAFKA-4362
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Joel Koshy
>Assignee: Mayuresh Gharat
>
> When a consumer offsets topic partition reassignment completes, an offset 
> commit shows this:
> {code}
> java.lang.IllegalArgumentException: Message format version for partition 100 
> not found
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at scala.Option.getOrElse(Option.scala:120) ~[scala-library-2.10.4.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
>  ~[kafka_2.10.jar:?]
> at 
> ...
> {code}
> The issue is that the replica has been deleted so the 
> {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}} throws this 
> exception instead which propagates as an unknown error.
> Unfortunately consumers don't respond to this and will fail their offset 
> commits.
> One workaround in the above situation is to bounce the cluster - the consumer 
> will be forced to rediscover the group coordinator.
> (Incidentally, the message incorrectly prints the number of partitions 
> instead of the actual partition.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4362) Consumer can fail after reassignment of the offsets topic partition

2016-11-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645427#comment-15645427
 ] 

Joel Koshy commented on KAFKA-4362:
---

Sorry - missed this comment. It does not recover because the exception is 
thrown before the point the broker determines that it is no longer the 
coordinator.

> Consumer can fail after reassignment of the offsets topic partition
> ---
>
> Key: KAFKA-4362
> URL: https://issues.apache.org/jira/browse/KAFKA-4362
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Joel Koshy
>Assignee: Mayuresh Gharat
>
> When a consumer offsets topic partition reassignment completes, an offset 
> commit shows this:
> {code}
> java.lang.IllegalArgumentException: Message format version for partition 100 
> not found
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at scala.Option.getOrElse(Option.scala:120) ~[scala-library-2.10.4.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
>  ~[kafka_2.10.jar:?]
> at 
> ...
> {code}
> The issue is that the replica has been deleted so the 
> {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}} throws this 
> exception instead which propagates as an unknown error.
> Unfortunately consumers don't respond to this and will fail their offset 
> commits.
> One workaround in the above situation is to bounce the cluster - the consumer 
> will be forced to rediscover the group coordinator.
> (Incidentally, the message incorrectly prints the number of partitions 
> instead of the actual partition.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4381) Add per partition lag metric to KafkaConsumer.

2016-11-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645419#comment-15645419
 ] 

Joel Koshy commented on KAFKA-4381:
---

This is up to you but I definitely agree with Jason that it's overkill to have 
KIPs for this level of improvement. Configs are similar as well unless they are 
very nuanced such as timeouts. The PR itself should be sufficient to serve as a 
forum to discuss any concerns.

Metric name changes are different since presumably people are already 
monitoring those metrics - such changes could deserve a KIP or even just an 
email heads-up.

> Add per partition lag metric to KafkaConsumer.
> --
>
> Key: KAFKA-4381
> URL: https://issues.apache.org/jira/browse/KAFKA-4381
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Affects Versions: 0.10.1.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.2.0
>
>
> Currently KafkaConsumer only has a metric of max lag across all the 
> partitions. It would be useful to know per partition lag as well.
> I remember there was a ticket created before but did not find it. So I am 
> creating this ticket.
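For context, the existing aggregate metric can be read programmatically today; the sketch below assumes the metric is named "records-lag-max", as in current Java clients, and simply scans the consumer's metric map:

{code}
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

// Read the consumer's aggregate max-lag metric (per-partition lag is what this
// ticket proposes to add).
public final class LagMetricReader {
  public static Double maxLag(KafkaConsumer<?, ?> consumer) {
    for (Map.Entry<MetricName, ? extends Metric> e : consumer.metrics().entrySet()) {
      if (e.getKey().name().equals("records-lag-max")) {
        return e.getValue().value();   // NaN until the first fetch completes
      }
    }
    return null;
  }
}
{code}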



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4362) Consumer can fail after reassignment of the offsets topic partition

2016-11-01 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-4362:
--
Summary: Consumer can fail after reassignment of the offsets topic 
partition  (was: Offset commits fail after a partition reassignment)

Yes definitely an issue, so I'm updating the title. The reassignment of the 
offsets topic will perpetually cause offset commits to fail. A new consumer 
joining the group will talk to the new coordinator and incorrectly form an 
isolated group. Any rebalance of the remaining instances of the actual group 
(which are still talking to the old coordinator) can hit this error and die:

{code}
[2016-11-01 15:37:56,120] WARN Auto offset commit failed for group testgroup: 
Unexpected error in commit: The server experienced an unexpected error when 
processing the request 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
...


...

[2016-11-01 15:37:56,120] INFO Revoking previously assigned partitions 
[testtopic-0, testtopic-1] for group testgroup 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2016-11-01 15:37:56,120] INFO (Re-)joining group testgroup 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2016-11-01 15:37:56,124] ERROR Error processing message, terminating consumer 
process:  (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The 
server experienced an unexpected error when processing the request
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:518)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:485)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:742)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:722)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:186)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:149)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:116)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:479)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:316)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:256)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:180)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:308)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:100)
at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
{code}

> Consumer can fail after reassignment of the offsets topic partition
> ---
>
> Key: KAFKA-4362
> URL: https://issues.apache.org/jira/browse/KAFKA-4362
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Joel Koshy
>Assignee: Jiangjie Qin
>
> When a consumer offsets topic partition reassignment completes, an offset 
> commit shows this:
> {code}
> java.lang.IllegalArgumentException: Message format version for partition 100 
> not found
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at scala.Option.getOrElse(Option.scala:120) ~[scala-library-2.10.4.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
>  ~[kafka_2.10.jar:?]
> at 
> ...
> {code}

[jira] [Commented] (KAFKA-4362) Offset commits fail after a partition reassignment

2016-10-31 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624520#comment-15624520
 ] 

Joel Koshy commented on KAFKA-4362:
---

Btw, the summary doesn't make it clear that this also affects operations such 
as sync-group/join-group in the new consumer.
I glanced through the new consumer code's handling of unknown errors. 
Specifically, we will need to rediscover the coordinator to recover from this. 
It does not appear to do this, but I will double-check tomorrow.

> Offset commits fail after a partition reassignment
> --
>
> Key: KAFKA-4362
> URL: https://issues.apache.org/jira/browse/KAFKA-4362
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Joel Koshy
>Assignee: Jiangjie Qin
>
> When a consumer offsets topic partition reassignment completes, an offset 
> commit shows this:
> {code}
> java.lang.IllegalArgumentException: Message format version for partition 100 
> not found
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>  ~[kafka_2.10.jar:?]
> at scala.Option.getOrElse(Option.scala:120) ~[scala-library-2.10.4.jar:?]
> at 
> kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
>  ~[kafka_2.10.jar:?]
> at 
> ...
> {code}
> The issue is that the replica has been deleted so the 
> {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}} throws this 
> exception instead which propagates as an unknown error.
> Unfortunately consumers don't respond to this and will fail their offset 
> commits.
> One workaround in the above situation is to bounce the cluster - the consumer 
> will be forced to rediscover the group coordinator.
> (Incidentally, the message incorrectly prints the number of partitions 
> instead of the actual partition.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4362) Offset commits fail after a partition reassignment

2016-10-31 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-4362:
-

 Summary: Offset commits fail after a partition reassignment
 Key: KAFKA-4362
 URL: https://issues.apache.org/jira/browse/KAFKA-4362
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.10.1.0
Reporter: Joel Koshy
Assignee: Jiangjie Qin


When a consumer offsets topic partition reassignment completes, an offset 
commit shows this:

{code}
java.lang.IllegalArgumentException: Message format version for partition 100 
not found
at 
kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
 ~[kafka_2.10.jar:?]
at 
kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
 ~[kafka_2.10.jar:?]
at scala.Option.getOrElse(Option.scala:120) ~[scala-library-2.10.4.jar:?]
at 
kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
 ~[kafka_2.10.jar:?]
at 
...
{code}

The issue is that the replica has been deleted so the 
{{GroupMetadataManager.getMessageFormatVersionAndTimestamp}} throws this 
exception instead which propagates as an unknown error.

Unfortunately consumers don't respond to this and will fail their offset 
commits.

One workaround in the above situation is to bounce the cluster - the consumer 
will be forced to rediscover the group coordinator.

(Incidentally, the message incorrectly prints the number of partitions instead 
of the actual partition.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-4250) make ProducerRecord and ConsumerRecord extensible

2016-10-19 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-4250.
---
   Resolution: Fixed
 Assignee: radai rosenblatt
 Reviewer: Joel Koshy
Fix Version/s: 0.10.2.0

> make ProducerRecord and ConsumerRecord extensible
> -
>
> Key: KAFKA-4250
> URL: https://issues.apache.org/jira/browse/KAFKA-4250
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
> Fix For: 0.10.2.0
>
>
> KafkaProducer and KafkaConsumer implement interfaces that are designed to be 
> extensible (or at least allow it).
> ProducerRecord and ConsumerRecord, however, are final, making extending 
> producer/consumer very difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1351) String.format is very expensive in Scala

2016-10-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583851#comment-15583851
 ] 

Joel Koshy commented on KAFKA-1351:
---

Yes - this is still an issue. cc [~nsolis]

> String.format is very expensive in Scala
> 
>
> Key: KAFKA-1351
> URL: https://issues.apache.org/jira/browse/KAFKA-1351
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.7.2, 0.8.0, 0.8.1
>Reporter: Neha Narkhede
>  Labels: newbie
> Attachments: KAFKA-1351.patch, KAFKA-1351_2014-04-07_18:02:18.patch, 
> KAFKA-1351_2014-04-09_15:40:11.patch
>
>
> As found in KAFKA-1350, logging is causing significant overhead in the 
> performance of a Kafka server. There are several info statements that use 
> String.format which is particularly expensive. We should investigate adding 
> our own version of String.format that merely uses string concatenation under 
> the covers.
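As a minimal sketch of the concatenation-based idea in the description, here is a placeholder formatter that avoids String.format's parsing and locale machinery. The "{}" placeholder convention and the class name are assumptions for illustration; this is not the helper that ended up in the Kafka code base:

{code}
// Replaces each "{}" in the template with the next argument, left to right,
// using only StringBuilder appends.
public final class SimpleFormat {
  public static String format(String template, Object... args) {
    StringBuilder sb = new StringBuilder(template.length() + 16 * args.length);
    int start = 0;
    for (Object arg : args) {
      int idx = template.indexOf("{}", start);
      if (idx < 0) break;               // more args than placeholders: stop substituting
      sb.append(template, start, idx).append(arg);
      start = idx + 2;
    }
    return sb.append(template, start, template.length()).toString();
  }
}
{code}

Example: SimpleFormat.format("Fetched {} bytes from broker {}", bytes, brokerId).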



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-4025) build fails on windows due to rat target output encoding

2016-10-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-4025.
---
   Resolution: Fixed
 Assignee: radai rosenblatt
 Reviewer: Joel Koshy
Fix Version/s: 0.10.1.1

> build fails on windows due to rat target output encoding
> 
>
> Key: KAFKA-4025
> URL: https://issues.apache.org/jira/browse/KAFKA-4025
> Project: Kafka
>  Issue Type: Bug
> Environment: windows 7, either regular command prompt or git bash
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
>Priority: Minor
> Fix For: 0.10.1.1
>
> Attachments: windows build debug output.txt
>
>
> kafka runs a rat report during the build, using [the rat ant report 
> task|http://creadur.apache.org/rat/apache-rat-tasks/report.html], which has 
> no output encoding parameter.
> this means that the resulting xml report is produced using the system-default 
> encoding, which is OS-dependent:
> the rat ant task code instantiates the output writer like so 
> ([org.apache.rat.anttasks.Report.java|http://svn.apache.org/repos/asf/creadur/rat/tags/apache-rat-project-0.11/apache-rat-tasks/src/main/java/org/apache/rat/anttasks/Report.java]
>  line 196):
> {noformat}
> out = new PrintWriter(new FileWriter(reportFile));{noformat}
> which eventually leads to {{Charset.defaultCharset()}}, which relies on the 
> file.encoding system property. This causes an issue if the default encoding 
> isn't UTF-8 (which it isn't on Windows) as the code called by 
> printUnknownFiles() in rat.gradle defaults to UTF-8 when reading the report 
> xml, causing the build to fail with:
> {noformat}
> com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: 
> Invalid byte 1 of 1-byte UTF-8 sequence.{noformat}
> (see complete output of {{gradlew --debug --stacktrace rat}} in attached file)
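The encoding-safe alternative to "new PrintWriter(new FileWriter(f))" is to pin the output charset explicitly. A standalone illustration (the actual fix lives in the rat task / rat.gradle, not in a class like this):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Write the report with an explicit UTF-8 charset so it reads back the same on any OS.
public final class Utf8ReportWriter {
  public static PrintWriter open(File reportFile) throws IOException {
    return new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(reportFile), StandardCharsets.UTF_8));
  }
}
{code}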



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4293) ByteBufferMessageSet.deepIterator burns CPU catching EOFExceptions

2016-10-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-4293:
--
Assignee: radai rosenblatt

It turns out we should be able to handle all of our current codecs by 
re-implementing the {{available()}} method correctly. We would still want to 
continue to catch EOF as a safety net for any future codecs we may add.
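As a rough illustration of that idea (not the committed patch), a wrapper around the decompressed stream can buffer one byte ahead so that available() reliably answers "is there at least one more byte?", letting the iteration stop without an EOFException on every message set:

{code}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: available() returns 1 while at least one byte remains, 0 at EOF.
public final class PeekableInputStream extends FilterInputStream {
  private int peeked = -2;  // -2 = nothing buffered, -1 = EOF, otherwise a buffered byte

  public PeekableInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int available() throws IOException {
    if (peeked == -2) {
      peeked = super.read();            // buffer one byte ahead
    }
    return peeked == -1 ? 0 : 1;
  }

  @Override
  public int read() throws IOException {
    if (peeked != -2) {
      int b = peeked;
      peeked = -2;
      return b;
    }
    return super.read();
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) return 0;
    if (peeked == -1) return -1;
    if (peeked >= 0) {                  // hand back the buffered byte first
      b[off] = (byte) peeked;
      peeked = -2;
      int n = super.read(b, off + 1, len - 1);
      return n < 0 ? 1 : n + 1;
    }
    return super.read(b, off, len);
  }
}
{code}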

> ByteBufferMessageSet.deepIterator burns CPU catching EOFExceptions
> --
>
> Key: KAFKA-4293
> URL: https://issues.apache.org/jira/browse/KAFKA-4293
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
>
> around line 110:
> {noformat}
> try {
>   while (true)
>     innerMessageAndOffsets.add(readMessageFromStream(compressed))
> } catch {
>   case eofe: EOFException =>
>     // we don't do anything at all here, because the finally
>     // will close the compressed input stream, and we simply
>     // want to return the innerMessageAndOffsets
> {noformat}
> the only indication the code has that the end of the iteration was reached is 
> by catching EOFException (which will be thrown inside 
> readMessageFromStream()).
> profiling runs performed at LinkedIn show 10% of the total broker CPU time 
> taken up by Throwable.fillInStackTrace() because of this behaviour.
> unfortunately InputStream.available() cannot be relied upon (concrete example 
> - GZIPInputStream will not correctly return 0) so the fix would probably be a 
> wire format change to also encode the number of messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-3175) topic not accessible after deletion even when delete.topic.enable is disabled

2016-10-10 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-3175:
--
Resolution: Fixed
  Reviewer: Joel Koshy
Status: Resolved  (was: Patch Available)

> topic not accessible after deletion even when delete.topic.enable is disabled
> -
>
> Key: KAFKA-3175
> URL: https://issues.apache.org/jira/browse/KAFKA-3175
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>Assignee: Mayuresh Gharat
> Fix For: 0.10.1.1
>
>
> This can be reproduced with the following steps.
> 1. start ZK and 1 broker (with default delete.topic.enable=false)
> 2. create a topic test
> bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic test 
> --partition 1 --replication-factor 1
> 3. delete topic test
> bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic test
> 4. restart the broker
> Now topic test still shows up during topic description.
> bin/kafka-topics.sh --zookeeper localhost:2181 --describe
> Topic:testPartitionCount:1ReplicationFactor:1 Configs:
>   Topic: test Partition: 0Leader: 0   Replicas: 0 Isr: 0
> However, one can't produce to this topic any more.
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> [2016-01-29 17:55:24,527] WARN Error while fetching metadata with correlation 
> id 0 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2016-01-29 17:55:24,725] WARN Error while fetching metadata with correlation 
> id 1 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
> [2016-01-29 17:55:24,828] WARN Error while fetching metadata with correlation 
> id 2 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4254) Questionable handling of unknown partitions in KafkaProducer

2016-10-06 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553228#comment-15553228
 ] 

Joel Koshy commented on KAFKA-4254:
---

It seems (2) would still help as there are use-cases which set {{max.block.ms}} 
to zero. So we can refresh metadata but also return a more specific exception.

> Questionable handling of unknown partitions in KafkaProducer
> 
>
> Key: KAFKA-4254
> URL: https://issues.apache.org/jira/browse/KAFKA-4254
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Jason Gustafson
>Assignee: Konstantine Karantasis
> Fix For: 0.10.1.1
>
>
> Currently the producer will raise an {{IllegalArgumentException}} if the user 
> attempts to write to a partition which has just been created. This is caused 
> by the fact that the producer does not attempt to refetch topic metadata in 
> this case, which means that its check for partition validity is based on 
> stale metadata.
> If the topic for the partition did not already exist, it works fine because 
> the producer will block until it has metadata for the topic, so this case is 
> primarily hit when the number of partitions is dynamically increased. 
> A couple options to fix this that come to mind:
> 1. We could treat unknown partitions just as we do unknown topics. If the 
> partition doesn't exist, we refetch metadata and try again (timing out when 
> max.block.ms is reached).
> 2. We can at least throw a more specific exception so that users can handle 
> the error. Raising {{IllegalArgumentException}} is not helpful in practice 
> because it can also be caused by other errors.
> My inclination is to do the first one since the producer seems incorrect to 
> tell the user that the partition is invalid.
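Until option 1 above lands, a caller-side workaround is to wait until the producer's view of the topic includes the new partition before sending to it (the client refreshes metadata periodically per metadata.max.age.ms). A sketch under those assumptions; the class and method names are illustrative:

{code}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Poll the producer's partition list until the target partition is visible, then send.
public final class SendToNewPartition {
  public static void sendWhenVisible(KafkaProducer<String, String> producer,
                                     String topic, int partition, String value)
      throws InterruptedException {
    while (producer.partitionsFor(topic).size() <= partition) {
      Thread.sleep(100);   // wait for a metadata refresh to pick up the new partition
    }
    producer.send(new ProducerRecord<>(topic, partition, null, value));
  }
}
{code}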



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4207) Partitions stopped after a rapid restart of a broker

2016-09-26 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524443#comment-15524443
 ] 

Joel Koshy commented on KAFKA-4207:
---

I have a KIP draft that has been sitting around for a while. I should be able 
to clean that up and send it out within the next week or so.

> Partitions stopped after a rapid restart of a broker
> 
>
> Key: KAFKA-4207
> URL: https://issues.apache.org/jira/browse/KAFKA-4207
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.9.0.1, 0.10.0.1
>Reporter: Dustin Cote
>
> Environment:
> 4 Kafka brokers
> 10,000 topics with one partition each, replication factor 3
> Partitions with 4KB data each
> No data being produced or consumed
> Scenario:
> Initiate controlled shutdown on one broker
> Interrupt controlled shutdown prior completion with a SIGKILL
> Start a new broker with the same broker ID as broker that was just killed 
> immediately
> Symptoms:
> After starting the new broker, the other three brokers in the cluster will 
> see under replicated partitions forever for some partitions that are hosted 
> on the broker that was killed and restarted
> Cause:
> Today, the controller sends a StopReplica command for each replica hosted on 
> a broker that has initiated a controlled shutdown.  For a large number of 
> replicas this can take a while.  When the broker that is doing the controlled 
> shutdown is killed, the StopReplica commands are queued up even though the 
> request queue to the broker is cleared.  When the broker comes back online, 
> the StopReplica commands that were queued, get sent to the broker that just 
> started up.  
> CC: [~junrao] since he's familiar with the scenario seen here



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4178) Replication Throttling: Consolidate Rate Classes

2016-09-21 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511604#comment-15511604
 ] 

Joel Koshy commented on KAFKA-4178:
---

[~junrao] sorry I don't remember, but you might :) It appears the change was a 
follow-up to one of your [review comments on 
KAFKA-2084|https://reviews.apache.org/r/33049/#comment150934]. Let me know if 
you need help with revisiting that discussion - it has been a while since I 
have looked at that code.

> Replication Throttling: Consolidate Rate Classes
> 
>
> Key: KAFKA-4178
> URL: https://issues.apache.org/jira/browse/KAFKA-4178
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>
> Replication throttling is using a different implementation of Rate to client 
> throttling (Rate & SimpleRate). These should be consolidated so both use the 
> same approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-4158) Reset quota to default value if quota override is deleted for a given clientId

2016-09-13 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-4158.
---
Resolution: Fixed

Pushed to trunk.
Not sure if we need this in 0.9 since there is an easy work-around for it 
(i.e., override to the default and then delete) and we are going to release 
0.10.1 soon.

> Reset quota to default value if quota override is deleted for a given clientId
> --
>
> Key: KAFKA-4158
> URL: https://issues.apache.org/jira/browse/KAFKA-4158
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Critical
> Fix For: 0.10.1.0
>
>
> If a user sets a quota override and then deletes that override, Kafka will 
> still use the overridden quota. This is wrong: Kafka should fall back to the 
> default quota once the override is deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4158) Reset quota to default value if quota override is deleted for a given clientId

2016-09-13 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-4158:
--
Fix Version/s: 0.10.1.0

> Reset quota to default value if quota override is deleted for a given clientId
> --
>
> Key: KAFKA-4158
> URL: https://issues.apache.org/jira/browse/KAFKA-4158
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Critical
> Fix For: 0.10.1.0
>
>
> If a user sets a quota override and then deletes that override, Kafka will 
> still use the overridden quota. This is wrong: Kafka should fall back to the 
> default quota once the override is deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4074) Deleting a topic can make it unavailable even if delete.topic.enable is false

2016-09-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471686#comment-15471686
 ] 

Joel Koshy commented on KAFKA-4074:
---

Yes - totally missed KAFKA-3175

> Deleting a topic can make it unavailable even if delete.topic.enable is false
> -
>
> Key: KAFKA-4074
> URL: https://issues.apache.org/jira/browse/KAFKA-4074
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: Joel Koshy
>Assignee: Manikumar Reddy
> Fix For: 0.10.1.0
>
>
> The {{delete.topic.enable}} configuration does not completely block the 
> effects of delete topic since the controller may (indirectly) query the list 
> of topics under the delete-topic znode.
> To reproduce:
> * Delete topic X
> * Force a controller move (either by bouncing or removing the controller 
> znode)
> * The new controller will send out UpdateMetadataRequests with leader=-2 for 
> the partitions of X
> * Producers eventually stop producing to that topic
> The reason for this is that when ControllerChannelManager adds 
> UpdateMetadataRequests for brokers, we directly use the partitionsToBeDeleted 
> field of the DeleteTopicManager (which is set to the partitions of the topics 
> under the delete-topic znode on controller startup).
> In order to get out of the situation you have to remove X from the znode and 
> then force another controller move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4074) Deleting a topic can make it unavailable even if delete.topic.enable is false

2016-08-22 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-4074:
-

 Summary: Deleting a topic can make it unavailable even if 
delete.topic.enable is false
 Key: KAFKA-4074
 URL: https://issues.apache.org/jira/browse/KAFKA-4074
 Project: Kafka
  Issue Type: Bug
  Components: controller
Reporter: Joel Koshy
 Fix For: 0.10.1.0


The {{delete.topic.enable}} configuration does not completely block the effects 
of delete topic since the controller may (indirectly) query the list of topics 
under the delete-topic znode.

To reproduce:
* Delete topic X
* Force a controller move (either by bouncing or removing the controller znode)
* The new controller will send out UpdateMetadataRequests with leader=-2 for 
the partitions of X
* Producers eventually stop producing to that topic

The reason for this is that when ControllerChannelManager adds 
UpdateMetadataRequests for brokers, we directly use the partitionsToBeDeleted 
field of the DeleteTopicManager (which is set to the partitions of the topics 
under the delete-topic znode on controller startup).

In order to get out of the situation you have to remove X from the znode and 
then force another controller move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-4050) Allow configuration of the PRNG used for SSL

2016-08-19 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-4050.
---
   Resolution: Fixed
Fix Version/s: 0.10.0.2
   0.10.1.0

Pushed to trunk and 0.10.0

> Allow configuration of the PRNG used for SSL
> 
>
> Key: KAFKA-4050
> URL: https://issues.apache.org/jira/browse/KAFKA-4050
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 0.10.0.1
>Reporter: Todd Palino
>Assignee: Todd Palino
>  Labels: security, ssl
> Fix For: 0.10.1.0, 0.10.0.2
>
>
> This change will make the pseudo-random number generator (PRNG) 
> implementation used by the SSLContext configurable. The configuration is not 
> required, and the default is to use whatever the default PRNG for the JDK/JRE 
> is. Providing a string, such as "SHA1PRNG", will cause that specific 
> SecureRandom implementation to get passed to the SSLContext.
> When enabling inter-broker SSL in our certification cluster, we observed 
> severe performance issues. For reference, this cluster can take up to 600 
> MB/sec of inbound produce traffic over SSL, with RF=2, before it gets close 
> to saturation, and the mirror maker normally produces about 400 MB/sec 
> (unless it is lagging). When we enabled inter-broker SSL, we saw persistent 
> replication problems in the cluster at any inbound rate of more than about 6 
> or 7 MB/sec per-broker. This was narrowed down to all the network threads 
> blocking on a single lock in the SecureRandom code.
> It turns out that the default PRNG implementation on Linux is NativePRNG. 
> This uses randomness from /dev/urandom (which, by itself, is a non-blocking 
> read) and mixes it with randomness from SHA1. The problem is that the entire 
> application shares a single SecureRandom instance, and NativePRNG has a 
> global lock within the implNextBytes method. Switching to another 
> implementation (SHA1PRNG, which has better performance characteristics and is 
> still considered secure) completely eliminated the bottleneck and allowed the 
> cluster to work properly at saturation.
> The SSLContext initialization has an optional argument to provide a 
> SecureRandom instance, which the code currently sets to null. This change 
> creates a new config to specify an implementation, and instantiates that and 
> passes it to SSLContext if provided. This will also let someone select a 
> stronger source of randomness (obviously at a performance cost) if desired.
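A standalone JSSE illustration of what the new config enables: pass a specific SecureRandom implementation (for example "SHA1PRNG") to SSLContext.init() instead of null, so TLS record nonces do not all funnel through the NativePRNG global lock. This is an API sketch, not the Kafka config plumbing itself:

{code}
import java.security.SecureRandom;
import javax.net.ssl.SSLContext;

public final class SslContextWithPrng {
  public static SSLContext build(String prngImplementation) throws Exception {
    SecureRandom random = null;
    if (prngImplementation != null) {
      random = SecureRandom.getInstance(prngImplementation); // e.g. "SHA1PRNG"
    }
    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(null, null, random);  // default key/trust managers, explicit PRNG
    return ctx;
  }
}
{code}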



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4050) Allow configuration of the PRNG used for SSL

2016-08-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425906#comment-15425906
 ] 

Joel Koshy commented on KAFKA-4050:
---

[~toddpalino] I had left a comment about this on the PR - one option is to 
default to SHA1PRNG and fall-back to null on NoSuchAlgorithmException

> Allow configuration of the PRNG used for SSL
> 
>
> Key: KAFKA-4050
> URL: https://issues.apache.org/jira/browse/KAFKA-4050
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 0.10.0.1
>Reporter: Todd Palino
>Assignee: Todd Palino
>  Labels: security, ssl
>
> This change will make the pseudo-random number generator (PRNG) 
> implementation used by the SSLContext configurable. The configuration is not 
> required, and the default is to use whatever the default PRNG for the JDK/JRE 
> is. Providing a string, such as "SHA1PRNG", will cause that specific 
> SecureRandom implementation to get passed to the SSLContext.
> When enabling inter-broker SSL in our certification cluster, we observed 
> severe performance issues. For reference, this cluster can take up to 600 
> MB/sec of inbound produce traffic over SSL, with RF=2, before it gets close 
> to saturation, and the mirror maker normally produces about 400 MB/sec 
> (unless it is lagging). When we enabled inter-broker SSL, we saw persistent 
> replication problems in the cluster at any inbound rate of more than about 6 
> or 7 MB/sec per-broker. This was narrowed down to all the network threads 
> blocking on a single lock in the SecureRandom code.
> It turns out that the default PRNG implementation on Linux is NativePRNG. 
> This uses randomness from /dev/urandom (which, by itself, is a non-blocking 
> read) and mixes it with randomness from SHA1. The problem is that the entire 
> application shares a single SecureRandom instance, and NativePRNG has a 
> global lock within the implNextBytes method. Switching to another 
> implementation (SHA1PRNG, which has better performance characteristics and is 
> still considered secure) completely eliminated the bottleneck and allowed the 
> cluster to work properly at saturation.
> The SSLContext initialization has an optional argument to provide a 
> SecureRandom instance, which the code currently sets to null. This change 
> creates a new config to specify an implementation, and instantiates that and 
> passes it to SSLContext if provided. This will also let someone select a 
> stronger source of randomness (obviously at a performance cost) if desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4050) Allow configuration of the PRNG used for SSL

2016-08-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423839#comment-15423839
 ] 

Joel Koshy commented on KAFKA-4050:
---

A stack trace should help further clarify. (This is from a thread dump that 
Todd shared with us offline). Thanks [~toddpalino] and [~mgharat] for finding 
this.

{noformat}
"kafka-network-thread-1393-SSL-30" #114 prio=5 os_prio=0 tid=0x7f2ec8c30800 
nid=0x5c1e waiting for monitor entry [0x7f213b8f9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
sun.security.provider.NativePRNG$RandomIO.implNextBytes(NativePRNG.java:481)
- waiting to lock <0x000641508bf8> (a java.lang.Object)
at 
sun.security.provider.NativePRNG$RandomIO.access$400(NativePRNG.java:329)
at sun.security.provider.NativePRNG.engineNextBytes(NativePRNG.java:218)
at java.security.SecureRandom.nextBytes(SecureRandom.java:468)
- locked <0x00066aad9880> (a java.security.SecureRandom)
at sun.security.ssl.CipherBox.createExplicitNonce(CipherBox.java:1015)
at 
sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:287)
at 
sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:225)
at sun.security.ssl.EngineWriter.writeRecord(EngineWriter.java:186)
- locked <0x000671c5c978> (a sun.security.ssl.EngineWriter)
at sun.security.ssl.SSLEngineImpl.writeRecord(SSLEngineImpl.java:1300)
at 
sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1271)
- locked <0x000671ce7170> (a java.lang.Object)
at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
- locked <0x000671ce7150> (a java.lang.Object)
at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
at org.apache.kafka.common.network.SslTransportLayer.write(SslTransportLayer.java:557)
at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:146)
at org.apache.kafka.common.network.MultiSend.writeTo(MultiSend.java:81)
at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:292)
at 
org.apache.kafka.common.network.KafkaChannel.send(KafkaChannel.java:158)
at 
org.apache.kafka.common.network.KafkaChannel.write(KafkaChannel.java:146)
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:329)
at org.apache.kafka.common.network.Selector.poll(Selector.java:283)
at kafka.network.Processor.poll(SocketServer.scala:472)
at kafka.network.Processor.run(SocketServer.scala:412)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Of note is that all of the network threads are waiting on the same NativePRNG 
lock (0x000641508bf8).

> Allow configuration of the PRNG used for SSL
> 
>
> Key: KAFKA-4050
> URL: https://issues.apache.org/jira/browse/KAFKA-4050
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 0.10.0.1
>Reporter: Todd Palino
>Assignee: Todd Palino
>  Labels: security, ssl
>
> This change will make the pseudo-random number generator (PRNG) 
> implementation used by the SSLContext configurable. The configuration is not 
> required, and the default is to use whatever the default PRNG for the JDK/JRE 
> is. Providing a string, such as "SHA1PRNG", will cause that specific 
> SecureRandom implementation to get passed to the SSLContext.
> When enabling inter-broker SSL in our certification cluster, we observed 
> severe performance issues. For reference, this cluster can take up to 600 
> MB/sec of inbound produce traffic over SSL, with RF=2, before it gets close 
> to saturation, and the mirror maker normally produces about 400 MB/sec 
> (unless it is lagging). When we enabled inter-broker SSL, we saw persistent 
> replication problems in the cluster at any inbound rate of more than about 6 
> or 7 MB/sec per-broker. This was narrowed down to all the network threads 
> blocking on a single lock in the SecureRandom code.
> It turns out that the default PRNG implementation on Linux is NativePRNG. 
> This uses randomness from /dev/urandom (which, by itself, is a non-blocking 
> read) and mixes it with randomness from SHA1. The problem is that the entire 
> application shares a single SecureRandom instance, and NativePRNG has a 
> global lock within the implNextBytes method. Switching to another 
> implementation (SHA1PRNG, which has better performance characteristics and is 
> still considered secure) completely eliminated the bottleneck and allowed the 
> cluster to work properly at saturation.
> The SSLContext initialization has an optional argument to provide a 
> SecureRandom instance, which the code currently sets to null. This change 
> creates a new config to specify an implementation, and instantiates that and 
> passes it to SSLContext if provided. This will also let someone select a 
> stronger source of randomness (obviously at a performance cost) if desired.

[jira] [Commented] (KAFKA-3494) mbeans overwritten with identical clients on a single jvm

2016-04-06 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228733#comment-15228733
 ] 

Joel Koshy commented on KAFKA-3494:
---

I'm not sure that the second suggested solution in the description is viable 
for general usage since that could defeat the purpose of quotas - unless the 
broker is also equipped to somehow treat the unique client-ids as a single 
client-id. We could do a sequence-based metrics suffix for quota sensors but 
that would be a little difficult for users to benefit from in practice - since 
they would have to figure out which sequence number is associated with which 
actual client. It may be clearer to just have an explicit (new) metrics suffix 
config - I think we have collisions with more than just quota mbeans.

> mbeans overwritten with identical clients on a single jvm
> -
>
> Key: KAFKA-3494
> URL: https://issues.apache.org/jira/browse/KAFKA-3494
> Project: Kafka
>  Issue Type: Bug
>Reporter: Onur Karaman
>
> Quotas today are implemented on a (client-id, broker) granularity. I think 
> one of the motivating factors for using a simple quota id was to allow for 
> flexibility in the granularity of the quota enforcement. For instance, entire 
> services can share the same id to get some form of (service, broker) 
> granularity quotas. From my understanding, client-id was chosen as the quota 
> id because it's a property that already exists on the clients and reusing it 
> had relatively low impact.
> Continuing the above example, let's say a service spins up multiple 
> KafkaConsumers in one jvm sharing the same client-id because they want those 
> consumers to be quota'd as a single entity. Sharing client-ids within a single 
> jvm would cause problems in client metrics since the mbeans tags only go as 
> granular as the client-id.
> An easy example is kafka-metrics count. Here's a sample code snippet:
> {code}
> package org.apache.kafka.clients.consumer;
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.common.TopicPartition;
> public class KafkaConsumerMetrics {
>     public static void main(String[] args) throws InterruptedException {
>         Properties properties = new Properties();
>         properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "test");
>         properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>             "org.apache.kafka.common.serialization.StringDeserializer");
>         properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>             "org.apache.kafka.common.serialization.StringDeserializer");
>         properties.setProperty(ConsumerConfig.CLIENT_ID_CONFIG, "testclientid");
>         KafkaConsumer<String, String> kc1 = new KafkaConsumer<>(properties);
>         KafkaConsumer<String, String> kc2 = new KafkaConsumer<>(properties);
>         kc1.assign(Collections.singletonList(new TopicPartition("t1", 0)));
>         while (true) {
>             kc1.poll(1000);
>             System.out.println("kc1 metrics: " + kc1.metrics().size());
>             System.out.println("kc2 metrics: " + kc2.metrics().size());
>             Thread.sleep(1000);
>         }
>     }
> }
> {code}
> jconsole shows one mbean 
> kafka.consumer:type=kafka-metrics-count,client-id=testclientid consistently 
> with value 40.
> but stdout shows:
> {code}
> kc1 metrics: 69
> kc2 metrics: 40
> {code}
> I think the two possible solutions are:
> 1. add finer granularity to the mbeans to distinguish between the clients
> 2. require client ids to be unique per jvm like KafkaStreams has done



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-3234) Minor documentation edits: clarify minISR; some topic-level configs are missing

2016-03-19 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-3234:
--
Fix Version/s: (was: 0.10.0.0)
   0.10.0.1

> Minor documentation edits: clarify minISR; some topic-level configs are 
> missing
> ---
>
> Key: KAFKA-3234
> URL: https://issues.apache.org/jira/browse/KAFKA-3234
> Project: Kafka
>  Issue Type: Improvement
>  Components: website
>Reporter: Joel Koshy
>Assignee: Joel Koshy
> Fix For: 0.10.0.1
>
>
> Based on an offline conversation with [~junrao] and [~gwenshap]
> The current documentation is somewhat confusing on minISR in that it says 
> that it offers a trade-off between consistency and availability. From the 
> user's view-point, consistency (at least in the usual sense of the term) is 
> achieved by disabling unclean leader election - since no replica that was out 
> of ISR can be elected as the leader. So a consumer will never see a message 
> that was not acknowledged to a producer that set acks to "all". Or to put it 
> another way, setting minISR alone will not prevent exposing uncommitted 
> messages - disabling unclean leader election is the stronger requirement. You 
> can achieve the same effect though by setting minISR equal to  the number of 
> replicas.
> There is also some stale documentation that needs to be removed:
> {quote}
> In our current release we choose the second strategy and favor choosing a 
> potentially inconsistent replica when all replicas in the ISR are dead. In 
> the future, we would like to make this configurable to better support use 
> cases where downtime is preferable to inconsistency.
> {quote}
> Finally, it was reported on the mailing list (from Elias Levy) that 
> compression.type should be added under the topic configs. Same goes for 
> unclean leader election. Would be good to have these auto-generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2960) DelayedProduce may cause message loss during repeated leader change

2016-03-11 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2960:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Issue resolved by pull request 1018
[https://github.com/apache/kafka/pull/1018]

> DelayedProduce may cause message loss during repeated leader change
> ---
>
> Key: KAFKA-2960
> URL: https://issues.apache.org/jira/browse/KAFKA-2960
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.0
>Reporter: Xing Huang
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.0.0
>
>
> Related to #KAFKA-1148
> When a leader replica becomes a follower and then leader again, it may have 
> truncated its log as a follower. The second time it becomes leader, its ISR 
> may shrink, and if new messages are appended at this moment, the 
> DelayedProduce generated when it was leader the first time may be satisfied 
> and the client will receive a response with no error - but the messages were 
> actually lost.
> We simulated this scenario, which proved the message loss can happen, and it 
> appears to be the cause of a recent data loss incident of ours according to 
> broker and client logs.
> I think we should check the leader epoch when sending a response, or satisfy 
> the DelayedProduce on leader change as described in #KAFKA-1148.
> We may also need a new error code to inform the producer about this error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-1148) Delayed fetch/producer requests should be satisfied on a leader change

2016-03-11 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-1148.
---
   Resolution: Fixed
Fix Version/s: 0.10.0.0

Issue resolved by pull request 1018
[https://github.com/apache/kafka/pull/1018]

> Delayed fetch/producer requests should be satisfied on a leader change
> --
>
> Key: KAFKA-1148
> URL: https://issues.apache.org/jira/browse/KAFKA-1148
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
> Fix For: 0.10.0.0
>
>
> Somewhat related to KAFKA-1016.
> This would be an issue only if max.wait is set to a very high value. When a 
> leader change occurs we should remove the delayed request from the purgatory 
> - either satisfy with error/expire - whichever makes more sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-3197) Producer can send message out of order even when in flight request is set to 1.

2016-03-08 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-3197:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed to trunk.

> Producer can send message out of order even when in flight request is set to 
> 1.
> ---
>
> Key: KAFKA-3197
> URL: https://issues.apache.org/jira/browse/KAFKA-3197
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 0.9.0.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>Priority: Blocker
> Fix For: 0.10.0.0
>
>
> The issue we saw is the following:
> 1. The producer sends message 0 to topic-partition-0 on broker A. The 
> in-flight request count to broker A is 1.
> 2. The request is somehow lost.
> 3. The producer refreshes its topic metadata and finds that the leader of 
> topic-partition-0 has migrated from broker A to broker B.
> 4. Because there is no in-flight request to broker B, all the subsequent 
> messages to topic-partition-0 in the record accumulator are sent to broker B.
> 5. Later, when the request in step (1) times out, message 0 is retried and 
> sent to broker B. At this point, all the later messages have already been 
> sent, so we get reordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-3298) Document unclean.leader.election.enable as a valid topic-level config

2016-02-26 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-3298.
---
Resolution: Duplicate

Thanks for the report. KAFKA-3234 addresses this and other such inconsistencies.

> Document unclean.leader.election.enable as a valid topic-level config
> -
>
> Key: KAFKA-3298
> URL: https://issues.apache.org/jira/browse/KAFKA-3298
> Project: Kafka
>  Issue Type: Bug
>  Components: website
>Reporter: Andrew Olson
>Priority: Minor
>
> The http://kafka.apache.org/documentation.html#topic-config section does not 
> currently include {{unclean.leader.election.enable}}. That is a valid 
> topic-level configuration property as demonstrated by this [1] test.
> [1] 
> https://github.com/apache/kafka/blob/0.9.0.1/core/src/test/scala/unit/kafka/integration/UncleanLeaderElectionTest.scala#L127



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3197) Producer can send message out of order even when in flight request is set to 1.

2016-02-25 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168065#comment-15168065
 ] 

Joel Koshy commented on KAFKA-3197:
---

Hi [~fpj] so your suggestion is to preemptively invoke the retransmission logic 
to retry the affected partitions? It can be done, but I think it would 
necessitate some weird APIs in {{InFlightRequests}} since as Becket notes, we 
would need to also proactively fish out partitions from {{InFlightRequests}} 
and retry those on the new leader.
I’m +1 on the patch apart from the minor comments, but will leave this open for 
a few more days in case anyone has further concerns or better ideas.

> Producer can send message out of order even when in flight request is set to 
> 1.
> ---
>
> Key: KAFKA-3197
> URL: https://issues.apache.org/jira/browse/KAFKA-3197
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 0.9.0.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> The issue we saw is the following:
> 1. The producer sends message 0 to topic-partition-0 on broker A. The 
> in-flight request count to broker A is 1.
> 2. The request is somehow lost.
> 3. The producer refreshes its topic metadata and finds that the leader of 
> topic-partition-0 has migrated from broker A to broker B.
> 4. Because there is no in-flight request to broker B, all the subsequent 
> messages to topic-partition-0 in the record accumulator are sent to broker B.
> 5. Later, when the request in step (1) times out, message 0 is retried and 
> sent to broker B. At this point, all the later messages have already been 
> sent, so we get reordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2670) add sampling rate to MirrorMaker

2016-02-18 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153180#comment-15153180
 ] 

Joel Koshy commented on KAFKA-2670:
---

It should be possible to accomplish this with a custom [consumer 
interceptor|https://cwiki.apache.org/confluence/display/KAFKA/KIP-42%3A+Add+Producer+and+Consumer+Interceptors]
 - i.e., probably no need for an explicit change in mirror maker.
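
For example, a minimal sketch of such a sampling interceptor using the KIP-42 
ConsumerInterceptor API, registered on the mirror-maker consumer via its 
interceptor.classes config; the "sampling.ratio" key is an assumption made up 
for this illustration:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class SamplingInterceptor<K, V> implements ConsumerInterceptor<K, V> {
    private double ratio = 1.0; // 1.0 == copy everything

    @Override
    public void configure(Map<String, ?> configs) {
        Object value = configs.get("sampling.ratio"); // hypothetical config key
        if (value != null)
            ratio = Double.parseDouble(value.toString());
    }

    @Override
    public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records) {
        // Keep roughly `ratio` of the records in each partition; the rest are
        // dropped before mirror maker re-produces them downstream.
        Map<TopicPartition, List<ConsumerRecord<K, V>>> kept = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<K, V>> sampled = new ArrayList<>();
            for (ConsumerRecord<K, V> record : records.records(tp))
                if (ThreadLocalRandom.current().nextDouble() < ratio)
                    sampled.add(record);
            if (!sampled.isEmpty())
                kept.put(tp, sampled);
        }
        return new ConsumerRecords<>(kept);
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }

    @Override
    public void close() { }
}
{code}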

> add sampling rate to MirrorMaker
> 
>
> Key: KAFKA-2670
> URL: https://issues.apache.org/jira/browse/KAFKA-2670
> Project: Kafka
>  Issue Type: Wish
>  Components: tools
>Reporter: Christian Tramnitz
>Priority: Minor
>
> MirrorMaker could be used to copy data to different Kafka instances in 
> different environments (i.e. from production to development or test), but 
> often these are not at the same scale as production. A sampling rate could be 
> introduced to MirrorMaker to define a ratio of data to copied (per topic) to 
> downstream instances. Of course this should be 1:1 (or 100%) per default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (KAFKA-3088) 0.9.0.0 broker crash on receipt of produce request with empty client ID

2016-02-16 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy closed KAFKA-3088.
-

Thanks for the patches - this has been pushed to trunk and 0.9

> 0.9.0.0 broker crash on receipt of produce request with empty client ID
> ---
>
> Key: KAFKA-3088
> URL: https://issues.apache.org/jira/browse/KAFKA-3088
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.9.0.0
>Reporter: Dave Peterson
>Assignee: Grant Henke
> Fix For: 0.9.1.0
>
>
> Sending a produce request with an empty client ID to a 0.9.0.0 broker causes 
> the broker to crash as shown below.  More details can be found in the 
> following email thread:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201601.mbox/%3c5693ecd9.4050...@dspeterson.com%3e
>[2016-01-10 23:03:44,957] ERROR [KafkaApi-3] error when handling request 
> Name: ProducerRequest; Version: 0; CorrelationId: 1; ClientId: null; 
> RequiredAcks: 1; AckTimeoutMs: 1 ms; TopicAndPartition: [topic_1,3] -> 37 
> (kafka.server.KafkaApis)
>java.lang.NullPointerException
>   at 
> org.apache.kafka.common.metrics.JmxReporter.getMBeanName(JmxReporter.java:127)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.addAttribute(JmxReporter.java:106)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:76)
>   at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:288)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:177)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:162)
>   at 
> kafka.server.ClientQuotaManager.getOrCreateQuotaSensors(ClientQuotaManager.scala:209)
>   at 
> kafka.server.ClientQuotaManager.recordAndMaybeThrottle(ClientQuotaManager.scala:111)
>   at 
> kafka.server.KafkaApis.kafka$server$KafkaApis$$sendResponseCallback$2(KafkaApis.scala:353)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:348)
>   at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:366)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>   at 
> kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3088) 0.9.0.0 broker crash on receipt of produce request with empty client ID

2016-02-12 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145738#comment-15145738
 ] 

Joel Koshy commented on KAFKA-3088:
---

[~granthenke] I'm leaving this open for 0.9 (i.e., 0.9.1.0)

> 0.9.0.0 broker crash on receipt of produce request with empty client ID
> ---
>
> Key: KAFKA-3088
> URL: https://issues.apache.org/jira/browse/KAFKA-3088
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.9.0.0
>Reporter: Dave Peterson
>Assignee: Grant Henke
> Fix For: 0.9.1.0
>
>
> Sending a produce request with an empty client ID to a 0.9.0.0 broker causes 
> the broker to crash as shown below.  More details can be found in the 
> following email thread:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201601.mbox/%3c5693ecd9.4050...@dspeterson.com%3e
>[2016-01-10 23:03:44,957] ERROR [KafkaApi-3] error when handling request 
> Name: ProducerRequest; Version: 0; CorrelationId: 1; ClientId: null; 
> RequiredAcks: 1; AckTimeoutMs: 1 ms; TopicAndPartition: [topic_1,3] -> 37 
> (kafka.server.KafkaApis)
>java.lang.NullPointerException
>   at 
> org.apache.kafka.common.metrics.JmxReporter.getMBeanName(JmxReporter.java:127)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.addAttribute(JmxReporter.java:106)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:76)
>   at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:288)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:177)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:162)
>   at 
> kafka.server.ClientQuotaManager.getOrCreateQuotaSensors(ClientQuotaManager.scala:209)
>   at 
> kafka.server.ClientQuotaManager.recordAndMaybeThrottle(ClientQuotaManager.scala:111)
>   at 
> kafka.server.KafkaApis.kafka$server$KafkaApis$$sendResponseCallback$2(KafkaApis.scala:353)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:348)
>   at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:366)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>   at 
> kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-3234) Minor documentation edits: clarify minISR; some topic-level configs are missing

2016-02-11 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-3234:
-

 Summary: Minor documentation edits: clarify minISR; some 
topic-level configs are missing
 Key: KAFKA-3234
 URL: https://issues.apache.org/jira/browse/KAFKA-3234
 Project: Kafka
  Issue Type: Improvement
  Components: website
Reporter: Joel Koshy
Assignee: Joel Koshy
 Fix For: 0.9.0.1


Based on an offline conversation with [~junrao] and [~gwenshap]

The current documentation is somewhat confusing on minISR in that it says that 
it offers a trade-off between consistency and availability. From the user's 
view-point, consistency (at least in the usual sense of the term) is achieved 
by disabling unclean leader election - since no replica that was out of ISR can 
be elected as the leader. So a consumer will never see a message that was not 
acknowledged to a producer that set acks to "all". Or to put it another way, 
setting minISR alone will not prevent exposing uncommitted messages - disabling 
unclean leader election is the stronger requirement. You can achieve the same 
effect though by setting minISR equal to  the number of replicas.

There is also some stale documentation that needs to be removed:

{quote}
In our current release we choose the second strategy and favor choosing a 
potentially inconsistent replica when all replicas in the ISR are dead. In the 
future, we would like to make this configurable to better support use cases 
where downtime is preferable to inconsistency.
{quote}

Finally, it was reported on the mailing list (from Elias Levy) that 
compression.type should be added under the topic configs. Same goes for unclean 
leader election. Would be good to have these auto-generated.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2750) Sender.java: handleProduceResponse does not check protocol version

2016-02-09 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140105#comment-15140105
 ] 

Joel Koshy commented on KAFKA-2750:
---

[~felixgv] You mean retry with the older protocol? It does not do it currently 
- i.e., it always sends with the latest version. However, it should be possible 
after we support protocol version querying:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-35+-+Retrieving+protocol+version
 Until that is available we only recommend upgrading the clients _after_ the 
brokers.

> Sender.java: handleProduceResponse does not check protocol version
> --
>
> Key: KAFKA-2750
> URL: https://issues.apache.org/jira/browse/KAFKA-2750
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.0
>Reporter: Geoff Anderson
>
> If you try to run a 0.9 producer against a 0.8.2.2 Kafka broker, you get a 
> fairly cryptic error message:
> [2015-11-04 18:55:43,583] ERROR Uncaught error in kafka producer I/O thread:  
> (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
> 'throttle_time_ms': java.nio.BufferUnderflowException
>   at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:71)
>   at 
> org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:462)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:279)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:216)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:141)
> Although we shouldn't expect a 0.9 producer to work against a 0.8.X broker 
> since the protocol version has been increased, perhaps the error could be 
> clearer.
> The cause seems to be that in Sender.java, handleProduceResponse does not 
> have any mechanism for checking the protocol version of the received produce 
> response - it just calls a constructor that blindly tries to read the 
> throttle time field, which in this case fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-3179) Kafka consumer delivers message whose offset is earlier than sought offset.

2016-02-04 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-3179:
--
   Resolution: Fixed
Fix Version/s: (was: 0.9.0.1)
   0.9.1.0
   Status: Resolved  (was: Patch Available)

Issue resolved by pull request 842
[https://github.com/apache/kafka/pull/842]

> Kafka consumer delivers message whose offset is earlier than sought offset.
> ---
>
> Key: KAFKA-3179
> URL: https://issues.apache.org/jira/browse/KAFKA-3179
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.9.0.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.9.1.0
>
>
> This problem is reproducible by seeking to the middle of a compressed message 
> set. Because KafkaConsumer does not filter out the messages earlier than the 
> sought offset within the compressed message set, the message returned to the 
> user will always be the first message in the compressed message set instead 
> of the message the user sought to.
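
For context, a minimal sketch (hypothetical helper) of the user-facing contract 
this fix restores: after a seek, no returned record should have an offset 
smaller than the sought offset, even if the fetched compressed message set 
starts earlier.

{code}
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekIntoCompressedSet {
    static void verify(KafkaConsumer<String, String> consumer,
                       TopicPartition tp, long target) {
        consumer.assign(Collections.singletonList(tp));
        consumer.seek(tp, target);
        // Every record returned after the seek should be at or past `target`.
        for (ConsumerRecord<String, String> record :
                consumer.poll(1000).records(tp)) {
            if (record.offset() < target)
                throw new IllegalStateException(
                    "Got offset " + record.offset() + " before " + target);
        }
    }
}
{code}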



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3197) Producer can send message out of order even when in flight request is set to 1.

2016-02-03 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130785#comment-15130785
 ] 

Joel Koshy commented on KAFKA-3197:
---

Sorry - hit enter too soon. "in that it does not specifically say that it can 
be used to prevent reordering within a partition"

> Producer can send message out of order even when in flight request is set to 
> 1.
> ---
>
> Key: KAFKA-3197
> URL: https://issues.apache.org/jira/browse/KAFKA-3197
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 0.9.0.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.9.0.1
>
>
> The issue we saw is the following:
> 1. The producer sends message 0 to topic-partition-0 on broker A. The 
> in-flight request count to broker A is 1.
> 2. The request is somehow lost.
> 3. The producer refreshes its topic metadata and finds that the leader of 
> topic-partition-0 has migrated from broker A to broker B.
> 4. Because there is no in-flight request to broker B, all the subsequent 
> messages to topic-partition-0 in the record accumulator are sent to broker B.
> 5. Later, when the request in step (1) times out, message 0 is retried and 
> sent to broker B. At this point, all the later messages have already been 
> sent, so we get reordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3197) Producer can send message out of order even when in flight request is set to 1.

2016-02-03 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130781#comment-15130781
 ] 

Joel Koshy commented on KAFKA-3197:
---

[~enothereska] - the documentation is accurate in that it does not 
specifically say that it can be used to prevent reordering within a partition. 
The reality though is that everyone (or at least most users) interprets that 
to mean it is possible to achieve strict ordering within a partition, which is 
necessary for several use-cases.
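
For reference, a minimal sketch (class name and values are illustrative) of the 
configuration most users reach for when they expect strict per-partition 
ordering - which, per this ticket, was still not sufficient before the fix:

{code}
import java.util.Properties;

public class StrictOrderingProducerConfig {
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // One outstanding request per connection plus retries is what users
        // typically interpret as "no reordering within a partition".
        props.put("max.in.flight.requests.per.connection", "1");
        props.put("retries", "3");
        props.put("acks", "all");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
{code}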


> Producer can send message out of order even when in flight request is set to 
> 1.
> ---
>
> Key: KAFKA-3197
> URL: https://issues.apache.org/jira/browse/KAFKA-3197
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 0.9.0.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.9.0.1
>
>
> The issue we saw is the following:
> 1. The producer sends message 0 to topic-partition-0 on broker A. The 
> in-flight request count to broker A is 1.
> 2. The request is somehow lost.
> 3. The producer refreshes its topic metadata and finds that the leader of 
> topic-partition-0 has migrated from broker A to broker B.
> 4. Because there is no in-flight request to broker B, all the subsequent 
> messages to topic-partition-0 in the record accumulator are sent to broker B.
> 5. Later, when the request in step (1) times out, message 0 is retried and 
> sent to broker B. At this point, all the later messages have already been 
> sent, so we get reordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3088) 0.9.0.0 broker crash on receipt of produce request with empty client ID

2016-01-21 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111679#comment-15111679
 ] 

Joel Koshy commented on KAFKA-3088:
---

[~aauradkar] can chime in since this is related to code that he touched for 
quotas, but personally I would prefer just preserving the old behavior and 
treating empty client-ids as a default ANONYMOUS or similar. Also, since we 
quota by client-id it behooves clients to set their client-id properly. If we 
want to change this behavior and start _requiring_ client-ids then perhaps a 
short DISCUSS thread is in order.
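
A minimal sketch of the "default identity" idea above (the ANONYMOUS name and 
the helper are illustrative, not the actual broker code):

{code}
public class ClientIdSanitizer {
    // Fold a missing or empty client-id into a default identity so that
    // quota sensor registration never has to deal with null or "".
    static String quotaId(String clientId) {
        return (clientId == null || clientId.isEmpty()) ? "ANONYMOUS" : clientId;
    }
}
{code}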

> 0.9.0.0 broker crash on receipt of produce request with empty client ID
> ---
>
> Key: KAFKA-3088
> URL: https://issues.apache.org/jira/browse/KAFKA-3088
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.9.0.0
>Reporter: Dave Peterson
>Assignee: Jun Rao
>
> Sending a produce request with an empty client ID to a 0.9.0.0 broker causes 
> the broker to crash as shown below.  More details can be found in the 
> following email thread:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201601.mbox/%3c5693ecd9.4050...@dspeterson.com%3e
>[2016-01-10 23:03:44,957] ERROR [KafkaApi-3] error when handling request 
> Name: ProducerRequest; Version: 0; CorrelationId: 1; ClientId: null; 
> RequiredAcks: 1; AckTimeoutMs: 1 ms; TopicAndPartition: [topic_1,3] -> 37 
> (kafka.server.KafkaApis)
>java.lang.NullPointerException
>   at 
> org.apache.kafka.common.metrics.JmxReporter.getMBeanName(JmxReporter.java:127)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.addAttribute(JmxReporter.java:106)
>   at 
> org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:76)
>   at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:288)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:177)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:162)
>   at 
> kafka.server.ClientQuotaManager.getOrCreateQuotaSensors(ClientQuotaManager.scala:209)
>   at 
> kafka.server.ClientQuotaManager.recordAndMaybeThrottle(ClientQuotaManager.scala:111)
>   at 
> kafka.server.KafkaApis.kafka$server$KafkaApis$$sendResponseCallback$2(KafkaApis.scala:353)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.KafkaApis$$anonfun$handleProducerRequest$1.apply(KafkaApis.scala:371)
>   at 
> kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:348)
>   at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:366)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>   at 
> kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2758) Improve Offset Commit Behavior

2016-01-21 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111675#comment-15111675
 ] 

Joel Koshy commented on KAFKA-2758:
---

Unless I'm misunderstanding, I don't think we should do (1) - i.e., ideally the 
"age" of the commit should be the same for all partitions. If you fail to 
commit offsets for some partitions, those offsets could get compacted out 
early (and possibly removed entirely). See KAFKA-1510.
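
To illustrate the concern, a minimal sketch (assuming a plain KafkaConsumer) of 
committing every assigned partition - including those whose position has not 
advanced - so that all committed offsets share the same age in the compacted 
offsets topic:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitAllPartitions {
    static void commitAll(KafkaConsumer<?, ?> consumer) {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        // Commit the current position of every assigned partition, even if
        // it equals the previously committed offset.
        for (TopicPartition tp : consumer.assignment())
            offsets.put(tp, new OffsetAndMetadata(consumer.position(tp)));
        consumer.commitSync(offsets);
    }
}
{code}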

> Improve Offset Commit Behavior
> --
>
> Key: KAFKA-2758
> URL: https://issues.apache.org/jira/browse/KAFKA-2758
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Guozhang Wang
>  Labels: newbiee
> Fix For: 0.9.0.1
>
>
> There are two scenarios of offset committing that we can improve:
> 1) we can filter out the partitions whose committed offset is equal to the 
> consumed offset, meaning there are no newly consumed messages from those 
> partitions and hence we do not need to include them in the commit request.
> 2) we can make a commit request right after resetting to a fetch / consume 
> position, either according to the reset policy (e.g. on consumer startup, or 
> when handling an out-of-range offset, etc.) or through {code}seek{code}, so 
> that if the consumer fails right after this event, upon recovery it can 
> restart from the reset position instead of resetting again. Otherwise this 
> can lead to, for example, data loss if we use "largest" as the reset policy 
> while new messages are arriving on the fetched partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-3124) Update protocol wiki page to reflect latest request/response formats

2016-01-20 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-3124:
-

 Summary: Update protocol wiki page to reflect latest 
request/response formats
 Key: KAFKA-3124
 URL: https://issues.apache.org/jira/browse/KAFKA-3124
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Joel Koshy


The protocol wiki 
(https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol)
 is slightly out of date. It does not have some of the newer request/response 
formats.

We should actually figure out a way to _source_ the protocol definitions from 
the last official release version into that wiki.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2863) Authorizer should provide lifecycle (shutdown) methods

2015-11-19 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-2863:
-

 Summary: Authorizer should provide lifecycle (shutdown) methods
 Key: KAFKA-2863
 URL: https://issues.apache.org/jira/browse/KAFKA-2863
 Project: Kafka
  Issue Type: Improvement
  Components: security
Reporter: Joel Koshy
Assignee: Parth Brahmbhatt
 Fix For: 0.9.0.1


Authorizer supports configure, but has no shutdown hook. This would be useful 
for non-trivial authorizers that need to do some cleanup (e.g., shutting down 
thread pools and such) on broker shutdown.
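
A hypothetical sketch of what such a lifecycle-aware interface could look like 
(the names are assumptions, not the actual Kafka Authorizer API):

{code}
import java.io.Closeable;
import java.util.Map;

public interface LifecycleAwareAuthorizer extends Closeable {
    // Called once on startup with the broker's authorizer configs.
    void configure(Map<String, ?> configs);

    // Called once during broker shutdown to release thread pools,
    // connections, caches, etc.
    @Override
    void close();
}
{code}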



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2664) Adding a new metric with several pre-existing metrics is very expensive

2015-10-29 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2664.
---
   Resolution: Fixed
Fix Version/s: (was: 0.9.0.1)
   0.9.0.0

Issue resolved by pull request 369
[https://github.com/apache/kafka/pull/369]

> Adding a new metric with several pre-existing metrics is very expensive
> ---
>
> Key: KAFKA-2664
> URL: https://issues.apache.org/jira/browse/KAFKA-2664
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Aditya Auradkar
> Fix For: 0.9.0.0
>
>
> I know the summary sounds expected, but we recently ran into a socket server 
> request queue backup that I suspect was caused by a combination of improperly 
> implemented applications that reconnect with a different (random) client-id 
> each time; and the fact that for quotas we now register a new quota 
> metric-set for each client-id.
> So here is what happened: a broker went down and a handful of other brokers 
> started seeing queue times go up significantly. This caused the request 
> queue to back up, which caused socket timeouts and a further deluge of 
> reconnects. The only way we could get out of this was to fire-wall the broker 
> and downgrade to a version without quotas (or I think it would have worked to 
> just restart the broker).
> My guess is that there were a ton of pre-existing client-id metrics. I don’t 
> know for sure but I’m basing that on the fact that there were several new 
> unique client-ids showing up in the public access logs and request local 
> times for fetches started going up inexplicably. (It would have been useful 
> to have a metric for the number of metrics.) So it turns out that in the 
> above scenario (with say 50k pre-existing client-ids), the avg local time for 
> fetch can go up to the order of 50-100ms (at least with tests on a linux box) 
> largely due to the time taken to create new metrics; and that’s because we 
> use a copy-on-write map underneath. If you have enough (say, hundreds) of 
> clients re-connecting at the same time with new client-id's, that can cause 
> the request queues to start backing up and the overall queuing system to 
> become unstable; and the line starts to spill out of the building.
> I think this is a fairly new scenario with quotas - i.e., I don’t think the 
> past per-X metrics (per-topic for e.g.,) creation rate would ever come this 
> close.
> To be clear, the clients are clearly doing the wrong thing but I think the 
> broker can and should protect itself adequately against such rogue scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2663) Add quota-delay time to request processing time break-up

2015-10-29 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2663.
---
   Resolution: Fixed
Fix Version/s: (was: 0.9.0.1)
   0.9.0.0

Issue resolved by pull request 369
[https://github.com/apache/kafka/pull/369]

> Add quota-delay time to request processing time break-up
> 
>
> Key: KAFKA-2663
> URL: https://issues.apache.org/jira/browse/KAFKA-2663
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Aditya Auradkar
> Fix For: 0.9.0.0
>
>
> This is probably not critical for 0.9 but should be easy to fix:
> If a request is delayed due to quotas, I think the remote time will go up 
> artificially - or maybe response queue time (haven’t checked). We should add 
> a new quotaDelayTime to the request handling time break-up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2235) LogCleaner offset map overflow

2015-10-26 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974502#comment-14974502
 ] 

Joel Koshy commented on KAFKA-2235:
---

Agreed with Todd - btw, this is very similar to the proposal in 
https://issues.apache.org/jira/browse/KAFKA-1755?focusedCommentId=14216486

> LogCleaner offset map overflow
> --
>
> Key: KAFKA-2235
> URL: https://issues.apache.org/jira/browse/KAFKA-2235
> Project: Kafka
>  Issue Type: Bug
>  Components: core, log
>Affects Versions: 0.8.1, 0.8.2.0
>Reporter: Ivan Simoneko
>Assignee: Ivan Simoneko
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2235_v1.patch, KAFKA-2235_v2.patch
>
>
> We've seen log cleaning generate an error for a topic with lots of small 
> messages. It seems that an offset map overflow is possible if a log segment 
> contains more unique keys than there are empty slots in the offsetMap. 
> Checking baseOffset and map utilization before processing a segment seems to 
> be insufficient because it doesn't take the segment size (number of unique 
> messages in the segment) into account.
> I suggest estimating the upper bound of keys in a segment as the number of 
> messages in the segment and comparing it with the number of available slots 
> in the map (keeping in mind the desired load factor). That should work in 
> cases where an empty map is capable of holding all the keys of a single 
> segment. If even a single segment cannot fit into an empty map, the cleanup 
> process will still fail. Perhaps there should be a limit on the log segment 
> entry count?
> Here is the stack trace for this error:
> 2015-05-19 16:52:48,758 ERROR [kafka-log-cleaner-thread-0] 
> kafka.log.LogCleaner - [kafka-log-cleaner-thread-0], Error due to
> java.lang.IllegalArgumentException: requirement failed: Attempt to add a new 
> entry to a full offset map.
>at scala.Predef$.require(Predef.scala:233)
>at kafka.log.SkimpyOffsetMap.put(OffsetMap.scala:79)
>at 
> kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:543)
>at 
> kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:538)
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
>at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>at kafka.message.MessageSet.foreach(MessageSet.scala:67)
>at 
> kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:538)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:515)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:512)
>at scala.collection.immutable.Stream.foreach(Stream.scala:547)
>at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:512)
>at kafka.log.Cleaner.clean(LogCleaner.scala:307)
>at 
> kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:221)
>at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:199)
>at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2235) LogCleaner offset map overflow

2015-10-23 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971728#comment-14971728
 ] 

Joel Koshy commented on KAFKA-2235:
---

While I think the change is well-motivated I'm not sure this is the right fix 
for this issue as the check is too conservative. i.e., especially with highly 
compressible messages the message-count in the segment may be extremely high 
but the unique-key-count may be low.
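
For reference, a toy sketch of the kind of check being discussed (not 
necessarily the patch as merged); as noted above, using the raw message count 
as the key estimate is overly conservative for highly compressible segments 
with few unique keys:

{code}
public class OffsetMapCapacityCheck {
    // Treat every message in the segment as a potentially unique key and
    // only process the segment if it still fits under the desired load factor.
    static boolean segmentFits(long segmentMessageCount, int mapSlots,
                               double desiredLoadFactor, int slotsAlreadyUsed) {
        long available = (long) (mapSlots * desiredLoadFactor) - slotsAlreadyUsed;
        return segmentMessageCount <= available;
    }
}
{code}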

> LogCleaner offset map overflow
> --
>
> Key: KAFKA-2235
> URL: https://issues.apache.org/jira/browse/KAFKA-2235
> Project: Kafka
>  Issue Type: Bug
>  Components: core, log
>Affects Versions: 0.8.1, 0.8.2.0
>Reporter: Ivan Simoneko
>Assignee: Ivan Simoneko
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2235_v1.patch, KAFKA-2235_v2.patch
>
>
> We've seen log cleaning generate an error for a topic with lots of small 
> messages. It seems that an offset map overflow is possible if a log segment 
> contains more unique keys than there are empty slots in the offsetMap. 
> Checking baseOffset and map utilization before processing a segment seems to 
> be insufficient because it doesn't take the segment size (number of unique 
> messages in the segment) into account.
> I suggest estimating the upper bound of keys in a segment as the number of 
> messages in the segment and comparing it with the number of available slots 
> in the map (keeping in mind the desired load factor). That should work in 
> cases where an empty map is capable of holding all the keys of a single 
> segment. If even a single segment cannot fit into an empty map, the cleanup 
> process will still fail. Perhaps there should be a limit on the log segment 
> entry count?
> Here is the stack trace for this error:
> 2015-05-19 16:52:48,758 ERROR [kafka-log-cleaner-thread-0] 
> kafka.log.LogCleaner - [kafka-log-cleaner-thread-0], Error due to
> java.lang.IllegalArgumentException: requirement failed: Attempt to add a new 
> entry to a full offset map.
>at scala.Predef$.require(Predef.scala:233)
>at kafka.log.SkimpyOffsetMap.put(OffsetMap.scala:79)
>at 
> kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:543)
>at 
> kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:538)
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
>at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>at kafka.message.MessageSet.foreach(MessageSet.scala:67)
>at 
> kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:538)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:515)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:512)
>at scala.collection.immutable.Stream.foreach(Stream.scala:547)
>at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:512)
>at kafka.log.Cleaner.clean(LogCleaner.scala:307)
>at 
> kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:221)
>at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:199)
>at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2454) Dead lock between delete log segment and shutting down.

2015-10-21 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2454:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Issue resolved by pull request 153
[https://github.com/apache/kafka/pull/153]

> Dead lock between delete log segment and shutting down.
> ---
>
> Key: KAFKA-2454
> URL: https://issues.apache.org/jira/browse/KAFKA-2454
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.9.0.0
>
>
> When the broker shuts down, it shuts down the scheduler, which grabs the 
> scheduler lock and then waits for all the threads in the scheduler to finish.
> The deadlock happens when a scheduled task tries to delete an old log 
> segment: it schedules a log-delete task, which also needs to acquire the 
> scheduler lock. In this case the shutdown thread holds the scheduler lock and 
> waits for the log deletion thread to finish, but the log deletion thread 
> blocks waiting for the scheduler lock.
> Related stack trace:
> {noformat}
> "Thread-1" #21 prio=5 os_prio=0 tid=0x7fe7601a7000 nid=0x1a4e waiting on 
> condition [0x7fe7cf698000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000640d53540> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
> at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:94)
> - locked <0x000640b6d480> (a kafka.utils.KafkaScheduler)
> at 
> kafka.server.KafkaServer$$anonfun$shutdown$4.apply$mcV$sp(KafkaServer.scala:352)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:79)
> at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
> at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:51)
> at kafka.utils.Logging$class.swallow(Logging.scala:94)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:51)
> at kafka.server.KafkaServer.shutdown(KafkaServer.scala:352)
> at 
> kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
> at com.linkedin.kafka.KafkaServer.notifyShutdown(KafkaServer.java:99)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyShutdownListener(LifeCycleMgr.java:123)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyListeners(LifeCycleMgr.java:102)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyStop(LifeCycleMgr.java:82)
> - locked <0x000640b77bb0> (a java.util.ArrayDeque)
> at com.linkedin.util.factory.Generator.stop(Generator.java:177)
> - locked <0x000640b77bc8> (a java.lang.Object)
> at 
> com.linkedin.offspring.servlet.OffspringServletRuntime.destroy(OffspringServletRuntime.java:82)
> at 
> com.linkedin.offspring.servlet.OffspringServletContextListener.contextDestroyed(OffspringServletContextListener.java:51)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doStop(ContextHandler.java:813)
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.doStop(ServletContextHandler.java:160)
> at 
> org.eclipse.jetty.webapp.WebAppContext.doStop(WebAppContext.java:516)
> at com.linkedin.emweb.WebappContext.doStop(WebappContext.java:35)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006400018b8> (a java.lang.Object)
> at 
> com.linkedin.emweb.ContextBasedHandlerImpl.doStop(ContextBasedHandlerImpl.java:112)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x000640001900> (a java.lang.Object)
> at 
> com.linkedin.emweb.WebappDeployerImpl.stop(WebappDeployerImpl.java:349)
> at 
> com.linkedin.emweb.WebappDeployerImpl.doStop(WebappDeployerImpl.java:414)
> - locked <0x0006400019c0> (a 
> com.linkedin.emweb.MapBasedHandlerImpl)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006404ee8e8> (a java.lang.Object)
> at 
> org.eclipse.jetty.util.component.AggregateLifeCycle.doStop(AggregateLifeCycle.java:107)
> at 
> org.eclipse.jetty.server.handler.AbstractHandler.doStop(AbstractHandler.java:69)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.doStop(Handle

[jira] [Created] (KAFKA-2668) Add a metric that records the total number of metrics

2015-10-16 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-2668:
-

 Summary: Add a metric that records the total number of metrics
 Key: KAFKA-2668
 URL: https://issues.apache.org/jira/browse/KAFKA-2668
 Project: Kafka
  Issue Type: Improvement
Reporter: Joel Koshy
 Fix For: 0.9.1


Sounds recursive and weird, but this would have been useful while debugging 
KAFKA-2664



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2664) Adding a new metric with several pre-existing metrics is very expensive

2015-10-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961595#comment-14961595
 ] 

Joel Koshy commented on KAFKA-2664:
---

[~gwenshap] in general yes it could, but if we did register per-connection 
metrics it would be unlikely to cause as much of an issue as per-client-id 
metrics do when clients improperly generate a new client-id for every 
reconnect. This is because you would (typically) have on the order of a few 
hundred or low thousands of connection-ids; and once those have been 
registered you wouldn't need to add any more even if many of those clients 
reconnect frequently. That said, it is currently disabled (i.e., we don't 
register per-connection metrics) in the server-side selector.

bq.1. Can you specify which git-hash you reverted to?

The version we rolled back to does include KAFKA-1928 (if that's what you are 
asking) as well as multi-port support; but since those per-connection metrics 
are disabled it is probably irrelevant here.

bq. 2. Did you profile the connection? Or is this an educated guess of where 
time went?

I forgot to mention this above, but after the above episode, when a mild 
suspicion fell on quota metrics, I ran a separate stress test for them - the 
easiest way to observe this is to synthetically call 
{{QuotaManager.recordAndMaybeThrottle}} in a loop and profile it. Most of the 
time is spent in the copy-on-write map and map resizes. So yes, it was an 
educated guess until today... because today we deliberately reproduced this in 
production and attached a profiler to the broker to verify that the higher 
local times were due to the creation of per-client-id quota metrics.
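
A synthetic illustration (not the broker code path) of why registration cost 
grows with the number of pre-existing entries when the underlying map is 
copy-on-write - inserting the N-th entry copies all N-1 existing ones, so the 
total work is quadratic:

{code}
import java.util.HashMap;
import java.util.Map;

public class CopyOnWriteCost {
    public static void main(String[] args) {
        Map<String, Object> metrics = new HashMap<>();
        long start = System.nanoTime();
        // Scaled down from the ~50k client-ids in the description to keep
        // the run short; the quadratic trend is already visible here.
        for (int i = 0; i < 20_000; i++) {
            Map<String, Object> copy = new HashMap<>(metrics); // copy-on-write
            copy.put("client-" + i, new Object());
            metrics = copy;
        }
        System.out.printf("20k copy-on-write inserts took %d ms%n",
            (System.nanoTime() - start) / 1_000_000);
    }
}
{code}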

> Adding a new metric with several pre-existing metrics is very expensive
> ---
>
> Key: KAFKA-2664
> URL: https://issues.apache.org/jira/browse/KAFKA-2664
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
> Fix For: 0.9.0.1
>
>
> I know the summary sounds expected, but we recently ran into a socket server 
> request queue backup that I suspect was caused by a combination of improperly 
> implemented applications that reconnect with a different (random) client-id 
> each time; and the fact that for quotas we now register a new quota 
> metric-set for each client-id.
> So here is what happened: a broker went down and a handful of other brokers 
> started seeing queue times go up significantly. This caused the request 
> queue to back up, which caused socket timeouts and a further deluge of 
> reconnects. The only way we could get out of this was to fire-wall the broker 
> and downgrade to a version without quotas (or I think it would have worked to 
> just restart the broker).
> My guess is that there were a ton of pre-existing client-id metrics. I don’t 
> know for sure but I’m basing that on the fact that there were several new 
> unique client-ids showing up in the public access logs and request local 
> times for fetches started going up inexplicably. (It would have been useful 
> to have a metric for the number of metrics.) So it turns out that in the 
> above scenario (with say 50k pre-existing client-ids), the avg local time for 
> fetch can go up to the order of 50-100ms (at least with tests on a linux box) 
> largely due to the time taken to create new metrics; and that’s because we 
> use a copy-on-write map underneath. If you have enough (say, hundreds) of 
> clients re-connecting at the same time with new client-id's, that can cause 
> the request queues to start backing up and the overall queuing system to 
> become unstable; and the line starts to spill out of the building.
> I think this is a fairly new scenario with quotas - i.e., I don’t think the 
> past per-X metrics (per-topic for e.g.,) creation rate would ever come this 
> close.
> To be clear, the clients are clearly doing the wrong thing but I think the 
> broker can and should protect itself adequately against such rogue scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2664) Adding a new metric with several pre-existing metrics is very expensive

2015-10-16 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-2664:
-

 Summary: Adding a new metric with several pre-existing metrics is 
very expensive
 Key: KAFKA-2664
 URL: https://issues.apache.org/jira/browse/KAFKA-2664
 Project: Kafka
  Issue Type: Bug
Reporter: Joel Koshy
 Fix For: 0.9.0.1


I know the summary sounds expected, but we recently ran into a socket server 
request queue backup that I suspect was caused by a combination of improperly 
implemented applications that reconnect with a different (random) client-id 
each time; and the fact that for quotas we now register a new quota metric-set 
for each client-id.

So here is what happened: a broker went down and a handful of other brokers 
started seeing queue times go up significantly. This caused the request queue 
to back up, which caused socket timeouts and a further deluge of reconnects. The 
only way we could get out of this was to fire-wall the broker and downgrade to 
a version without quotas (or I think it would have worked to just restart the 
broker).

My guess is that there were a ton of pre-existing client-id metrics. I don’t 
know for sure but I’m basing that on the fact that there were several new 
unique client-ids showing up in the public access logs and request local times 
for fetches started going up inexplicably. (It would have been useful to have a 
metric for the number of metrics.) So it turns out that in the above scenario 
(with say 50k pre-existing client-ids), the avg local time for fetch can go up 
to the order of 50-100ms (at least with tests on a linux box) largely due to 
the time taken to create new metrics; and that’s because we use a copy-on-write 
map underneath. If you have enough (say, hundreds) of clients re-connecting at 
the same time with new client-id's, that can cause the request queues to start 
backing up and the overall queuing system to become unstable; and the line 
starts to spill out of the building.

I think this is a fairly new scenario with quotas - i.e., I don’t think the 
past per-X metrics (per-topic for e.g.,) creation rate would ever come this 
close.

To be clear, the clients are clearly doing the wrong thing but I think the 
broker can and should protect itself adequately against such rogue scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2663) Add quota-delay time to request processing time break-up

2015-10-16 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-2663:
-

 Summary: Add quota-delay time to request processing time break-up
 Key: KAFKA-2663
 URL: https://issues.apache.org/jira/browse/KAFKA-2663
 Project: Kafka
  Issue Type: Bug
Reporter: Joel Koshy
Assignee: Aditya Auradkar
 Fix For: 0.9.0.1


This is probably not critical for 0.9 but should be easy to fix:

If a request is delayed due to quotas, I think the remote time will go up 
artificially - or maybe response queue time (haven’t checked). We should add a 
new quotaDelayTime to the request handling time break-up.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2660) Correct cleanableRatio calculation

2015-10-15 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2660:
--
Reviewer: Joel Koshy

> Correct cleanableRatio calculation
> --
>
> Key: KAFKA-2660
> URL: https://issues.apache.org/jira/browse/KAFKA-2660
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Dong Lin
> Fix For: 0.9.0.0
>
>
> There is a bug in LogToClean that causes cleanableRatio to be over-estimated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-15 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959392#comment-14959392
 ] 

Joel Koshy commented on KAFKA-2017:
---

btw, I just realized a migration as described above probably won't work - since 
we would need all consumers to participate in a rebalance to migrate state from 
zookeeper to Kafka (if we ever want to do that). We could think of other 
migration strategies but if there is no easy migration strategy then I think it 
is more important to make a sound choice now that most people are comfortable 
with. To summarize:
* I think persistence in ZK is okay and should work
* Persistence in Kafka seems better, but implementation in Kafka is a little 
more work.
* We definitely need persistence in 0.9.
* If we do persistence in ZK and discover use-cases where it doesn't scale then 
migration may be difficult.

Okay so those are the observations. Anything else? Basically IMO it boils down 
to an okay solution vs a good solution. i.e., I personally think it is okay to 
do it in ZK and I'm not uncomfortable with it. If we are cool with spending 
some more time to implement in Kafka I would prefer that approach.

> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-15 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959266#comment-14959266
 ] 

Joel Koshy commented on KAFKA-2017:
---

bq.  But if we really want to have group metadata persisted in 0.9 and don't 
want to miss the planned release again.

Yes I think we definitely need it in 0.9 - so if people really think it is 
expedient and less prone to error to do it in ZK then maybe that is what we 
should do but I'm not sure if everyone is convinced about that.

I don't think migration will be painful since it is all server-side - although 
for a dual-write migration scheme to work you would need some sequence number 
as well in each persisted state entry (e.g., for offsets we use the offset 
itself) in order to decide which one is more recent.
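
To illustrate what I mean (a hypothetical sketch only - the names and types are made up, 
not a concrete proposal):

{code}
// Hypothetical persisted group-state entry carrying a sequence number for dual-write migration.
public final class GroupStateEntry {
    public final long sequence;      // e.g., a generation or monotonically increasing counter
    public final byte[] stateBytes;  // opaque group state

    public GroupStateEntry(long sequence, byte[] stateBytes) {
        this.sequence = sequence;
        this.stateBytes = stateBytes;
    }

    // Pick whichever copy (ZK vs Kafka) is more recent; null-safe if one store has no entry yet.
    public static GroupStateEntry moreRecent(GroupStateEntry zkCopy, GroupStateEntry kafkaCopy) {
        if (zkCopy == null) return kafkaCopy;
        if (kafkaCopy == null) return zkCopy;
        return kafkaCopy.sequence >= zkCopy.sequence ? kafkaCopy : zkCopy;
    }
}
{code}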

> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-15 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958442#comment-14958442
 ] 

Joel Koshy commented on KAFKA-2017:
---

Ashish, that is correct - consumers can still remain detached from ZK. I think 
going with ZK should work fine but the main point of this exercise is to 
convince ourselves one way or the other. i.e., that persisting in ZK is better; 
or persisting in Kafka is better; or even if persisting in Kafka is better, 
that it is not worth doing it in Kafka at this time.

I agree that the write rate will be low most of the time. So there is mostly no 
concern there. I say mostly, because there may be some special cases to be 
aware of. For example, if there is a heavily multi-tenant topic that gets deleted 
(or a new subscribed topic gets created) you could have a ton of consumers 
rebalance all at the same time. So you will have (in the worst case) one 
coordinator that ends up having to write a ton of data to zookeeper in a short 
span of time; and if the state size is large that would make it worse. The 
actual write load on ZK is probably not an issue (since it should be no worse 
than what the old consumers would currently do - except that right now those 
writes are spread across multiple consumers as opposed to one or a few brokers 
with this proposal) but if you write enough then request handling latencies can 
go up a bit on the brokers. (Incidentally, we are reeling from a recent 
production incident that was caused by high request local times so it's fresh 
in my mind :) ) I'm not suggesting that ZK will not work - it's just that 
writes are much cheaper in Kafka than ZK; and reads can be made cheap by 
caching. So if it is not difficult to persist state in Kafka maybe it is worth 
buying some insurance now.

Also, one small correction to what I mentioned above - I don't think you need 
to support different value schemas for different group management use-cases 
because it will just be an _opaque_ byte array.

I don't quite agree that doing it in Kafka is not-so-simple mainly because most 
of what we need to do is already done in the offset manager so we could 
refactor that to make it more general. I also don't think that tooling or ops 
would suffer. Yes we would need to look up the coordinator for the group or 
consume the state topic, but that can be wrapped in a utility. We already do 
that for tools like the offset checker and it works fine. Being able to consume 
the state topic and quickly see what is going on globally is in fact a big plus 
(as opposed to doing an expensive traversal of zookeeper) - as an example, the 
[burrow|https://github.com/linkedin/Burrow] monitoring tool benefits greatly 
from that property. (cc [~toddpalino])

Finally, I should add that if we do it in ZK and then decide we want to abandon 
that and do it in Kafka that is also feasible - we can do something similar to 
what we did for dual-commit with consumer offsets. It will be much simpler to 
orchestrate since only the brokers need to turn on dual-write and then turn it 
off (as opposed to all of the consumers for offset management).

> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-14 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957348#comment-14957348
 ] 

Joel Koshy commented on KAFKA-2017:
---

If we reused this topic, then I think we could end up doing something like this:

* Bump up the key format version and add a new field: key-type (which could be 
either _offset_ or _consumer group data_)
* The actual key-bytes would be either the group-topic-partition schema (for 
keys with type _offset_) and just group (for keys with type _consumer group 
data_)
* If the key is of type _offset_ then the value will be the same as what we 
currently have for offsets
* If the key is of type _consumer group data_ then the value will be a new 
schema (for the consumer group data). Now we may end up needing to support 
different value schemas for different group management use-cases.

Overall, it can get a bit ugly so I'm wondering if the above indicates that it 
is cleaner to just use a separate topic for group state and refactor the offset 
manager to become a simple state store that can be easily repurposed for these 
use-cases (offset state storage and group state storage).
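
Something along these lines, just to illustrate the key encoding - the exact field 
layout below is made up, not an actual schema proposal:

{code}
// Illustrative key encoding if the offsets topic were reused for group state:
// a version bump plus a key-type discriminator. Field layout is hypothetical.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class StateTopicKey {
    public static final short KEY_VERSION = 2;     // bumped from the offset-only key version
    public static final byte TYPE_OFFSET = 0;      // key bytes: group + topic + partition
    public static final byte TYPE_GROUP_DATA = 1;  // key bytes: group only

    public static ByteBuffer offsetKey(String group, String topic, int partition) {
        byte[] g = group.getBytes(StandardCharsets.UTF_8);
        byte[] t = topic.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + 1 + 4 + g.length + 4 + t.length + 4);
        buf.putShort(KEY_VERSION).put(TYPE_OFFSET);
        buf.putInt(g.length).put(g).putInt(t.length).put(t).putInt(partition);
        buf.flip();
        return buf;
    }

    public static ByteBuffer groupDataKey(String group) {
        byte[] g = group.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + 1 + 4 + g.length);
        buf.putShort(KEY_VERSION).put(TYPE_GROUP_DATA);
        buf.putInt(g.length).put(g);
        buf.flip();
        return buf;
    }
}
{code}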


> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-14 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957204#comment-14957204
 ] 

Joel Koshy commented on KAFKA-2017:
---

I agree with those benefits. If we go that route then I would prefer compaction 
over other retention policies. You do lose some history, but _typically_ you 
would have the last couple of state entries available if you overpartition, and 
time-based or size-based retention would anyway keep only a certain amount of 
history. I think you can merge this in with the offsets topic (and rename it to 
something like __consumer_state). If we merge then we 
will end up having a heterogeneous topic with different keys 
(group-topic-partition for offsets and group for state) but that should be fine.

WRT implementation complexity that was referenced above: I agree it is more 
complicated to implement than ZK storage if we implement it from scratch but I 
don't think we need to right? i.e., all of the fault-tolerance and caching 
logic is already there in offset manager. So on coordinator failover the new 
coordinator just reads from the state partitions that it now leads and loads 
into memory (as we already do for offsets). Is there any other implementation 
complexity that I'm missing?
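
For example, the failover load could be as simple as this sketch (hypothetical types 
and names; the real code would just reuse the offset manager's loading path):

{code}
// Illustrative only: load a compacted state partition into memory on coordinator failover.
import java.util.HashMap;
import java.util.Map;

public final class GroupStateCache {
    private final Map<String, byte[]> stateByGroup = new HashMap<>();

    // Records would come from consuming the newly-led state partition from the beginning.
    public void load(Iterable<KeyValue> records) {
        for (KeyValue kv : records) {
            if (kv.value == null)
                stateByGroup.remove(kv.group);      // tombstone: the group's state was deleted
            else
                stateByGroup.put(kv.group, kv.value);
        }
    }

    public byte[] stateFor(String group) {
        return stateByGroup.get(group);
    }

    public static final class KeyValue {
        public final String group;
        public final byte[] value;
        public KeyValue(String group, byte[] value) { this.group = group; this.value = value; }
    }
}
{code}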

> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2017) Persist Coordinator State for Coordinator Failover

2015-10-13 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956108#comment-14956108
 ] 

Joel Koshy commented on KAFKA-2017:
---

In offline threads we were going to discuss the options of Kafka-based storage 
vs zookeeper for persisting state. I think it would be beneficial to finish 
that discussion either here or in the wiki or mailing list before doing this 
exclusively in ZK for 0.9 right?

> Persist Coordinator State for Coordinator Failover
> --
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Onur Karaman
>Assignee: Guozhang Wang
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2567) throttle-time shouldn't be NaN

2015-10-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2567.
---
Resolution: Fixed

Issue resolved by pull request 213
[https://github.com/apache/kafka/pull/213]

> throttle-time shouldn't be NaN
> --
>
> Key: KAFKA-2567
> URL: https://issues.apache.org/jira/browse/KAFKA-2567
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Assignee: Aditya Auradkar
>Priority: Minor
>  Labels: quotas
> Fix For: 0.9.0.0
>
>
> Currently, if throttling never happens, we get the NaN for throttle-time. It 
> seems it's better to default to 0.
> "kafka.server:client-id=eventsimgroup200343,type=Fetch" : { "byte-rate": 0.0, 
> "throttle-time": NaN }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2443) Expose windowSize on Rate

2015-10-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2443.
---
   Resolution: Fixed
Fix Version/s: 0.9.0.0

Issue resolved by pull request 213
[https://github.com/apache/kafka/pull/213]

> Expose windowSize on Rate
> -
>
> Key: KAFKA-2443
> URL: https://issues.apache.org/jira/browse/KAFKA-2443
> Project: Kafka
>  Issue Type: Task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>  Labels: quotas
> Fix For: 0.9.0.0
>
>
> Currently, we don't have a means to measure the size of the metric window 
> since the final sample can be incomplete.
> Expose windowSize(now) on Rate



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2419) Allow certain Sensors to be garbage collected after inactivity

2015-10-07 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2419.
---
Resolution: Fixed

> Allow certain Sensors to be garbage collected after inactivity
> --
>
> Key: KAFKA-2419
> URL: https://issues.apache.org/jira/browse/KAFKA-2419
> Project: Kafka
>  Issue Type: New Feature
>Affects Versions: 0.9.0.0
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>Priority: Blocker
>  Labels: quotas
> Fix For: 0.9.0.0
>
>
> Currently, metrics cannot be removed once registered. 
> Implement a feature to remove certain sensors after a certain period of 
> inactivity (perhaps configurable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)

2015-10-05 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943698#comment-14943698
 ] 

Joel Koshy commented on KAFKA-2580:
---

I was wondering if we can do with something much simpler - basically close out 
file handles if they haven't been accessed after "x" minutes. The file-handle 
cache approach has some benefit over this in that it may allow you to close out 
unused file handles quicker than the other approach, but in both cases you have 
to account for the worst case scenario - which is the worst case expected 
number of bootstrapping consumers * number of segments in the logs that they 
consume from. Log recovery is another scenario where we may need to open 
several logs over a short span of time but I think that can be addressed by 
closing out segments immediately after scanning them during recovery. That 
said, I'm not very clear on how useful all of this is - maybe that's because I 
don't do Kafka operations on a day to day basis :) [~toddpalino] what do you 
think?
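
To illustrate the simpler approach (a hedged sketch with made-up names; the broker's 
actual segment handling is obviously more involved):

{code}
// Illustrative sketch: close segment file handles that have been idle for more than x minutes.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class IdleHandleCloser {
    private static final class Entry {
        final FileChannel channel;
        volatile long lastAccessMs;
        Entry(FileChannel channel, long now) { this.channel = channel; this.lastAccessMs = now; }
    }

    private final Map<String, Entry> open = new ConcurrentHashMap<>();
    private final long maxIdleMs;

    public IdleHandleCloser(long maxIdleMs) { this.maxIdleMs = maxIdleMs; }

    // Record an access to a segment's file handle.
    public void touch(String segmentPath, FileChannel channel) {
        open.computeIfAbsent(segmentPath, p -> new Entry(channel, System.currentTimeMillis()))
            .lastAccessMs = System.currentTimeMillis();
    }

    // Run periodically to close handles idle longer than maxIdleMs; reopen lazily on next access.
    public void closeIdle() throws IOException {
        long now = System.currentTimeMillis();
        for (Iterator<Map.Entry<String, Entry>> it = open.entrySet().iterator(); it.hasNext(); ) {
            Entry e = it.next().getValue();
            if (now - e.lastAccessMs > maxIdleMs) {
                e.channel.close();
                it.remove();
            }
        }
    }
}
{code}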

> Kafka Broker keeps file handles open for all log files (even if its not 
> written to/read from)
> -
>
> Key: KAFKA-2580
> URL: https://issues.apache.org/jira/browse/KAFKA-2580
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.2.1
>Reporter: Vinoth Chandar
>
> We noticed this in one of our clusters where we stage logs for a longer 
> amount of time. It appears that the Kafka broker keeps file handles open even 
> for non active (not written to or read from) files. (in fact, there are some 
> threads going back to 2013 
> http://grokbase.com/t/kafka/users/132p65qwcn/keeping-logs-forever) 
> Needless to say, this is a problem and forces us to either artificially bump 
> up ulimit (it's already at 100K) or expand the cluster (even if we have 
> sufficient IO and everything). 
> Filing this ticket, since I couldn't find anything similar. Very interested to 
> know if there are plans to address this (given how Samza's changelog topic is 
> meant to be a persistent large state use case).  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2584) SecurityProtocol enum validation should be removed or relaxed for non-config usages

2015-10-05 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943694#comment-14943694
 ] 

Joel Koshy commented on KAFKA-2584:
---

Here is a quick summary of the patch we ended up applying at LinkedIn:
* In the implementation of {{EndPoint.createEndPoint(listener)}}: catch 
{{IllegalArgumentException}} (due to enum validation) and throw a new 
{{UnknownEndpointException}}.
* {{Broker.createBroker}} catches {{UnknownEndpointException}} and just filters 
out those endpoints.
* Config validation will still fail on unknown endpoints (since we don't catch 
the exception there).

We felt the enum validation is useful for config validation. Hopefully the 
above is clear enough, but we can upload a small PR as well. It is not 
necessarily the best/only option, but we can discuss alternatives.
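
To illustrate the shape of the change (the actual patch is in Scala; the Java below is 
only a sketch and the helper names are hypothetical):

{code}
// Illustrative Java sketch of the hot-fix described above (not the real broker code).
import java.util.ArrayList;
import java.util.List;

public final class EndpointParsing {
    public static final class UnknownEndpointException extends RuntimeException {
        public UnknownEndpointException(String msg, Throwable cause) { super(msg, cause); }
    }

    // EndPoint.createEndPoint: wrap the enum-validation failure in a dedicated exception.
    public static EndPoint createEndPoint(String listener) {
        try {
            return parseEndPoint(listener);            // may throw IllegalArgumentException
        } catch (IllegalArgumentException e) {
            throw new UnknownEndpointException("Unknown security protocol in " + listener, e);
        }
    }

    // Broker.createBroker: drop endpoints this (possibly older) jar does not understand.
    public static List<EndPoint> knownEndpoints(List<String> listeners) {
        List<EndPoint> endpoints = new ArrayList<>();
        for (String listener : listeners) {
            try {
                endpoints.add(createEndPoint(listener));
            } catch (UnknownEndpointException e) {
                // filtered out; config validation elsewhere still fails fast on unknown protocols
            }
        }
        return endpoints;
    }

    // Placeholders so the sketch is self-contained.
    public static final class EndPoint { public final String listener; EndPoint(String l) { listener = l; } }
    private static EndPoint parseEndPoint(String listener) {
        if (!listener.startsWith("PLAINTEXT://") && !listener.startsWith("SSL://"))
            throw new IllegalArgumentException("No enum constant for listener: " + listener);
        return new EndPoint(listener);
    }
}
{code}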

> SecurityProtocol enum validation should be removed or relaxed for non-config 
> usages
> ---
>
> Key: KAFKA-2584
> URL: https://issues.apache.org/jira/browse/KAFKA-2584
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
> Fix For: 0.9.0.0
>
>
> While deploying SSL to our clusters, we had to roll back due to another 
> compatibility issue similar to what we mentioned in passing in other 
> threads/KIP hangouts. i.e., picking up jars between official releases. 
> Fortunately, there is an easy server-side hot-fix we can do internally to 
> work around it. However, I would classify the issue below as a bug since 
> there is little point in doing endpoint type validation (except for config 
> validation).
> What happened here is that some (old) consumers (that do not care about SSL) 
> picked up a Kafka jar that understood multiple endpoints but did not have the 
> SSL feature. The rebalance fails because while creating the Broker objects we 
> are forced to validate all the endpoints.
> Yes the old consumer is going away, but this would affect tools as well. The 
> same issue could also happen on the brokers if we were to upgrade them to 
> include (say) a Kerberos endpoint. So the old brokers would not be able to 
> read the registration of newly upgraded brokers. Well you could get around 
> that by doing two rounds of deployment (one to get the new code, and another 
> to expose the Kerberos endpoint) but that’s inconvenient and I think 
> unnecessary. Although validation makes sense for configs, I think the current 
> validate everywhere is overkill. (i.e., an old consumer, tool or broker 
> should not complain because another broker can talk more protocols.)
> {noformat}
> kafka.common.KafkaException: Failed to parse the broker info from zookeeper: 
> {"jmx_port":-1,"timestamp":"1442952770627","endpoints":["PLAINTEXT://:","SSL://:"],"host”:”","version":2,"port”:}
> at kafka.cluster.Broker$.createBroker(Broker.scala:61)
> at kafka.utils.ZkUtils$$anonfun$getCluster$1.apply(ZkUtils.scala:520)
> at kafka.utils.ZkUtils$$anonfun$getCluster$1.apply(ZkUtils.scala:518)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at kafka.utils.ZkUtils$.getCluster(ZkUtils.scala:518)
> at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener
> ...
> Caused by: java.lang.IllegalArgumentException: No enum constant 
> org.apache.kafka.common.protocol.SecurityProtocol.SSL
> at java.lang.Enum.valueOf(Enum.java:238)
> at 
> org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
> at kafka.cluster.EndPoint$.createEndPoint(EndPoint.scala:48)
> at kafka.cluster.Broker$$anonfun$1.apply(Broker.scala:74)
> at kafka.cluster.Broker$$anonfun$1.apply(Broker.scala:73)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at kafka.cluster.Broker$.createBroker(Broker.scala:73)
> ... 70 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-29 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2120:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1 and committed to trunk.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch, 
> KAFKA-2120_2015-09-18_19:27:48.patch, KAFKA-2120_2015-09-28_16:13:02.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)

2015-09-28 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933781#comment-14933781
 ] 

Joel Koshy commented on KAFKA-2580:
---

[~guozhang] actually it is now at 400k.
[~vinothchandar] yes that can be done. I think there may be a jira open as well 
but I couldn't find it. I vaguely recollect discussing a LRU-like file-handle 
cache with someone - it could have been a jira or just mailing list. It just 
hasn't been a particularly pressing concern so far.

> Kafka Broker keeps file handles open for all log files (even if its not 
> written to/read from)
> -
>
> Key: KAFKA-2580
> URL: https://issues.apache.org/jira/browse/KAFKA-2580
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.2.1
>Reporter: Vinoth Chandar
>
> We noticed this in one of our clusters where we stage logs for a longer 
> amount of time. It appears that the Kafka broker keeps file handles open even 
> for non active (not written to or read from) files. (in fact, there are some 
> threads going back to 2013 
> http://grokbase.com/t/kafka/users/132p65qwcn/keeping-logs-forever) 
> Needless to say, this is a problem and forces us to either artificially bump 
> up ulimit (it's already at 100K) or expand the cluster (even if we have 
> sufficient IO and everything). 
> Filing this ticket, since I couldn't find anything similar. Very interested to 
> know if there are plans to address this (given how Samza's changelog topic is 
> meant to be a persistent large state use case).  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2584) SecurityProtocol enum validation should be removed or relaxed for non-config usages

2015-09-25 Thread Joel Koshy (JIRA)
Joel Koshy created KAFKA-2584:
-

 Summary: SecurityProtocol enum validation should be removed or 
relaxed for non-config usages
 Key: KAFKA-2584
 URL: https://issues.apache.org/jira/browse/KAFKA-2584
 Project: Kafka
  Issue Type: Bug
Reporter: Joel Koshy
 Fix For: 0.9.0.0


While deploying SSL to our clusters, we had to roll back due to another 
compatibility issue similar to what we mentioned in passing in other 
threads/KIP hangouts. i.e., picking up jars between official releases. 
Fortunately, there is an easy server-side hot-fix we can do internally to work 
around it. However, I would classify the issue below as a bug since there is 
little point in doing endpoint type validation (except for config validation).

What happened here is that some (old) consumers (that do not care about SSL) 
picked up a Kafka jar that understood multiple endpoints but did not have the 
SSL feature. The rebalance fails because while creating the Broker objects we 
are forced to validate all the endpoints.

Yes the old consumer is going away, but this would affect tools as well. The 
same issue could also happen on the brokers if we were to upgrade them to 
include (say) a Kerberos endpoint. So the old brokers would not be able to read 
the registration of newly upgraded brokers. Well you could get around that by 
doing two rounds of deployment (one to get the new code, and another to expose 
the Kerberos endpoint) but that’s inconvenient and I think unnecessary. 
Although validation makes sense for configs, I think the current validate 
everywhere is overkill. (i.e., an old consumer, tool or broker should not 
complain because another broker can talk more protocols.)

{noformat}
kafka.common.KafkaException: Failed to parse the broker info from zookeeper: 
{"jmx_port":-1,"timestamp":"1442952770627","endpoints":["PLAINTEXT://:","SSL://:"],"host”:”","version":2,"port”:}
at kafka.cluster.Broker$.createBroker(Broker.scala:61)
at kafka.utils.ZkUtils$$anonfun$getCluster$1.apply(ZkUtils.scala:520)
at kafka.utils.ZkUtils$$anonfun$getCluster$1.apply(ZkUtils.scala:518)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.utils.ZkUtils$.getCluster(ZkUtils.scala:518)
at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener
...
Caused by: java.lang.IllegalArgumentException: No enum constant 
org.apache.kafka.common.protocol.SecurityProtocol.SSL
at java.lang.Enum.valueOf(Enum.java:238)
at 
org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
at kafka.cluster.EndPoint$.createEndPoint(EndPoint.scala:48)
at kafka.cluster.Broker$$anonfun$1.apply(Broker.scala:74)
at kafka.cluster.Broker$$anonfun$1.apply(Broker.scala:73)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at kafka.cluster.Broker$.createBroker(Broker.scala:73)
... 70 more
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-1911) Log deletion on stopping replicas should be async

2015-09-22 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903046#comment-14903046
 ] 

Joel Koshy edited comment on KAFKA-1911 at 9/22/15 5:34 PM:


The original motivation in this ticket was to avoid a high latency request from 
tying up request handlers. However, while thinking through some nuances of 
delete topic, I think delete topic would also benefit from this. Since 
stop-replica-requests can take a while to finish delete topic can also take a 
while (apart from failure cases such as a replica being down).

I think the easiest way to fix this would be to just rename the partition 
directory from <topic>-<partition> to something like <topic>-<partition>.<uuid>.deleted 
and asynchronously delete that. The <uuid> is probably needed if a user were 
to delete and recreate multiple times in rapid fire for whatever reason.


was (Author: jjkoshy):
The original motivation in this ticket was to avoid a high latency request from 
tying up request handlers. However, while thinking through some nuances of 
delete topic, I think delete topic would also benefit from this. Since 
stop-replica-requests can take a while to finish delete topic can also take a 
while (apart from failure cases such as a replica being down).

I think the easiest way to fix this would be to just rename the partition 
directory from <topic>-<partition> to something like 
<topic>-<partition>-deleted-<uuid> and asynchronously delete that. The <uuid> is 
probably needed if a user were to delete and recreate multiple times in rapid 
fire for whatever reason.

> Log deletion on stopping replicas should be async
> -
>
> Key: KAFKA-1911
> URL: https://issues.apache.org/jira/browse/KAFKA-1911
> Project: Kafka
>  Issue Type: Bug
>  Components: log, replication
>Reporter: Joel Koshy
>Assignee: Geoff Anderson
>  Labels: newbie++
>
> If a StopReplicaRequest sets delete=true then we do a file.delete on the file 
> message sets. I was under the impression that this is fast but it does not 
> seem to be the case.
> On a partition reassignment in our cluster the local time for stop replica 
> took nearly 30 seconds.
> {noformat}
> Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; 
> ClientId: ;DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 
> 53 from 
> client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
> {noformat}
> This ties up one API thread for the duration of the request.
> Specifically in our case, the queue times for other requests also went up and 
> producers to the partition that was just deleted on the old leader took a 
> while to refresh their metadata (see KAFKA-1303) and eventually ran out of 
> retries on some messages leading to data loss.
> I think the log deletion in this case should be fully asynchronous although 
> we need to handle the case when a broker may respond immediately to the 
> stop-replica-request but then go down after deleting only some of the log 
> segments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1911) Log deletion on stopping replicas should be async

2015-09-22 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903046#comment-14903046
 ] 

Joel Koshy commented on KAFKA-1911:
---

The original motivation in this ticket was to avoid a high latency request from 
tying up request handlers. However, while thinking through some nuances of 
delete topic, I think delete topic would also benefit from this. Since 
stop-replica-requests can take a while to finish delete topic can also take a 
while (apart from failure cases such as a replica being down).

I think the easiest way to fix this would be to just rename the partition 
directory from <topic>-<partition> to something like 
<topic>-<partition>-deleted-<uuid> and asynchronously delete that. The <uuid> is 
probably needed if a user were to delete and recreate multiple times in rapid 
fire for whatever reason.
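
To illustrate (the directory suffix and scheduler usage below are made up, not actual 
broker code):

{code}
// Illustrative sketch: rename the partition directory and delete it asynchronously.
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public final class AsyncLogDeleter {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called from the stop-replica path: the rename is cheap, so the request returns quickly.
    public void scheduleDelete(Path partitionDir) throws IOException {
        Path renamed = partitionDir.resolveSibling(
                partitionDir.getFileName() + "." + UUID.randomUUID() + ".deleted");
        Files.move(partitionDir, renamed, StandardCopyOption.ATOMIC_MOVE);
        scheduler.submit(() -> deleteRecursively(renamed));
    }

    private static void deleteRecursively(Path dir) {
        try {
            Files.walk(dir)
                 .sorted(Comparator.reverseOrder())   // delete children before parents
                 .forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            // leftover "*.deleted" directories can simply be removed again on startup
        }
    }
}
{code}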

> Log deletion on stopping replicas should be async
> -
>
> Key: KAFKA-1911
> URL: https://issues.apache.org/jira/browse/KAFKA-1911
> Project: Kafka
>  Issue Type: Bug
>  Components: log, replication
>Reporter: Joel Koshy
>Assignee: Geoff Anderson
>  Labels: newbie++
>
> If a StopReplicaRequest sets delete=true then we do a file.delete on the file 
> message sets. I was under the impression that this is fast but it does not 
> seem to be the case.
> On a partition reassignment in our cluster the local time for stop replica 
> took nearly 30 seconds.
> {noformat}
> Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; 
> ClientId: ;DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 
> 53 from 
> client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
> {noformat}
> This ties up one API thread for the duration of the request.
> Specifically in our case, the queue times for other requests also went up and 
> producers to the partition that was just deleted on the old leader took a 
> while to refresh their metadata (see KAFKA-1303) and eventually ran out of 
> retries on some messages leading to data loss.
> I think the log deletion in this case should be fully asynchronous although 
> we need to handle the case when a broker may respond immediately to the 
> stop-replica-request but then go down after deleting only some of the log 
> segments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2438) add maxParallelForks to build.gradle to speedup tests

2015-09-18 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876545#comment-14876545
 ] 

Joel Koshy commented on KAFKA-2438:
---

FYI, I find that this patch appears to exacerbate timing-related test failures 
with JDK 7 on my linux machine - not on JDK 8 though. (cc [~ijuma] since I made 
mention of this in KAFKA-2120)

> add maxParallelForks to build.gradle to speedup tests
> -
>
> Key: KAFKA-2438
> URL: https://issues.apache.org/jira/browse/KAFKA-2438
> Project: Kafka
>  Issue Type: Bug
>Reporter: Sriharsha Chintalapani
>Assignee: Sriharsha Chintalapani
> Fix For: 0.9.0.0
>
>
> With current trunk unit tests on my machine takes 16+ mins and with this 
> patch runs about 6mins. Tested on OS X and linux.
> Before 
> {code}
> Total time: 18 mins 29.806 secs
> {code}
> After
> {code}
> BUILD SUCCESSFUL
> Total time: 5 mins 37.194 secs
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2419) Allow certain Sensors to be garbage collected after inactivity

2015-09-18 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875986#comment-14875986
 ] 

Joel Koshy commented on KAFKA-2419:
---

However, we should think through the effect of doing this. i.e., consider 
topic-level metrics that are only updated when there is active traffic to/from 
a topic. So for e.g., if there are topics with intermittent pushes of data 
(e.g., from hadoop to Kafka) you may delete those metrics which would 
(depending on what monitoring framework you use) typically manifest in gaps - 
so it will be hard to tell if it is an issue with the monitoring framework or 
whether that topic just got deleted - likewise for quotas. That won't occur 
with explicit deletion. So maybe we should go with (2) or (3) unless the 
caveats of (1) (which is the simplest approach) are not a major concern.
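
For reference, option (1) amounts to something like this (hypothetical names, not the 
actual metrics API) - and the expiration check is exactly what produces those gaps:

{code}
// Illustrative sketch of option (1): expire a sensor after a period of inactivity.
public final class ExpiringSensor {
    private final String name;
    private final long inactivityTimeoutMs;
    private volatile long lastRecordMs;

    public ExpiringSensor(String name, long inactivityTimeoutMs) {
        this.name = name;
        this.inactivityTimeoutMs = inactivityTimeoutMs;
        this.lastRecordMs = System.currentTimeMillis();
    }

    public void record(double value) {
        lastRecordMs = System.currentTimeMillis();
        // ... update the underlying stats ...
    }

    // A periodic reaper would remove sensors for which this returns true, so a topic
    // with intermittent traffic can lose its metrics between pushes.
    public boolean hasExpired(long nowMs) {
        return nowMs - lastRecordMs > inactivityTimeoutMs;
    }

    public String name() { return name; }
}
{code}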

> Allow certain Sensors to be garbage collected after inactivity
> --
>
> Key: KAFKA-2419
> URL: https://issues.apache.org/jira/browse/KAFKA-2419
> Project: Kafka
>  Issue Type: New Feature
>Affects Versions: 0.9.0.0
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>Priority: Blocker
>  Labels: quotas
> Fix For: 0.9.0.0
>
>
> Currently, metrics cannot be removed once registered. 
> Implement a feature to remove certain sensors after a certain period of 
> inactivity (perhaps configurable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2419) Allow certain Sensors to be garbage collected after inactivity

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804893#comment-14804893
 ] 

Joel Koshy commented on KAFKA-2419:
---

Re: the options in 
https://issues.apache.org/jira/browse/KAFKA-2419?focusedCommentId=14737756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737756

I'm not really in favor of (3). Between (1) and (2) I'm leaning more towards 
(1) as we probably want to do something similar for other (non-quota) metrics - 
such as per-topic-metrics that should go away after a topic is deleted.

> Allow certain Sensors to be garbage collected after inactivity
> --
>
> Key: KAFKA-2419
> URL: https://issues.apache.org/jira/browse/KAFKA-2419
> Project: Kafka
>  Issue Type: New Feature
>Affects Versions: 0.9.0.0
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>Priority: Blocker
>  Labels: quotas
> Fix For: 0.9.0.0
>
>
> Currently, metrics cannot be removed once registered. 
> Implement a feature to remove certain sensors after a certain period of 
> inactivity (perhaps configurable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1215) Rack-Aware replica assignment option

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804835#comment-14804835
 ] 

Joel Koshy commented on KAFKA-1215:
---

[~allenxwang] you should have access now.

> Rack-Aware replica assignment option
> 
>
> Key: KAFKA-1215
> URL: https://issues.apache.org/jira/browse/KAFKA-1215
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.8.0
>Reporter: Joris Van Remoortere
>Assignee: Jun Rao
> Fix For: 0.10.0.0
>
> Attachments: rack_aware_replica_assignment_v1.patch, 
> rack_aware_replica_assignment_v2.patch
>
>
> Adding a rack-id to kafka config. This rack-id can be used during replica 
> assignment by using the max-rack-replication argument in the admin scripts 
> (create topic, etc.). By default the original replication assignment 
> algorithm is used because max-rack-replication defaults to -1. 
> max-rack-replication > -1 is not honored if you are doing manual replica 
> assignment (preffered).
> If this looks good I can add some test cases specific to the rack-aware 
> assignment.
> I can also port this to trunk. We are currently running 0.8.0 in production 
> and need this, so i wrote the patch against that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2552) Certain admin commands such as partition assignment fail on large clusters

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804600#comment-14804600
 ] 

Joel Koshy commented on KAFKA-2552:
---

Reported in KAFKA-1599 as well?

> Certain admin commands such as partition assignment fail on large clusters
> --
>
> Key: KAFKA-2552
> URL: https://issues.apache.org/jira/browse/KAFKA-2552
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Abhishek Nigam
>Assignee: Abhishek Nigam
>
> This happens because the json generated is greater than 1 MB and exceeds the 
> default data limit of zookeeper nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804570#comment-14804570
 ] 

Joel Koshy commented on KAFKA-2120:
---

Actually some of the confusion on my side is because I was running tests with 
JDK 7u51 which has been failing consistently for a while now (wonder what's up 
with that) - with JDK 8 the previous revision does seem to go through unit 
tests without failures. So in the interest of keeping trunk somewhat stable I 
have reverted the patch for now - sorry about the inconvenience.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803291#comment-14803291
 ] 

Joel Koshy commented on KAFKA-2120:
---

I see test failures on b7d4043d8da1d296519faf321e9f0f7188d7d67a as well

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-17 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803251#comment-14803251
 ] 

Joel Koshy commented on KAFKA-2120:
---

[~mgharat] yes I'm seeing the same thing. i.e., I ran tests against the 
revision just before your patch and I'm seeing various failures as well as 
hanging tests.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791284#comment-14791284
 ] 

Joel Koshy commented on KAFKA-2120:
---

Hmm.. actually after rerunning tests I'm not sure this patch is not to blame. 
[~mgharat] can you also look into this? I'm going to do a few more runs with an 
earlier revision, but if it appears that this patch is making things worse I 
would be in favor of reverting if we cannot figure it out by EOD today.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-16 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy reopened KAFKA-2120:
---

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791049#comment-14791049
 ] 

Joel Koshy commented on KAFKA-2120:
---

Yes I did note failures in the RB, but those were on trunk as well and 
transient on my laptop. I believe the failures are Jenkins slowness, but I'm 
going to rerun tests locally to verify.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-16 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2120:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks [~mgharat] for the patch, and thanks to everyone who helped with reviews.

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch, KAFKA-2120_2015-09-10_21:38:55.patch, 
> KAFKA-2120_2015-09-11_14:54:15.patch, KAFKA-2120_2015-09-15_18:57:20.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2529) Brokers should write current version to log when they first start

2015-09-10 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739774#comment-14739774
 ] 

Joel Koshy commented on KAFKA-2529:
---

Yes, this was done in KAFKA-1901, but one issue is that the above is logged at 
start-up, and logs will eventually roll (and are typically garbage collected). I 
thought of this in KAFKA-1901 but figured it should be fine since there are 
multiple ways right now to get the version:
* via JMX
* Use {{jar xf ...}}
* Logs (if still available)

One easy way to take care of this would be to just have a scheduled task (say, 
every minute or so) that logs the version, but some may consider that to be 
annoying.
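
i.e., something as simple as this (illustrative only; the version lookup and logger 
here are hypothetical):

{code}
// Illustrative sketch: periodically log the broker version so it survives log rolling.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class VersionLogger {
    public static void start(final String version, long intervalMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("Kafka version: " + version),  // would use the broker logger
                0, intervalMinutes, TimeUnit.MINUTES);
    }
}
{code}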

> Brokers should write current version to log when they first start
> -
>
> Key: KAFKA-2529
> URL: https://issues.apache.org/jira/browse/KAFKA-2529
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>
> It is currently non-trivial to tell, by looking at log files, which version 
> of Kafka is the log from. 
> Having this information can be useful in some troubleshooting scenarios. We 
> are exposing this via JMX, but since troubleshooting usually involves asking 
> for logs, it will be nice if this information will be included there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2120) Add a request timeout to NetworkClient

2015-09-10 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739766#comment-14739766
 ] 

Joel Koshy commented on KAFKA-2120:
---

Thanks for the updated patch - I will review again later today

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
>Priority: Blocker
> Fix For: 0.8.3
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch, KAFKA-2120_2015-09-03_15:12:02.patch, 
> KAFKA-2120_2015-09-04_17:49:01.patch, KAFKA-2120_2015-09-09_16:45:44.patch, 
> KAFKA-2120_2015-09-09_18:56:18.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request due to reasons such as broker is down, 
> the request will never be completed.
> Request timeout will also be used as implicit timeout for some methods such 
> as KafkaProducer.flush() and kafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2437) Controller does not handle zk node deletion correctly.

2015-09-02 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2437.
---
   Resolution: Fixed
Fix Version/s: 0.8.3

> Controller does not handle zk node deletion correctly.
> --
>
> Key: KAFKA-2437
> URL: https://issues.apache.org/jira/browse/KAFKA-2437
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.8.3
>
>
> We see this issue occasionally. The symptom is that when /controller path got 
> deleted, the old controller does not resign so we end up having more than one 
> controller in the cluster (although the requests from controller with old 
> epoch will not be accepted). After checking zookeeper watcher by using wchp, 
> it looks the zookeeper session who created the /controller path does not have 
> a watcher on /controller. That causes the old controller not resigning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2502) Quotas documentation for 0.8.3

2015-09-02 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2502:
--
Description: 
Complete quotas documentation

Also, 
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol 
needs to be updated with protocol changes introduced in KAFKA-2136

  was:Complete quotas documentation


> Quotas documentation for 0.8.3
> --
>
> Key: KAFKA-2502
> URL: https://issues.apache.org/jira/browse/KAFKA-2502
> Project: Kafka
>  Issue Type: Task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>Priority: Blocker
>  Labels: quotas
> Fix For: 0.8.3
>
>
> Complete quotas documentation
> Also, 
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
>  needs to be updated with protocol changes introduced in KAFKA-2136



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2332) Add quota metrics to old producer and consumer

2015-09-01 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2332:
--
   Resolution: Fixed
Fix Version/s: 0.8.3
   Status: Resolved  (was: Patch Available)

Issue resolved by pull request 176
[https://github.com/apache/kafka/pull/176]

> Add quota metrics to old producer and consumer
> --
>
> Key: KAFKA-2332
> URL: https://issues.apache.org/jira/browse/KAFKA-2332
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Dong Lin
>  Labels: quotas
> Fix For: 0.8.3
>
> Attachments: KAFKA-2332.patch, KAFKA-2332.patch, 
> KAFKA-2332_2015-08-03_18:22:53.patch
>
>
> Quota metrics have only been added to the new producer and consumer. It may 
> be beneficial to add them to the old producer and consumer as well, for 
> clients using the older versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2120) Add a request timeout to NetworkClient

2015-08-26 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2120:
--
Status: In Progress  (was: Patch Available)

> Add a request timeout to NetworkClient
> --
>
> Key: KAFKA-2120
> URL: https://issues.apache.org/jira/browse/KAFKA-2120
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jiangjie Qin
>Assignee: Mayuresh Gharat
> Fix For: 0.8.3
>
> Attachments: KAFKA-2120.patch, KAFKA-2120_2015-07-27_15:31:19.patch, 
> KAFKA-2120_2015-07-29_15:57:02.patch, KAFKA-2120_2015-08-10_19:55:18.patch, 
> KAFKA-2120_2015-08-12_10:59:09.patch
>
>
> Currently NetworkClient does not have a timeout setting for requests. So if 
> no response is received for a request, for reasons such as the broker being down, 
> the request will never be completed.
> The request timeout will also be used as an implicit timeout for some methods such 
> as KafkaProducer.flush() and KafkaProducer.close().
> KIP-19 is created for this public interface change.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2136) Client side protocol changes to return quota delays

2015-08-25 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712324#comment-14712324
 ] 

Joel Koshy commented on KAFKA-2136:
---

Thanks for the patches - pushed to trunk. Can you update:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

Can you also file a quotas documentation ticket and set its fix-version to 
0.8.3?

We can close this ticket after that.

> Client side protocol changes to return quota delays
> ---
>
> Key: KAFKA-2136
> URL: https://issues.apache.org/jira/browse/KAFKA-2136
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>  Labels: quotas
> Attachments: KAFKA-2136.patch, KAFKA-2136_2015-05-06_18:32:48.patch, 
> KAFKA-2136_2015-05-06_18:35:54.patch, KAFKA-2136_2015-05-11_14:50:56.patch, 
> KAFKA-2136_2015-05-12_14:40:44.patch, KAFKA-2136_2015-06-09_10:07:13.patch, 
> KAFKA-2136_2015-06-09_10:10:25.patch, KAFKA-2136_2015-06-30_19:43:55.patch, 
> KAFKA-2136_2015-07-13_13:34:03.patch, KAFKA-2136_2015-08-18_13:19:57.patch, 
> KAFKA-2136_2015-08-18_13:24:00.patch, KAFKA-2136_2015-08-21_16:29:17.patch, 
> KAFKA-2136_2015-08-24_10:33:10.patch, KAFKA-2136_2015-08-25_11:29:52.patch
>
>
> As described in KIP-13, evolve the protocol to return a throttle_time_ms in 
> the Fetch and Produce response objects. Add client-side metrics on the new 
> producer and consumer to expose the delay time.
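
A minimal sketch of how an application might read the resulting delay on the producer 
side, assuming the client-side metrics land in the producer's metrics() map under names 
containing "throttle-time" (the exact metric names are an assumption, not taken from 
this ticket):

{code}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import java.util.Map;
import java.util.Properties;

public class ThrottleTimeMetrics {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka.example.com:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Scan the producer's metrics for anything reporting throttle time.
      for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
        if (entry.getKey().name().contains("throttle-time")) {
          System.out.println(entry.getKey().name() + " = " + entry.getValue().value());
        }
      }
    }
  }
}
{code}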



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2351) Brokers are having a problem shutting down correctly

2015-08-25 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2351:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk.

> Brokers are having a problem shutting down correctly
> --
>
> Key: KAFKA-2351
> URL: https://issues.apache.org/jira/browse/KAFKA-2351
> Project: Kafka
>  Issue Type: Bug
>Reporter: Mayuresh Gharat
>Assignee: Mayuresh Gharat
> Attachments: KAFKA-2351.patch, KAFKA-2351_2015-07-21_14:58:13.patch, 
> KAFKA-2351_2015-07-23_21:36:52.patch, KAFKA-2351_2015-08-13_13:10:05.patch, 
> KAFKA-2351_2015-08-24_15:50:41.patch
>
>
> During shutdown, run() in Acceptor might throw an exception that is not 
> caught, so it never reaches shutdownComplete; as a result the latch is not 
> counted down and the broker is not able to shut down.
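
A stripped-down sketch of the general pattern behind such a fix, not the actual 
Acceptor code from the patch: count the shutdown latch down in a finally block so 
that an unexpected exception in run() cannot leave the caller of awaitShutdown() 
blocked forever.

{code}
import java.util.concurrent.CountDownLatch;

public class AcceptorLikeThread implements Runnable {
  private final CountDownLatch shutdownLatch = new CountDownLatch(1);

  @Override
  public void run() {
    try {
      // The accept loop may throw unexpectedly, e.g. during shutdown.
      acceptLoop();
    } catch (Exception e) {
      System.err.println("Acceptor loop failed: " + e);
    } finally {
      // Always signal completion, even if the loop threw, so awaitShutdown()
      // never blocks forever.
      shutdownLatch.countDown();
    }
  }

  public void awaitShutdown() throws InterruptedException {
    shutdownLatch.await();
  }

  private void acceptLoop() throws Exception {
    // placeholder for the real socket-accept loop
  }
}
{code}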



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2442) QuotasTest should not fail when cpu is busy

2015-08-21 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2442:
--
Reviewer: Joel Koshy

will do

> QuotasTest should not fail when cpu is busy
> ---
>
> Key: KAFKA-2442
> URL: https://issues.apache.org/jira/browse/KAFKA-2442
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Aditya Auradkar
> Fix For: 0.8.3
>
>
> We observed that testThrottledProducerConsumer in QuotasTest may fail or 
> succeed randomly. It appears that the test may fail when the system is slow. 
> We can add a timer in the integration test to avoid random failures.
> See an example failure at 
> https://builds.apache.org/job/kafka-trunk-git-pr/166/console for patch 
> https://github.com/apache/kafka/pull/142.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2454) Dead lock between delete log segment and shutting down.

2015-08-21 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707091#comment-14707091
 ] 

Joel Koshy commented on KAFKA-2454:
---

Thanks - got it. Will take a look at your patch today.

> Dead lock between delete log segment and shutting down.
> ---
>
> Key: KAFKA-2454
> URL: https://issues.apache.org/jira/browse/KAFKA-2454
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>
> When the broker shuts down, it shuts down the scheduler, which grabs the 
> scheduler lock and then waits for all the threads in the scheduler to shut down.
> The deadlock happens when a scheduled task tries to delete an old log 
> segment: it schedules a log-delete task which also needs to acquire the 
> scheduler lock. In this case the shutdown thread holds the scheduler lock and 
> waits for the log deletion thread to finish, but the log deletion thread 
> blocks waiting for the scheduler lock.
> Related stack trace:
> {noformat}
> "Thread-1" #21 prio=5 os_prio=0 tid=0x7fe7601a7000 nid=0x1a4e waiting on 
> condition [0x7fe7cf698000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000640d53540> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
> at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:94)
> - locked <0x000640b6d480> (a kafka.utils.KafkaScheduler)
> at 
> kafka.server.KafkaServer$$anonfun$shutdown$4.apply$mcV$sp(KafkaServer.scala:352)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:79)
> at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
> at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:51)
> at kafka.utils.Logging$class.swallow(Logging.scala:94)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:51)
> at kafka.server.KafkaServer.shutdown(KafkaServer.scala:352)
> at 
> kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
> at com.linkedin.kafka.KafkaServer.notifyShutdown(KafkaServer.java:99)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyShutdownListener(LifeCycleMgr.java:123)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyListeners(LifeCycleMgr.java:102)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyStop(LifeCycleMgr.java:82)
> - locked <0x000640b77bb0> (a java.util.ArrayDeque)
> at com.linkedin.util.factory.Generator.stop(Generator.java:177)
> - locked <0x000640b77bc8> (a java.lang.Object)
> at 
> com.linkedin.offspring.servlet.OffspringServletRuntime.destroy(OffspringServletRuntime.java:82)
> at 
> com.linkedin.offspring.servlet.OffspringServletContextListener.contextDestroyed(OffspringServletContextListener.java:51)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doStop(ContextHandler.java:813)
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.doStop(ServletContextHandler.java:160)
> at 
> org.eclipse.jetty.webapp.WebAppContext.doStop(WebAppContext.java:516)
> at com.linkedin.emweb.WebappContext.doStop(WebappContext.java:35)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006400018b8> (a java.lang.Object)
> at 
> com.linkedin.emweb.ContextBasedHandlerImpl.doStop(ContextBasedHandlerImpl.java:112)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x000640001900> (a java.lang.Object)
> at 
> com.linkedin.emweb.WebappDeployerImpl.stop(WebappDeployerImpl.java:349)
> at 
> com.linkedin.emweb.WebappDeployerImpl.doStop(WebappDeployerImpl.java:414)
> - locked <0x0006400019c0> (a 
> com.linkedin.emweb.MapBasedHandlerImpl)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006404ee8e8> (a java.lang.Object)
> at 
> org.eclipse.jetty.util.component.AggregateLifeCycle.doStop(AggregateLifeCycle.java:107)
> at 
> org.eclipse.jetty.server.handler.AbstractHandler.doStop(AbstractHandler.java:69)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.doStop(HandlerWrapper.java:108)
> at org.eclipse.jetty.server.Server.doStop
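
A stripped-down sketch of the lock-ordering problem described above, not the 
KafkaScheduler code itself: shutdown() holds the object monitor while awaiting 
executor termination, and an in-flight task blocks trying to re-enter the same 
monitor to schedule follow-up work. Class and method names are illustrative.

{code}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SchedulerDeadlockSketch {
  private final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

  public synchronized void schedule(Runnable task) {
    executor.submit(task);
  }

  public synchronized void shutdown() throws InterruptedException {
    executor.shutdown();
    // Holds this object's monitor while waiting for in-flight tasks to finish.
    executor.awaitTermination(1, TimeUnit.DAYS);
  }

  public static void main(String[] args) throws Exception {
    SchedulerDeadlockSketch scheduler = new SchedulerDeadlockSketch();
    scheduler.schedule(() -> {
      try {
        Thread.sleep(100); // give shutdown() time to grab the monitor
      } catch (InterruptedException ignored) { }
      // The running task now needs the same monitor to schedule its follow-up
      // "delete segment" work and blocks behind shutdown(): the deadlock shape
      // shown in the stack trace above.
      scheduler.schedule(() -> { });
    });
    Thread.sleep(10);
    scheduler.shutdown();
  }
}
{code}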

[jira] [Commented] (KAFKA-2454) Dead lock between delete log segment and shutting down.

2015-08-20 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706019#comment-14706019
 ] 

Joel Koshy commented on KAFKA-2454:
---

I'm a bit unclear on the root cause that you describe. The thread dump shows 
that the deletion task has not even entered the executor at this point. It is 
definitely blocked on entering the executor, but that means 
{{executor.awaitTermination}} must be waiting on some other already-executing task, 
right?

> Dead lock between delete log segment and shutting down.
> ---
>
> Key: KAFKA-2454
> URL: https://issues.apache.org/jira/browse/KAFKA-2454
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>
> When the broker shuts down, it shuts down the scheduler, which grabs the 
> scheduler lock and then waits for all the threads in the scheduler to shut down.
> The deadlock happens when a scheduled task tries to delete an old log 
> segment: it schedules a log-delete task which also needs to acquire the 
> scheduler lock. In this case the shutdown thread holds the scheduler lock and 
> waits for the log deletion thread to finish, but the log deletion thread 
> blocks waiting for the scheduler lock.
> Related stack trace:
> {noformat}
> "Thread-1" #21 prio=5 os_prio=0 tid=0x7fe7601a7000 nid=0x1a4e waiting on 
> condition [0x7fe7cf698000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000640d53540> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
> at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:94)
> - locked <0x000640b6d480> (a kafka.utils.KafkaScheduler)
> at 
> kafka.server.KafkaServer$$anonfun$shutdown$4.apply$mcV$sp(KafkaServer.scala:352)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:79)
> at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
> at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:51)
> at kafka.utils.Logging$class.swallow(Logging.scala:94)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:51)
> at kafka.server.KafkaServer.shutdown(KafkaServer.scala:352)
> at 
> kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
> at com.linkedin.kafka.KafkaServer.notifyShutdown(KafkaServer.java:99)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyShutdownListener(LifeCycleMgr.java:123)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyListeners(LifeCycleMgr.java:102)
> at 
> com.linkedin.util.factory.lifecycle.LifeCycleMgr.notifyStop(LifeCycleMgr.java:82)
> - locked <0x000640b77bb0> (a java.util.ArrayDeque)
> at com.linkedin.util.factory.Generator.stop(Generator.java:177)
> - locked <0x000640b77bc8> (a java.lang.Object)
> at 
> com.linkedin.offspring.servlet.OffspringServletRuntime.destroy(OffspringServletRuntime.java:82)
> at 
> com.linkedin.offspring.servlet.OffspringServletContextListener.contextDestroyed(OffspringServletContextListener.java:51)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doStop(ContextHandler.java:813)
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.doStop(ServletContextHandler.java:160)
> at 
> org.eclipse.jetty.webapp.WebAppContext.doStop(WebAppContext.java:516)
> at com.linkedin.emweb.WebappContext.doStop(WebappContext.java:35)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006400018b8> (a java.lang.Object)
> at 
> com.linkedin.emweb.ContextBasedHandlerImpl.doStop(ContextBasedHandlerImpl.java:112)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x000640001900> (a java.lang.Object)
> at 
> com.linkedin.emweb.WebappDeployerImpl.stop(WebappDeployerImpl.java:349)
> at 
> com.linkedin.emweb.WebappDeployerImpl.doStop(WebappDeployerImpl.java:414)
> - locked <0x0006400019c0> (a 
> com.linkedin.emweb.MapBasedHandlerImpl)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
> - locked <0x0006404ee8e8> (a java.lang.Object)
> at 
> org.eclipse.jetty.util.component.AggregateLifeCycle.doStop(AggregateLifeCycle.java:1

[jira] [Updated] (KAFKA-1901) Move Kafka version to be generated in code by build (instead of in manifest)

2015-08-20 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-1901:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks for the patch - committed to trunk.

> Move Kafka version to be generated in code by build (instead of in manifest)
> --
>
> Key: KAFKA-1901
> URL: https://issues.apache.org/jira/browse/KAFKA-1901
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.0
>Reporter: Jason Rosenberg
>Assignee: Manikumar Reddy
> Attachments: KAFKA-1901.patch, KAFKA-1901_2015-06-26_13:16:29.patch, 
> KAFKA-1901_2015-07-10_16:42:53.patch, KAFKA-1901_2015-07-14_17:59:56.patch, 
> KAFKA-1901_2015-08-09_15:04:39.patch, KAFKA-1901_2015-08-20_12:35:00.patch
>
>
> With 0.8.2 (rc2), I've started seeing this warning in the logs of apps 
> deployed to our staging (both server and client):
> {code}
> 2015-01-23 00:55:25,273  WARN [async-message-sender-0] common.AppInfo$ - 
> Can't read Kafka version from MANIFEST.MF. Possible cause: 
> java.lang.NullPointerException
> {code}
> The issue is that in our deployment, apps are deployed as single 'shaded' 
> jars (e.g. using the maven shade plugin). This means the MANIFEST.MF file 
> won't have a Kafka version. Instead, I suggest that the Kafka build generate the 
> proper version in code, as part of the build.
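
A small sketch of why the manifest lookup breaks in a shaded jar and what a 
build-generated constant would avoid; the class and constant names below are 
illustrative, not the actual Kafka code.

{code}
public class VersionLookupSketch {
  public static void main(String[] args) {
    // Manifest-based lookup: getImplementationVersion() reads MANIFEST.MF, so it
    // returns null once the jar has been re-packaged (e.g. shaded) without the
    // original Implementation-Version entry, which is the warning path quoted above.
    Package pkg = VersionLookupSketch.class.getPackage();
    String manifestVersion = (pkg != null) ? pkg.getImplementationVersion() : null;
    System.out.println("Version from manifest: " + manifestVersion);

    // A build-generated constant survives shading because the version is baked
    // into compiled code rather than jar metadata (class name is hypothetical):
    // System.out.println("Version from generated code: " + GeneratedKafkaVersion.VERSION);
  }
}
{code}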



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2446) KAFKA-2205 causes existing Topic config changes to be lost

2015-08-19 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy resolved KAFKA-2446.
---
Resolution: Fixed

Pushed to trunk.

> KAFKA-2205 causes existing Topic config changes to be lost
> --
>
> Key: KAFKA-2446
> URL: https://issues.apache.org/jira/browse/KAFKA-2446
> Project: Kafka
>  Issue Type: Bug
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>
> The path was changed from "/config/topics/" to "/config/topic". This causes 
> existing config overrides to no longer be read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2084) byte rate metrics per client ID (producer and consumer)

2015-08-19 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703965#comment-14703965
 ] 

Joel Koshy commented on KAFKA-2084:
---

Actually I did not mean full-fledged 
[scalastyle|https://github.com/ngbinh/gradle-scalastyle-plugin], but a very 
simple inline gradle plugin (in our build.gradle) that does these sorts of 
checks.

> byte rate metrics per client ID (producer and consumer)
> ---
>
> Key: KAFKA-2084
> URL: https://issues.apache.org/jira/browse/KAFKA-2084
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
>  Labels: quotas
> Attachments: KAFKA-2084.patch, KAFKA-2084_2015-04-09_18:10:56.patch, 
> KAFKA-2084_2015-04-10_17:24:34.patch, KAFKA-2084_2015-04-21_12:21:18.patch, 
> KAFKA-2084_2015-04-21_12:28:05.patch, KAFKA-2084_2015-05-05_15:27:35.patch, 
> KAFKA-2084_2015-05-05_17:52:02.patch, KAFKA-2084_2015-05-11_16:16:01.patch, 
> KAFKA-2084_2015-05-26_11:50:50.patch, KAFKA-2084_2015-06-02_17:02:00.patch, 
> KAFKA-2084_2015-06-02_17:09:28.patch, KAFKA-2084_2015-06-02_17:10:52.patch, 
> KAFKA-2084_2015-06-04_16:31:22.patch, KAFKA-2084_2015-06-12_10:39:35.patch, 
> KAFKA-2084_2015-06-29_17:53:44.patch, KAFKA-2084_2015-08-04_18:50:51.patch, 
> KAFKA-2084_2015-08-04_19:07:46.patch, KAFKA-2084_2015-08-07_11:27:51.patch, 
> KAFKA-2084_2015-08-10_13:48:50.patch, KAFKA-2084_2015-08-10_21:57:48.patch, 
> KAFKA-2084_2015-08-12_12:02:33.patch, KAFKA-2084_2015-08-12_12:04:51.patch, 
> KAFKA-2084_2015-08-12_12:08:17.patch, KAFKA-2084_2015-08-12_21:24:07.patch, 
> KAFKA-2084_2015-08-13_19:08:27.patch, KAFKA-2084_2015-08-13_19:19:16.patch, 
> KAFKA-2084_2015-08-14_17:43:00.patch
>
>
> We need to be able to track the bytes-in/bytes-out rate on a per-client ID 
> basis. This is necessary for quotas.
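
A minimal sketch of recording a per-client byte rate with the 
org.apache.kafka.common.metrics package; the sensor, group, and tag names are 
illustrative and not taken from the actual patch.

{code}
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import java.util.Collections;

public class PerClientByteRateSketch {
  public static void main(String[] args) {
    Metrics metrics = new Metrics();
    String clientId = "my-client";
    // One sensor per client ID; the rate is tagged with the client ID so it can
    // later be looked up, e.g. for quota enforcement.
    Sensor sensor = metrics.sensor(clientId + ".bytes-in");
    sensor.add(new MetricName("bytes-in-rate", "client-quotas",
        "Bytes in per second for " + clientId,
        Collections.singletonMap("client-id", clientId)), new Rate());
    // Record the size of each incoming request attributed to this client.
    sensor.record(4096);
    sensor.record(1024);
    metrics.close();
  }
}
{code}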



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

