[jira] [Updated] (KAFKA-3185) Allow users to cleanup internal data
[ https://issues.apache.org/jira/browse/KAFKA-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-3185:
---------------------------------
    Labels: developer-experience  (was: )

> Allow users to cleanup internal data
> ------------------------------------
>
>                 Key: KAFKA-3185
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3185
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: kafka streams
>            Reporter: Guozhang Wang
>            Priority: Blocker
>              Labels: developer-experience
>             Fix For: 0.10.1.0
>
> Currently the internal data is managed completely by the Kafka Streams
> framework and users cannot clean it up actively. This results in a bad
> out-of-the-box user experience, especially for running demo programs, since
> it leaves behind internal data (changelog topics, RocksDB files, etc.) that
> must be cleaned up manually. It would be better to add a
> {code}
> KafkaStreams.cleanup()
> {code}
> function call to clean up this internal data programmatically.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
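Editor's note: a minimal sketch of how the proposed cleanup call is used, assuming the shape the API eventually took on KafkaStreams as cleanUp(), which wipes the local state directory for the application id. The application id, bootstrap address, and topic names below are hypothetical.

{code}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class CleanupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cleanup-demo");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("input-topic").to("output-topic"); // hypothetical topics

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.cleanUp(); // delete local internal data (e.g. RocksDB state) for this app id
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}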
[jira] [Updated] (KAFKA-3262) Make KafkaStreams debugging friendly
[ https://issues.apache.org/jira/browse/KAFKA-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-3262:
---------------------------------
    Labels: developer-experience  (was: )

> Make KafkaStreams debugging friendly
> ------------------------------------
>
>                 Key: KAFKA-3262
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3262
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: kafka streams
>    Affects Versions: 0.10.0.0
>            Reporter: Yasuhiro Matsuda
>              Labels: developer-experience
>             Fix For: 0.10.1.0
>
> Currently KafkaStreams polls records in the same thread as the data
> processing thread. This makes debugging user code, as well as KafkaStreams
> itself, difficult. When the thread is suspended by the debugger, the next
> heartbeat of the consumer tied to that thread won't be sent until the thread
> is resumed. This often results in missed heartbeats and causes a group
> rebalance, so the context may be completely different the next time the
> thread hits the breakpoint.
> We should consider using separate threads for polling and processing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
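Editor's note: one generic way to realize the suggested split, sketched with a plain KafkaConsumer and hypothetical broker/group/topic names. A poller thread keeps calling poll() (and therefore heartbeating) while a worker drains a bounded queue, so a breakpoint in the processing code no longer starves heartbeats; offset-commit ordering is deliberately ignored here.

{code}
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledPolling {
    public static void main(String[] args) {
        // Bounded hand-off buffer between the polling thread and the worker.
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(1024);

        Thread poller = new Thread(() -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            props.put("group.id", "debug-demo");              // hypothetical group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic")); // hypothetical topic
                while (!Thread.currentThread().isInterrupted()) {
                    for (ConsumerRecord<String, String> r : consumer.poll(100)) {
                        queue.put(r); // blocks only if the worker falls far behind
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    ConsumerRecord<String, String> r = queue.take();
                    // A breakpoint here suspends only this thread; the poller keeps
                    // calling poll(), so heartbeats continue and no rebalance fires.
                    System.out.printf("%s-%d@%d: %s%n", r.topic(), r.partition(), r.offset(), r.value());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        poller.start();
        processor.start();
    }
}
{code}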
[jira] [Updated] (KAFKA-3337) Extract selector as a separate groupBy operator for KTable aggregations
[ https://issues.apache.org/jira/browse/KAFKA-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-3337:
---------------------------------
    Labels: api-change newbie++  (was: newbie++)

> Extract selector as a separate groupBy operator for KTable aggregations
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-3337
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3337
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: kafka streams
>            Reporter: Guozhang Wang
>            Assignee: Matthias J. Sax
>              Labels: api-change, newbie++
>             Fix For: 0.10.1.0
>
> Currently KTable aggregation takes a selector for selecting the aggregate
> key, and an aggregator for aggregating the values with the same selected
> key, which makes the function a little bit "heavy":
> {code}
> table.groupBy(initializer, adder, subtractor, selector, /* optional serde */);
> {code}
> It is better to extract the selector into a separate groupBy function such that
> {code}
> KTableGrouped KTable#groupBy(selector);
> KTable KTableGrouped#aggregate(initializer, adder, subtractor, /* optional serde */);
> {code}
> Note that "KTableGrouped" only has APIs for aggregate and reduce, and
> nothing else. So users have to follow the pattern below:
> {code}
> table.groupBy(...).aggregate(...);
> {code}
> This pattern is more natural for users who are familiar with SQL / Pig or
> the Spark DSL, etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
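Editor's note: a usage sketch against the API that eventually shipped in 0.10.1, where the grouped abstraction is named KGroupedTable rather than KTableGrouped. The topic, store names, and counting logic are hypothetical.

{code}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class GroupByDemo {
    // Count users per region: re-key by the value, then aggregate per new key.
    public static KTable<String, Long> buildCounts(KStreamBuilder builder) {
        KTable<String, String> userRegions =
                builder.table("user-regions", "user-regions-store"); // hypothetical names
        return userRegions
                .groupBy((user, region) -> KeyValue.pair(region, region)) // the extracted selector
                .aggregate(
                        () -> 0L,                     // initializer
                        (region, v, agg) -> agg + 1L, // adder: a row now maps to this key
                        (region, v, agg) -> agg - 1L, // subtractor: a row no longer maps here
                        Serdes.Long(),
                        "region-counts-store");       // hypothetical state store
    }
}
{code}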
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146250#comment-15146250 ]

Neha Narkhede commented on KAFKA-1464:
--------------------------------------
The most useful resource to throttle is network bandwidth usage by
replication, as measured by the rate of total outgoing replication data on
every leader. Adding the ability on every leader to cap data transferred
under an upper limit is what we are looking for. This can be a config option
similar to the one we have for the log cleaner. It seems to me that it is
better to have the leader send less rather than have the replica fetch less,
since the leader has a holistic view of the total amount of data being
transferred out. Data transferred from a leader includes:
- Fetch requests from an in-sync replica
- Fetch requests from an out-of-sync replica of a partition being reassigned
- Fetch requests from an out-of-sync replica of a partition not being reassigned
Data transferred across 1+2+3 should stay roughly within the configured upper
limit. If the limit is crossed, we want to start throttling all requests
except the ones that fall under #1. The leader can assign the remaining
available bandwidth amongst partitions that fall under #2 and #3, allowing
more bandwidth to #3, since presumably it is fine to let partitions being
reassigned catch up slower than the rest. Throttling could involve returning
fewer bytes, as determined by this computation, for each such partition, as
Jay suggests.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.9.1.0
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment in which you're
> trying to serve downstream applications at the same time you're performing
> maintenance on the Kafka cluster.
> An option to throttle replication to a specific rate (in either MB/s or
> activities/second) would help production systems better handle maintenance
> tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
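Editor's note: a hypothetical sketch of the allocation policy described above, not Kafka's actual throttling code. In-sync fetches (#1) are never throttled, and whatever remains of the configured cap is split between out-of-sync followers, weighted toward partitions that are not being reassigned (#3).

{code}
public final class ReplicationBudget {
    // Returns the per-partition bytes/sec allowance for one out-of-sync follower.
    // The 2:1 weighting toward non-reassigned partitions is an arbitrary choice
    // made for illustration.
    public static long allowance(long capBytesPerSec, long inSyncBytesPerSec,
                                 boolean beingReassigned,
                                 int reassignedPartitions, int normalPartitions) {
        long remaining = Math.max(0, capBytesPerSec - inSyncBytesPerSec);
        long totalWeight = 2L * normalPartitions + reassignedPartitions;
        if (totalWeight == 0) return remaining;
        long perWeight = remaining / totalWeight;
        return beingReassigned ? perWeight : 2 * perWeight;
    }
}
{code}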
[jira] [Comment Edited] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146250#comment-15146250 ]

Neha Narkhede edited comment on KAFKA-1464 at 2/13/16 11:36 PM:
----------------------------------------------------------------
The most useful resource to throttle is network bandwidth usage by
replication, as measured by the rate of total outgoing replication data on
every leader. Adding the ability on every leader to cap data transferred
under an upper limit is what we are looking for. This can be a config option
similar to the one we have for the log cleaner. It seems to me that it is
better to have the leader send less rather than have the replica fetch less,
since the leader has a holistic view of the total amount of data being
transferred out. Data transferred from a leader includes:
# Fetch requests from an in-sync replica
# Fetch requests from an out-of-sync replica of a partition being reassigned
# Fetch requests from an out-of-sync replica of a partition not being reassigned
Data transferred across 1+2+3 should stay roughly within the configured upper
limit. If the limit is crossed, we want to start throttling all requests
except the ones that fall under #1. The leader can assign the remaining
available bandwidth amongst partitions that fall under #2 and #3, allowing
more bandwidth to #3, since presumably it is fine to let partitions being
reassigned catch up slower than the rest. Throttling could involve returning
fewer bytes, as determined by this computation, for each such partition, as
Jay suggests.

was (Author: nehanarkhede):
The most useful resource to throttle is network bandwidth usage by
replication, as measured by the rate of total outgoing replication data on
every leader. Adding the ability on every leader to cap data transferred
under an upper limit is what we are looking for. This can be a config option
similar to the one we have for the log cleaner. It seems to me that it is
better to have the leader send less rather than have the replica fetch less,
since the leader has a holistic view of the total amount of data being
transferred out. Data transferred from a leader includes:
- Fetch requests from an in-sync replica
- Fetch requests from an out-of-sync replica of a partition being reassigned
- Fetch requests from an out-of-sync replica of a partition not being reassigned
Data transferred across 1+2+3 should stay roughly within the configured upper
limit. If the limit is crossed, we want to start throttling all requests
except the ones that fall under #1. The leader can assign the remaining
available bandwidth amongst partitions that fall under #2 and #3, allowing
more bandwidth to #3, since presumably it is fine to let partitions being
reassigned catch up slower than the rest. Throttling could involve returning
fewer bytes, as determined by this computation, for each such partition, as
Jay suggests.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.9.1.0
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment in which you're
> trying to serve downstream applications at the same time you're performing
> maintenance on the Kafka cluster.
> An option to throttle replication to a specific rate (in either MB/s or
> activities/second) would help production systems better handle maintenance
> tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (KAFKA-3209) Support single message transforms in Kafka Connect
Neha Narkhede created KAFKA-3209:
------------------------------------

             Summary: Support single message transforms in Kafka Connect
                 Key: KAFKA-3209
                 URL: https://issues.apache.org/jira/browse/KAFKA-3209
             Project: Kafka
          Issue Type: Improvement
          Components: copycat
            Reporter: Neha Narkhede

Users should be able to perform light transformations on messages between a
connector and Kafka. This is needed because some transformations must be
performed before the data hits Kafka (e.g. filtering certain types of events
or PII filtering). It's also useful for very light, single-message
modifications that are easier to perform inline with the data import/export.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
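Editor's note: a hypothetical interface sketch of what such "single message transforms" could look like; the feature that eventually shipped (KIP-66) centers on a similar one-method interface, org.apache.kafka.connect.transforms.Transformation#apply. The field name below is made up.

{code}
import java.util.Map;

// Hypothetical transform hook applied to each record between a connector and Kafka.
interface Transform<R> {
    R apply(R record); // return null to drop (filter out) the record
}

// Example: scrub a PII field from a map-structured record value.
class DropEmailField implements Transform<Map<String, Object>> {
    @Override
    public Map<String, Object> apply(Map<String, Object> record) {
        record.remove("email"); // hypothetical PII field
        return record;
    }
}
{code}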
[jira] [Commented] (KAFKA-2807) Movement of throughput throttler to common broke upgrade tests
[ https://issues.apache.org/jira/browse/KAFKA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000969#comment-15000969 ]

Neha Narkhede commented on KAFKA-2807:
--------------------------------------
We don't want our tests to fail, so marking this as a blocker.

> Movement of throughput throttler to common broke upgrade tests
> ---------------------------------------------------------------
>
>                 Key: KAFKA-2807
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2807
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Geoff Anderson
>            Assignee: Geoff Anderson
>            Priority: Blocker
>             Fix For: 0.9.0.0
>
> In order to run compatibility tests with an 0.8.2 producer using
> VerifiableProducer, we use the 0.8.2 kafka-run-tools.sh classpath augmented
> with the 0.9.0 tools and tools-dependencies classpaths.
> Recently, some refactoring efforts moved ThroughputThrottler to the
> org.apache.kafka.common.utils package, but this breaks the existing
> compatibility tests:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/utils/ThroughputThrottler
>         at org.apache.kafka.tools.VerifiableProducer.main(VerifiableProducer.java:334)
> Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.utils.ThroughputThrottler
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         ... 1 more
> {code}
> Given the need to be able to run VerifiableProducer against 0.8.x, I'm not
> sure VerifiableProducer can depend on org.apache.kafka.common.utils at this
> point in time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2807) Movement of throughput throttler to common broke upgrade tests
[ https://issues.apache.org/jira/browse/KAFKA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2807:
---------------------------------
    Priority: Blocker  (was: Major)

> Movement of throughput throttler to common broke upgrade tests
> ---------------------------------------------------------------
>
>                 Key: KAFKA-2807
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2807
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Geoff Anderson
>            Assignee: Geoff Anderson
>            Priority: Blocker
>             Fix For: 0.9.0.0
>
> In order to run compatibility tests with an 0.8.2 producer using
> VerifiableProducer, we use the 0.8.2 kafka-run-tools.sh classpath augmented
> with the 0.9.0 tools and tools-dependencies classpaths.
> Recently, some refactoring efforts moved ThroughputThrottler to the
> org.apache.kafka.common.utils package, but this breaks the existing
> compatibility tests:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/utils/ThroughputThrottler
>         at org.apache.kafka.tools.VerifiableProducer.main(VerifiableProducer.java:334)
> Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.utils.ThroughputThrottler
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         ... 1 more
> {code}
> Given the need to be able to run VerifiableProducer against 0.8.x, I'm not
> sure VerifiableProducer can depend on org.apache.kafka.common.utils at this
> point in time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2517) Performance Regression post SSL implementation
[ https://issues.apache.org/jira/browse/KAFKA-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2517:
---------------------------------
    Priority: Blocker  (was: Critical)

> Performance Regression post SSL implementation
> ----------------------------------------------
>
>                 Key: KAFKA-2517
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2517
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ben Stopford
>            Assignee: Ben Stopford
>            Priority: Blocker
>             Fix For: 0.8.3
>
> It would appear that we incurred a performance regression on submission of
> the SSL work, affecting the performance of the new Kafka consumer.
> Running with 1KB messages. MacBook 2.3 GHz Intel Core i7, 8GB, APPLE SSD
> SM256E. Single server instance. All local.
> kafka-consumer-perf-test.sh ... --messages 300 --new-consumer
> Pre-SSL changes (commit 503bd36647695e8cc91893ffb80346dd03eb0bc5):
> Steady state throughput = 234.8 MB/s
> (2861.5913, 234.8261, 3000596, 246233.0543)
> Post-SSL changes (commit 13c432f7952de27e9bf8cb4adb33a91ae3a4b738):
> Steady state throughput = 178.1 MB/s
> (2861.5913, 178.1480, 3000596, 186801.7182)
> The implication is a 25% reduction in consumer throughput under these test
> conditions.
> This appears to be caused by the use of PlaintextTransportLayer rather than
> SocketChannel in FileMessageSet.writeTo(), meaning a zero-copy transfer is
> not invoked.
> Switching to the use of a SocketChannel directly in FileMessageSet.writeTo()
> yields the following result:
> Steady state throughput = 281.8 MB/s
> (2861.5913, 281.8191, 3000596, 295508.7650)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
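Editor's note: the zero-copy call at issue. When FileMessageSet.writeTo() hands the raw SocketChannel to FileChannel.transferTo(), the copy happens kernel-side (sendfile); routing the bytes through an intermediate write() path loses that. A minimal illustration of the fast path:

{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public final class ZeroCopySend {
    // transferTo delegates the copy to the kernel, skipping the user-space
    // buffer that a read()/write() loop through an intermediate layer needs --
    // the difference the ticket measured at roughly 25% of throughput.
    public static long send(FileChannel log, SocketChannel socket,
                            long position, long count) throws IOException {
        return log.transferTo(position, count, socket);
    }
}
{code}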
[jira] [Commented] (KAFKA-2403) Expose offset commit metadata in new consumer
[ https://issues.apache.org/jira/browse/KAFKA-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708125#comment-14708125 ]

Neha Narkhede commented on KAFKA-2403:
--------------------------------------
[~hachikuji] [~ewencp] Cross-posting my thoughts on this change here for more
visibility:

bq. On the other hand, maybe most users don't even specify the offsets
manually anyway and the concern here is unwarranted since 99% of the cases
are handled by commit(CommitType) and commit(CommitType, ConsumerCommitCallback)

I think manual offset commit is really a very small percentage of all uses.
Even though I agree that amongst that minority, fewer would have custom
metadata, I'm not sure it is worth adding the extra commitWithMetadata API
for. It may be ok in this case to go with
{code}
public void commit(Map<TopicPartition, OffsetMetadata> offsets, CommitType commitType);
{code}

> Expose offset commit metadata in new consumer
> ---------------------------------------------
>
>                 Key: KAFKA-2403
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2403
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Minor
>
> The offset commit protocol supports the ability to add user metadata to
> commits, but this is not yet exposed in the new consumer API. The
> straightforward way to add it is to create a container for the offset and
> metadata and adjust the KafkaConsumer API accordingly.
> {code}
> OffsetMetadata {
>   long offset;
>   String metadata;
> }
> KafkaConsumer {
>   commit(Map<TopicPartition, OffsetMetadata>)
>   OffsetMetadata committed(TopicPartition)
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
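Editor's note: for reference, the shape this eventually took in the released consumer: the container class is OffsetAndMetadata and the map-taking commits are commitSync/commitAsync. The topic, partition, and metadata string below are hypothetical.

{code}
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitWithMetadata {
    // Attach application metadata (e.g. a checkpoint id) to a manual commit.
    public static void commit(KafkaConsumer<?, ?> consumer, long nextOffset) {
        TopicPartition tp = new TopicPartition("input-topic", 0); // hypothetical partition
        consumer.commitSync(Collections.singletonMap(
                tp, new OffsetAndMetadata(nextOffset, "checkpoint-42"))); // hypothetical metadata
    }
}
{code}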
[jira] [Commented] (KAFKA-2367) Add Copycat runtime data API
[ https://issues.apache.org/jira/browse/KAFKA-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702063#comment-14702063 ]

Neha Narkhede commented on KAFKA-2367:
--------------------------------------
I think there are various tradeoffs, as with most choices that a framework is
presented with :) The tradeoffs I see are:

1. Agility vs. maturity: The maturity argument is that Avro is an advanced
serialization library that already exists and, in spite of having been
through various compatibility issues, is now well tested and adopted. The
agility argument against Avro is that for a new framework like Copycat, we
might be able to move faster (over several releases) by owning and fixing our
runtime data model, rather than waiting for the Avro community to release a
patched version. This is a problem we struggled with for ZkClient,
codahale-metrics, and ZooKeeper on the core Kafka side, and though one can
argue that the Avro community is better, this still remains a concern. The
success of the Copycat framework depends on its ability to always be the
preferred framework for copying data to Kafka, and as an early project,
agility is key.

2. Cost/time savings vs. control: The cost/time-saving argument goes for
adopting Avro even if we really need a very small percentage of it. This does
save us a little time upfront, but the downside is that we end up having
Copycat depend on Avro (and all its dependencies). I'm totally in favor of
using a mature open source library, but observing the size of the code we
need to pull from Avro, I couldn't convince myself of the benefit it presents
in saving some effort upfront. After all, there will be bugs in either
codebase; we'd have to find the fastest way to fix those.

3. Generic public interface to encourage connector developers: This is a very
right-brain argument and a subtle one. I agree with [~jkreps] here. Given
that our goal should be to attract a large ecosystem of connectors, I would
want us to remove every bit of pain and friction that would cause connector
developers to either question our choice of Avro or spend time clarifying its
impact on them. I understand that in practice this isn't a concern and, as
long as we have the right serializers, this will not even be quite so
visible, but a simple SchemaBuilder imported from org.apache.avro can start
this discussion and distract connector developers who aren't necessarily Avro
fans.

Overall, given the tradeoffs, I'm leaning towards us picking a custom one and
not depending on all of Avro.

> Add Copycat runtime data API
> ----------------------------
>
>                 Key: KAFKA-2367
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2367
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: copycat
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>             Fix For: 0.8.3
>
> Design the API used for runtime data in Copycat. This API is used to
> construct schemas and records that Copycat processes. This needs to be a
> fairly general data model (think Avro, JSON, Protobufs, Thrift) in order to
> support complex, varied data types that may be input from/output to many
> data systems.
> This issue should also address the serialization interfaces used within
> Copycat, which translate the runtime data into serialized byte[] form. It is
> important that these be considered together because the data format can be
> used in multiple ways (records, partition IDs, partition offsets), so it and
> the corresponding serializers must be sufficient for all these use cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
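Editor's note: the custom (non-Avro) runtime data API that ultimately shipped with the framework (as Kafka Connect) looks roughly like this; the record name, fields, and values below are made up.

{code}
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class RuntimeDataDemo {
    // Build a schema and a record value with Connect's own data model,
    // with no dependency on org.apache.avro.
    public static Struct sampleValue() {
        Schema schema = SchemaBuilder.struct().name("page_view") // hypothetical name
                .field("url", Schema.STRING_SCHEMA)
                .field("latency_ms", Schema.INT64_SCHEMA)
                .build();
        return new Struct(schema)
                .put("url", "https://example.com/index.html")
                .put("latency_ms", 42L);
    }
}
{code}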
[jira] [Commented] (KAFKA-2365) Copycat checklist
[ https://issues.apache.org/jira/browse/KAFKA-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643087#comment-14643087 ]

Neha Narkhede commented on KAFKA-2365:
--------------------------------------
Worth discussing a process for including a connector in Kafka core, but I
think we went through this in the KIP discussion, and here is what I think.
To keep packaging, review, and code management easier, it is better to
include just a couple of lightweight connectors, enough to show the usage of
the copycat APIs (file in/file out). Any connector that requires depending on
an external system (HDFS, Elasticsearch) should really live elsewhere. We
should also delete the ones under contrib; they never ended up getting
supported by the community. Since there will always be connectors that need
to live outside Kafka, I think we should instead focus on how to make tooling
easier for discovering and using these federated connectors.

> Copycat checklist
> -----------------
>
>                 Key: KAFKA-2365
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2365
>             Project: Kafka
>          Issue Type: New Feature
>          Components: copycat
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>              Labels: feature
>             Fix For: 0.8.3
>
> This covers the development plan for
> [KIP-26|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767].
> There are a number of features that can be developed in sequence to make
> incremental progress, and often in parallel:
> * Initial patch - connector API and core implementation
> * Runtime data API
> * Standalone CLI
> * REST API
> * Distributed copycat - CLI
> * Distributed copycat - coordinator
> * Distributed copycat - config storage
> * Distributed copycat - offset storage
> * Log/file connector (sample source/sink connector)
> * Elasticsearch sink connector (sample sink connector for a full
>   log -> Kafka -> Elasticsearch sample pipeline)
> * Copycat metrics
> * System tests (including connector tests)
> * Mirrormaker connector
> * Copycat documentation
> This is an initial list, but it might need refinement to allow for more
> incremental progress and may be missing features we find we want before the
> initial release.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-2350) Add KafkaConsumer pause capability
[ https://issues.apache.org/jira/browse/KAFKA-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642229#comment-14642229 ]

Neha Narkhede commented on KAFKA-2350:
--------------------------------------
Few thoughts:

bq. 1. Add pause/unpause
bq. 2. Allow subscribe(topic) followed by unsubscribe(partition) to subscribe to a topic but suppress a partition
bq. 3. Either of the above but making the use of group management explicit using an enable.group.management flag

I'm in favor of 1, for several reasons:
1. It keeps the API semantics clean. subscribe/unsubscribe indicates intent
to consume data, while pause/resume indicates a *temporary* preference for
the purposes of flow control.
2. It avoids all the different permutations of subscribe/unsubscribe that we
would need to worry about; each one of those would have to make sense and be
explained clearly to the user. This discussion is confusing enough that I'm
convinced that it will not be easy.
3. pause/resume moves the consumer to a different state in its state diagram.
Overloading the same API to represent two different states is unintuitive.

Also +1 on:
1. Renaming unpause to resume.
2. Not maintaining the pause/resume preference across consumer rebalances.

There may be complications in the implementation of the above preferences
that I may have overlooked, but I feel we should design the APIs for the
right behavior and figure out the implementation-related issues that might
come up as a result.

> Add KafkaConsumer pause capability
> ----------------------------------
>
>                 Key: KAFKA-2350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2350
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> There are some use cases in stream processing where it is helpful to be able
> to pause consumption of a topic. For example, when joining two topics, you
> may need to delay processing of one topic while you wait for the consumer of
> the other topic to catch up. The new consumer currently doesn't provide a
> nice way to do this. If you skip poll() or if you unsubscribe, then a
> rebalance will be triggered and your partitions will be reassigned.
> One way to achieve this would be to add two new methods to KafkaConsumer:
> {code}
> void pause(String... topics);
> void unpause(String... topics);
> {code}
> When a topic is paused, a call to KafkaConsumer.poll will not initiate any
> new fetches for that topic. After it is unpaused, fetches will begin again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (KAFKA-2350) Add KafkaConsumer pause capability
[ https://issues.apache.org/jira/browse/KAFKA-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642229#comment-14642229 ]

Neha Narkhede edited comment on KAFKA-2350 at 7/27/15 3:55 AM:
---------------------------------------------------------------
Few thoughts:

bq. 1. Add pause/unpause
bq. 2. Allow subscribe(topic) followed by unsubscribe(partition) to subscribe to a topic but suppress a partition
bq. 3. Either of the above but making the use of group management explicit using an enable.group.management flag

I'm in favor of 1, for several reasons:
1. It keeps the API semantics clean. subscribe/unsubscribe indicates intent
to consume data, while pause/resume indicates a *temporary* preference for
the purposes of flow control.
2. It avoids all the different permutations of subscribe/unsubscribe that we
would need to worry about; each one of those would have to make sense and be
explained clearly to the user. This discussion is confusing enough that I'm
convinced that it will not be easy.
3. pause/resume moves the consumer to a different state in its state diagram.
Overloading the same API to represent two different states is unintuitive.

Also +1 on:
1. Renaming unpause to resume.
2. Not maintaining the pause/resume preference across consumer rebalances.

Also not in favor of adding the enable.group.management config. I agree with
Jay that adding the config will just complicate the semantics and reduce
operational simplicity, increasing the number of ways the API calls made by
the user would not behave as expected.

There may be complications in the implementation of the above preferences
that I may have overlooked, but I feel we should design the APIs for the
right behavior and figure out the implementation-related issues that might
come up as a result.

was (Author: nehanarkhede):
Few thoughts:

bq. 1. Add pause/unpause
bq. 2. Allow subscribe(topic) followed by unsubscribe(partition) to subscribe to a topic but suppress a partition
bq. 3. Either of the above but making the use of group management explicit using an enable.group.management flag

I'm in favor of 1, for several reasons:
1. It keeps the API semantics clean. subscribe/unsubscribe indicates intent
to consume data, while pause/resume indicates a *temporary* preference for
the purposes of flow control.
2. It avoids all the different permutations of subscribe/unsubscribe that we
would need to worry about; each one of those would have to make sense and be
explained clearly to the user. This discussion is confusing enough that I'm
convinced that it will not be easy.
3. pause/resume moves the consumer to a different state in its state diagram.
Overloading the same API to represent two different states is unintuitive.

Also +1 on:
1. Renaming unpause to resume.
2. Not maintaining the pause/resume preference across consumer rebalances.

There may be complications in the implementation of the above preferences
that I may have overlooked, but I feel we should design the APIs for the
right behavior and figure out the implementation-related issues that might
come up as a result.

> Add KafkaConsumer pause capability
> ----------------------------------
>
>                 Key: KAFKA-2350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2350
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> There are some use cases in stream processing where it is helpful to be able
> to pause consumption of a topic. For example, when joining two topics, you
> may need to delay processing of one topic while you wait for the consumer of
> the other topic to catch up. The new consumer currently doesn't provide a
> nice way to do this. If you skip poll() or if you unsubscribe, then a
> rebalance will be triggered and your partitions will be reassigned.
> One way to achieve this would be to add two new methods to KafkaConsumer:
> {code}
> void pause(String... topics);
> void unpause(String... topics);
> {code}
> When a topic is paused, a call to KafkaConsumer.poll will not initiate any
> new fetches for that topic. After it is unpaused, fetches will begin again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
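Editor's note: the capability landed with the renaming discussed above (pause/resume); in later releases both take a Collection<TopicPartition> rather than topic-name varargs. A minimal sketch of the flow-control pattern:

{code}
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public final class FlowControl {
    // Temporarily stop fetching (e.g. while the other side of a join catches
    // up) without unsubscribing, so no rebalance is triggered.
    public static void pauseAll(KafkaConsumer<?, ?> consumer) {
        Set<TopicPartition> owned = consumer.assignment();
        consumer.pause(owned); // poll() keeps the group alive but fetches nothing
    }

    public static void resumeAll(KafkaConsumer<?, ?> consumer) {
        consumer.resume(consumer.assignment()); // fetching restarts on the next poll()
    }
}
{code}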
[jira] [Updated] (KAFKA-2311) Consumer's ensureNotClosed method not thread safe
[ https://issues.apache.org/jira/browse/KAFKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2311:
---------------------------------
    Reviewer: Neha Narkhede

> Consumer's ensureNotClosed method not thread safe
> --------------------------------------------------
>
>                 Key: KAFKA-2311
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2311
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>            Reporter: Tim Brooks
>            Assignee: Tim Brooks
>         Attachments: KAFKA-2311.patch, KAFKA-2311.patch
>
> When a call to the consumer is made, the first check is to see that the
> consumer is not closed. This variable is not volatile, so there is no
> guarantee that previous stores will be visible before a read.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
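Editor's note: a generic sketch of the usual fix for this class of bug, assuming nothing about the actual patch: make the closed flag an AtomicBoolean (or a volatile boolean) so a close() performed on one thread is visible to checks on another.

{code}
import java.util.concurrent.atomic.AtomicBoolean;

public class ClosableClient {
    // AtomicBoolean gives the visibility guarantee a plain boolean lacks.
    private final AtomicBoolean closed = new AtomicBoolean(false);

    public void close() {
        if (closed.compareAndSet(false, true)) {
            // release resources exactly once
        }
    }

    private void ensureNotClosed() {
        if (closed.get())
            throw new IllegalStateException("This consumer has already been closed.");
    }
}
{code}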
[jira] [Updated] (KAFKA-2321) Introduce CONTRIBUTING.md
[ https://issues.apache.org/jira/browse/KAFKA-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2321:
---------------------------------
    Reviewer: Neha Narkhede

> Introduce CONTRIBUTING.md
> --------------------------
>
>                 Key: KAFKA-2321
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2321
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Ismael Juma
>            Assignee: Ismael Juma
>         Attachments: KAFKA-2321.patch
>
> This file is displayed when people create a pull request in GitHub. It
> should link to the relevant pages in the wiki and website.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2323) Simplify ScalaTest dependency versions
[ https://issues.apache.org/jira/browse/KAFKA-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2323:
---------------------------------
    Reviewer: Jun Rao

> Simplify ScalaTest dependency versions
> ---------------------------------------
>
>                 Key: KAFKA-2323
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2323
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Ismael Juma
>            Assignee: Ismael Juma
>            Priority: Minor
>         Attachments: KAFKA-2323.patch
>
> We currently use the following ScalaTest versions:
> * 1.8 for Scala 2.9.x
> * 1.9.1 for Scala 2.10.x
> * 2.2.0 for Scala 2.11.x
> I propose we simplify it to:
> * 1.9.1 for Scala 2.9.x
> * 2.2.5 for every other Scala version (currently 2.10.x and 2.11.x)
> And since we will drop support for Scala 2.9.x soon, the conditional check
> for ScalaTest can then be removed altogether.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2236) offset request reply racing with segment rolling
[ https://issues.apache.org/jira/browse/KAFKA-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2236:
---------------------------------
    Assignee: Jason Gustafson

> offset request reply racing with segment rolling
> -------------------------------------------------
>
>                 Key: KAFKA-2236
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2236
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.2.0
>         Environment: Linux x86_64, java 1.7.0_72, discovered using a
>                      librdkafka-based client
>            Reporter: Alfred Landrum
>            Assignee: Jason Gustafson
>            Priority: Critical
>              Labels: newbie
>
> My use case with Kafka involves an aggressive retention policy that rolls
> segment files frequently. My librdkafka-based client sees occasional errors
> on offset requests, showing up in the broker log like:
> {noformat}
> [2015-06-02 02:33:38,047] INFO Rolled new log segment for 'receiver-93b40462-3850-47c1-bcda-8a3e221328ca-50' in 1 ms. (kafka.log.Log)
> [2015-06-02 02:33:38,049] WARN [KafkaApi-0] Error while responding to offset request (kafka.server.KafkaApis)
> java.lang.ArrayIndexOutOfBoundsException: 3
>         at kafka.server.KafkaApis.fetchOffsetsBefore(KafkaApis.scala:469)
>         at kafka.server.KafkaApis.fetchOffsets(KafkaApis.scala:449)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:411)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:402)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at kafka.server.KafkaApis.handleOffsetRequest(KafkaApis.scala:402)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:61)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Quoting Guozhang Wang's reply to my query on the users list:
> {quote}
> I checked the 0.8.2 code and probably found a bug related to your issue.
> Basically, segsArray.last.size is called multiple times while handling
> offset requests, while segsArray.last could get concurrent appends. Hence it
> is possible that at line 461, if(segsArray.last.size > 0) returns false,
> while later at line 468, if(segsArray.last.size > 0) could return true.
> {quote}
> http://mail-archives.apache.org/mod_mbox/kafka-users/201506.mbox/%3CCAHwHRrUK-3wdoEAaFbsD0E859Ea0gXixfxgDzF8E3%3D_8r7K%2Bpw%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
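Editor's note: a self-contained Java sketch of the check-then-act race Guozhang describes (two reads of a concurrently growing value) and the snapshot fix; this is illustrative, not Kafka's actual Scala code.

{code}
import java.util.concurrent.atomic.AtomicLong;

public class OffsetListSketch {
    private final AtomicLong lastSegmentSize = new AtomicLong(); // grows with concurrent appends

    // Buggy: the two reads can disagree, so an array sized from the first read
    // overflows when indexed under the second -- the reported
    // ArrayIndexOutOfBoundsException.
    public long[] offsetsBuggy(long[] baseOffsets) {
        int n = baseOffsets.length + (lastSegmentSize.get() > 0 ? 1 : 0);
        long[] out = new long[n];
        if (lastSegmentSize.get() > 0)          // may be true now even if it was 0 above
            out[baseOffsets.length] = lastSegmentSize.get();
        return out;
    }

    // Fixed: snapshot once and make every decision from that single observation.
    public long[] offsetsFixed(long[] baseOffsets) {
        long size = lastSegmentSize.get();
        int n = baseOffsets.length + (size > 0 ? 1 : 0);
        long[] out = new long[n];
        if (size > 0)
            out[baseOffsets.length] = size;
        return out;
    }
}
{code}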
[jira] [Commented] (KAFKA-2187) Introduce merge-kafka-pr.py script
[ https://issues.apache.org/jira/browse/KAFKA-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569499#comment-14569499 ]

Neha Narkhede commented on KAFKA-2187:
--------------------------------------
[~ijuma] As I went through the patch again, I caught a few places that still
refer to Spark instead of Kafka. If you can fix those, I'll check it in.

> Introduce merge-kafka-pr.py script
> -----------------------------------
>
>                 Key: KAFKA-2187
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2187
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Ismael Juma
>            Assignee: Ismael Juma
>            Priority: Minor
>         Attachments: KAFKA-2187.patch, KAFKA-2187_2015-05-20_23:14:05.patch
>
> This script will be used to merge GitHub pull requests. It will pull from
> the Apache Git repo to the current branch, squash and merge the PR, push the
> commit to trunk, close the PR (via the commit message), and close the
> relevant JIRA issue (via the JIRA API). Spark has a script that does most
> (if not all) of this, and it will be used as the starting point:
> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-2231) Deleting a topic fails
[ https://issues.apache.org/jira/browse/KAFKA-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566889#comment-14566889 ]

Neha Narkhede commented on KAFKA-2231:
--------------------------------------
[~JGH] Thanks for reporting the issue. Would you mind also trying to
reproduce this on trunk?

> Deleting a topic fails
> -----------------------
>
>                 Key: KAFKA-2231
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2231
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: Windows 8.1
>            Reporter: James G. Haberly
>            Priority: Minor
>
> delete.topic.enable=true is in config\server.properties.
> Using --list shows the topic marked for deletion. Stopping and restarting
> Kafka and ZooKeeper does not delete the topic; it remains marked for
> deletion. Trying to recreate the topic fails with "Topic XXX already
> exists".

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2233) Log deletion is not removing log metrics
[ https://issues.apache.org/jira/browse/KAFKA-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2233:
---------------------------------
    Labels: newbie  (was: )

> Log deletion is not removing log metrics
> -----------------------------------------
>
>                 Key: KAFKA-2233
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2233
>             Project: Kafka
>          Issue Type: Bug
>          Components: log
>    Affects Versions: 0.8.2.1
>            Reporter: Stevo Slavic
>            Assignee: Jay Kreps
>            Priority: Minor
>              Labels: newbie
>
> Topic deletion does not remove associated metrics. Any configured Kafka
> metric reporter that gets triggered after a topic is deleted, when polling
> for log metrics for such deleted logs, will throw something like:
> {noformat}
> java.util.NoSuchElementException
>         at java.util.concurrent.ConcurrentSkipListMap$Iter.advance(ConcurrentSkipListMap.java:2299)
>         at java.util.concurrent.ConcurrentSkipListMap$ValueIterator.next(ConcurrentSkipListMap.java:2326)
>         at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
>         at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>         at scala.collection.AbstractIterable.head(Iterable.scala:54)
>         at kafka.log.Log.logStartOffset(Log.scala:502)
>         at kafka.log.Log$$anon$2.value(Log.scala:86)
>         at kafka.log.Log$$anon$2.value(Log.scala:85)
> {noformat}
> since on log deletion the {{Log}} segments collection gets cleared, so the
> logSegments {{Iterable}} has no (next) elements.
> A known workaround is to restart the broker - as the metric registry is in
> memory, not persisted, on restart it will be recreated with metrics for
> existing/non-deleted topics only.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
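Editor's note: a hypothetical sketch of the shape of the fix, not Kafka's actual code (Kafka registers Yammer gauges): topic deletion must also unregister the per-log gauges so reporters stop polling them.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class LogMetrics {
    private final Map<String, Supplier<Long>> gauges = new ConcurrentHashMap<>();

    public void register(String topicPartition, Supplier<Long> startOffset) {
        gauges.put(topicPartition + ".LogStartOffset", startOffset);
    }

    // The fix: deletion unregisters the gauges; otherwise the next reporter
    // poll dereferences a cleared segment list and throws NoSuchElementException.
    public void onLogDeleted(String topicPartition) {
        gauges.keySet().removeIf(k -> k.startsWith(topicPartition + "."));
    }
}
{code}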
[jira] [Updated] (KAFKA-2187) Introduce merge-kafka-pr.py script
[ https://issues.apache.org/jira/browse/KAFKA-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2187:
---------------------------------
    Reviewer: Neha Narkhede

> Introduce merge-kafka-pr.py script
> -----------------------------------
>
>                 Key: KAFKA-2187
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2187
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Ismael Juma
>            Assignee: Ismael Juma
>            Priority: Minor
>         Attachments: KAFKA-2187.patch, KAFKA-2187_2015-05-20_23:14:05.patch
>
> This script will be used to merge GitHub pull requests. It will pull from
> the Apache Git repo to the current branch, squash and merge the PR, push the
> commit to trunk, close the PR (via the commit message), and close the
> relevant JIRA issue (via the JIRA API). Spark has a script that does most
> (if not all) of this, and it will be used as the starting point:
> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-2187) Introduce merge-kafka-pr.py script
[ https://issues.apache.org/jira/browse/KAFKA-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566884#comment-14566884 ]

Neha Narkhede commented on KAFKA-2187:
--------------------------------------
[~ijuma] This looks great. I can check it in; would you mind updating the
wiki too?
https://cwiki.apache.org/confluence/display/KAFKA/Patch+submission+and+review

> Introduce merge-kafka-pr.py script
> -----------------------------------
>
>                 Key: KAFKA-2187
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2187
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Ismael Juma
>            Assignee: Ismael Juma
>            Priority: Minor
>         Attachments: KAFKA-2187.patch, KAFKA-2187_2015-05-20_23:14:05.patch
>
> This script will be used to merge GitHub pull requests. It will pull from
> the Apache Git repo to the current branch, squash and merge the PR, push the
> commit to trunk, close the PR (via the commit message), and close the
> relevant JIRA issue (via the JIRA API). Spark has a script that does most
> (if not all) of this, and it will be used as the starting point:
> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-2168) New consumer poll() can block other calls like position(), commit(), and close() indefinitely
[ https://issues.apache.org/jira/browse/KAFKA-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562061#comment-14562061 ]

Neha Narkhede commented on KAFKA-2168:
--------------------------------------
There are tradeoffs to having multiple threads per consumer instance vs.
having a consumer instance per thread. The consumer code is simpler and the
throughput is better in the latter design, but the number of TCP connections
is fewer in the former design. Some of the concerns [~ewencp] brings up above
can be mitigated if there is a separate consumer instance per user thread,
and others can be mitigated by the user picking a timeout on poll() that they
are comfortable blocking on. All of this would mean explicitly stating that
the consumer APIs are not threadsafe and that the user should create multiple
consumer instances across threads instead of sharing one. We still need to
make sure close() can be called from a separate thread, as [~ewencp]
correctly points out, though the change isn't large if we go down this route.
It seems like it is simpler to stick to the original intention of the design
and not share consumer instances across threads?

> New consumer poll() can block other calls like position(), commit(), and close() indefinitely
> ----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2168
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2168
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Jason Gustafson
>
> The new consumer is currently using very coarse-grained synchronization.
> For most methods this isn't a problem since they finish quickly once the
> lock is acquired, but poll() might run for a long time (and commonly will,
> since polling with long timeouts is a normal use case). This means any
> operations invoked from another thread may block until the poll() call
> completes.
> Some example use cases where this can be a problem:
> * A shutdown hook is registered to trigger shutdown and invokes close(). It
>   gets invoked from another thread and blocks indefinitely.
> * User wants to manage offset commit themselves in a background thread. If
>   the commit policy is not purely time based, it's not currently possible
>   to make sure the call to commit() will be processed promptly.
> Two possible solutions to this:
> 1. Make sure a lock is not held during the actual select call. Since we
> have multiple layers (KafkaConsumer -> NetworkClient -> Selector -> nio
> Selector), this is probably hard to make work cleanly since locking is
> currently only performed at the KafkaConsumer level and we'd want it
> unlocked around a single line of code in Selector.
> 2. Wake up the selector before synchronizing for certain operations. This
> would require some additional coordination to make sure the caller of
> wakeup() is able to acquire the lock promptly (instead of, e.g., the poll()
> thread being woken up and then promptly reacquiring the lock with a
> subsequent long poll() call).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
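Editor's note: the escape hatch that addresses the shutdown-hook case is KafkaConsumer.wakeup(), the one consumer method that is safe to call from another thread; it forces a blocked poll() to throw WakeupException. A minimal sketch (topic name hypothetical; a real application would also join the threads):

{code}
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class ShutdownHookDemo {
    public static void run(KafkaConsumer<String, String> consumer) {
        // wakeup() interrupts a long poll() from another thread without
        // needing to acquire the consumer's lock.
        Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));
        try {
            consumer.subscribe(Collections.singletonList("input-topic")); // hypothetical topic
            while (true) {
                consumer.poll(Long.MAX_VALUE).forEach(r -> System.out.println(r.value()));
            }
        } catch (WakeupException e) {
            // expected on shutdown; fall through to close
        } finally {
            consumer.close();
        }
    }
}
{code}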
[jira] [Commented] (KAFKA-2197) Controller not able to update state for broker on the same machine
[ https://issues.apache.org/jira/browse/KAFKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545751#comment-14545751 ]

Neha Narkhede commented on KAFKA-2197:
--------------------------------------
[~bedwards] Unassigning myself as I probably wouldn't get around to this, but
you might have better visibility getting to the root cause on the mailing
list (if you already haven't).

> Controller not able to update state for broker on the same machine
> -------------------------------------------------------------------
>
>                 Key: KAFKA-2197
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2197
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.1
>         Environment: docker 1.5, 64-bit Linux (4.0.1-1)
>            Reporter: Ben Edwards
>         Attachments: docker-compose-stripped.yml
>
> I am using Kafka on Docker. When I try to create a topic, the controller
> seems to get stuck and the topic is never usable for consumers or producers.
> {noformat}
> [2015-05-15 15:51:10,201] WARN [Controller-9092-to-broker-9092-send-thread], Controller 9092 epoch 1 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:9092;ControllerEpoch:1;CorrelationId:7;ClientId:id_9092-host_null-port_9092;Leaders:id:9092,host:kafka,port:9092;PartitionState:(lol,0) - (LeaderAndIsrInfo:(Leader:9092,ISR:9092,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:9092) to broker id:9092,host:kafka,port:9092. Reconnecting to broker. (kafka.controller.RequestSendThread)
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> {noformat}
> Repro steps:
> Run docker-compose up with the attached docker-compose yaml file. Enter the
> kafka container (get the name with {{docker ps}}, then
> {{docker exec -it <name> bash}} to enter). Run the following:
> {noformat}
> cd /opt/kafka_2.10-0.8.2.1/
> ./bin/kafka-topics.sh --list --zookeeper zk:2181
> ./bin/kafka-topics.sh --create --zookeeper zk:2181 --topic lol --replicatior-factor 1 --partitions 1
> ./bin/kafka-topics.sh --create --zookeeper zk:2181 --topic testing --replication-factor 1 --partitions 1
> ./bin/kafka-topics.sh --list --zookeeper zk:2181
> tail -f logs/controller.log
> {noformat}
> This should allow you to observe the controller being upset. The zookeeper
> instance is definitely reachable, and the hostnames are correct as far as I
> can tell. I am kind of at a loss as to what is happening.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2197) Controller not able to update state for broker on the same machine
[ https://issues.apache.org/jira/browse/KAFKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2197:
---------------------------------
    Assignee:     (was: Neha Narkhede)

> Controller not able to update state for broker on the same machine
> -------------------------------------------------------------------
>
>                 Key: KAFKA-2197
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2197
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.1
>         Environment: docker 1.5, 64-bit Linux (4.0.1-1)
>            Reporter: Ben Edwards
>         Attachments: docker-compose-stripped.yml
>
> I am using Kafka on Docker. When I try to create a topic, the controller
> seems to get stuck and the topic is never usable for consumers or producers.
> {noformat}
> [2015-05-15 15:51:10,201] WARN [Controller-9092-to-broker-9092-send-thread], Controller 9092 epoch 1 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:9092;ControllerEpoch:1;CorrelationId:7;ClientId:id_9092-host_null-port_9092;Leaders:id:9092,host:kafka,port:9092;PartitionState:(lol,0) - (LeaderAndIsrInfo:(Leader:9092,ISR:9092,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:9092) to broker id:9092,host:kafka,port:9092. Reconnecting to broker. (kafka.controller.RequestSendThread)
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> {noformat}
> Repro steps:
> Run docker-compose up with the attached docker-compose yaml file. Enter the
> kafka container (get the name with {{docker ps}}, then
> {{docker exec -it <name> bash}} to enter). Run the following:
> {noformat}
> cd /opt/kafka_2.10-0.8.2.1/
> ./bin/kafka-topics.sh --list --zookeeper zk:2181
> ./bin/kafka-topics.sh --create --zookeeper zk:2181 --topic lol --replicatior-factor 1 --partitions 1
> ./bin/kafka-topics.sh --create --zookeeper zk:2181 --topic testing --replication-factor 1 --partitions 1
> ./bin/kafka-topics.sh --list --zookeeper zk:2181
> tail -f logs/controller.log
> {noformat}
> This should allow you to observe the controller being upset. The zookeeper
> instance is definitely reachable, and the hostnames are correct as far as I
> can tell. I am kind of at a loss as to what is happening.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-2166) Recreation breaks topic-list
[ https://issues.apache.org/jira/browse/KAFKA-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526860#comment-14526860 ]

Neha Narkhede commented on KAFKA-2166:
--------------------------------------
[~Alien2150] Thanks for reporting this issue. Which Kafka version do you see
this behavior on? Have you tried this on trunk?

> Recreation breaks topic-list
> ----------------------------
>
>                 Key: KAFKA-2166
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2166
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Thomas Zimmer
>
> Hi, here are the steps to reproduce the issue:
> * Create a topic called "test"
> * Delete the topic "test"
> * Recreate the topic "test"
> What will happen is that you will see the topic in the topic list, but it's
> marked as deleted:
> ./kafka-topics.sh --list --zookeeper zookpeer1.dev, zookeeper2.dev
> test - marked for deletion
> Is there a way to fix it without having to delete everything? We also tried
> several restarts.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-824) java.lang.NullPointerException in commitOffsets
[ https://issues.apache.org/jira/browse/KAFKA-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-824:
--------------------------------
    Affects Version/s: 0.8.2.0
             Assignee:     (was: Neha Narkhede)
               Labels: newbie  (was: )

> java.lang.NullPointerException in commitOffsets
> ------------------------------------------------
>
>                 Key: KAFKA-824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-824
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.7.2, 0.8.2.0
>            Reporter: Yonghui Zhao
>              Labels: newbie
>         Attachments: ZkClient.0.3.txt, ZkClient.0.4.txt, screenshot-1.jpg
>
> Neha Narkhede: "Yes, I have. Unfortunately, I never quite got around to
> fixing it. My guess is that it is caused due to a race condition between the
> rebalance thread and the offset commit thread when a rebalance is triggered
> or the client is being shutdown. Do you mind filing a bug?"
> {noformat}
> 2013/03/25 12:08:32.020 WARN [ZookeeperConsumerConnector] [] 0_lu-ml-test10.bj-1364184411339-7c88f710 exception during commitOffsets
> java.lang.NullPointerException
>         at org.I0Itec.zkclient.ZkConnection.writeData(ZkConnection.java:111)
>         at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:813)
>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
>         at org.I0Itec.zkclient.ZkClient.writeData(ZkClient.java:809)
>         at org.I0Itec.zkclient.ZkClient.writeData(ZkClient.java:777)
>         at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:103)
>         at kafka.consumer.ZookeeperConsumerConnector$$anonfun$commitOffsets$2$$anonfun$apply$4.apply(ZookeeperConsumerConnector.scala:251)
>         at kafka.consumer.ZookeeperConsumerConnector$$anonfun$commitOffsets$2$$anonfun$apply$4.apply(ZookeeperConsumerConnector.scala:248)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:549)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
>         at scala.collection.JavaConversions$JCollectionWrapper.foreach(JavaConversions.scala:570)
>         at kafka.consumer.ZookeeperConsumerConnector$$anonfun$commitOffsets$2.apply(ZookeeperConsumerConnector.scala:248)
>         at kafka.consumer.ZookeeperConsumerConnector$$anonfun$commitOffsets$2.apply(ZookeeperConsumerConnector.scala:246)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at kafka.utils.Pool$$anon$1.foreach(Pool.scala:53)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
>         at kafka.utils.Pool.foreach(Pool.scala:24)
>         at kafka.consumer.ZookeeperConsumerConnector.commitOffsets(ZookeeperConsumerConnector.scala:246)
>         at kafka.consumer.ZookeeperConsumerConnector.autoCommit(ZookeeperConsumerConnector.scala:232)
>         at kafka.consumer.ZookeeperConsumerConnector$$anonfun$1.apply$mcV$sp(ZookeeperConsumerConnector.scala:126)
>         at kafka.utils.Utils$$anon$2.run(Utils.scala:58)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (KAFKA-2169) Upgrade to zkclient-0.5
Neha Narkhede created KAFKA-2169:
------------------------------------

             Summary: Upgrade to zkclient-0.5
                 Key: KAFKA-2169
                 URL: https://issues.apache.org/jira/browse/KAFKA-2169
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.2.0
            Reporter: Neha Narkhede
            Priority: Critical

zkclient-0.5 is released
(http://mvnrepository.com/artifact/com.101tec/zkclient/0.5) and has the fix
for KAFKA-824.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2159) offsets.topic.segment.bytes and offsets.topic.retention.minutes are ignored
[ https://issues.apache.org/jira/browse/KAFKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-2159:
---------------------------------
    Labels: newbie  (was: )

> offsets.topic.segment.bytes and offsets.topic.retention.minutes are ignored
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-2159
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2159
>             Project: Kafka
>          Issue Type: Bug
>          Components: offset manager
>    Affects Versions: 0.8.2.1
>            Reporter: Rafał Boniecki
>              Labels: newbie
>
> My broker configuration:
> {quote}
> offsets.topic.num.partitions=20
> offsets.topic.segment.bytes=10485760
> offsets.topic.retention.minutes=10080
> {quote}
> Describe of __consumer_offsets topic:
> {quote}
> Topic:__consumer_offsets  PartitionCount:20  ReplicationFactor:3  Configs:segment.bytes=104857600,cleanup.policy=compact
>   Topic: __consumer_offsets  Partition: 0   Leader: 112  Replicas: 112,212,312  Isr: 212,312,112
>   Topic: __consumer_offsets  Partition: 1   Leader: 212  Replicas: 212,312,412  Isr: 212,312,412
>   Topic: __consumer_offsets  Partition: 2   Leader: 312  Replicas: 312,412,512  Isr: 312,412,512
>   Topic: __consumer_offsets  Partition: 3   Leader: 412  Replicas: 412,512,112  Isr: 412,512,112
>   Topic: __consumer_offsets  Partition: 4   Leader: 512  Replicas: 512,112,212  Isr: 512,212,112
>   Topic: __consumer_offsets  Partition: 5   Leader: 112  Replicas: 112,312,412  Isr: 312,412,112
>   Topic: __consumer_offsets  Partition: 6   Leader: 212  Replicas: 212,412,512  Isr: 212,412,512
>   Topic: __consumer_offsets  Partition: 7   Leader: 312  Replicas: 312,512,112  Isr: 312,512,112
>   Topic: __consumer_offsets  Partition: 8   Leader: 412  Replicas: 412,112,212  Isr: 412,212,112
>   Topic: __consumer_offsets  Partition: 9   Leader: 512  Replicas: 512,212,312  Isr: 512,212,312
>   Topic: __consumer_offsets  Partition: 10  Leader: 112  Replicas: 112,412,512  Isr: 412,512,112
>   Topic: __consumer_offsets  Partition: 11  Leader: 212  Replicas: 212,512,112  Isr: 212,512,112
>   Topic: __consumer_offsets  Partition: 12  Leader: 312  Replicas: 312,112,212  Isr: 312,212,112
>   Topic: __consumer_offsets  Partition: 13  Leader: 412  Replicas: 412,212,312  Isr: 412,212,312
>   Topic: __consumer_offsets  Partition: 14  Leader: 512  Replicas: 512,312,412  Isr: 512,312,412
>   Topic: __consumer_offsets  Partition: 15  Leader: 112  Replicas: 112,512,212  Isr: 512,212,112
>   Topic: __consumer_offsets  Partition: 16  Leader: 212  Replicas: 212,112,312  Isr: 212,312,112
>   Topic: __consumer_offsets  Partition: 17  Leader: 312  Replicas: 312,212,412  Isr: 312,212,412
>   Topic: __consumer_offsets  Partition: 18  Leader: 412  Replicas: 412,312,512  Isr: 412,312,512
>   Topic: __consumer_offsets  Partition: 19  Leader: 512  Replicas: 512,412,112  Isr: 512,412,112
> {quote}
> OffsetManager logs:
> {quote}
> 2015-04-29 17:58:43:403 CEST DEBUG [kafka-scheduler-3][kafka.server.OffsetManager] Compacting offsets cache.
> 2015-04-29 17:58:43:403 CEST DEBUG [kafka-scheduler-3][kafka.server.OffsetManager] Found 1 stale offsets (older than 8640 ms).
> 2015-04-29 17:58:43:404 CEST TRACE [kafka-scheduler-3][kafka.server.OffsetManager] Removing stale offset and metadata for [drafts,tasks,1]: OffsetAndMetadata[824,consumer_id = drafts, time = 1430322433,0]
> 2015-04-29 17:58:43:404 CEST TRACE [kafka-scheduler-3][kafka.server.OffsetManager] Marked 1 offsets in [__consumer_offsets,2] for deletion.
> 2015-04-29 17:58:43:404 CEST DEBUG [kafka-scheduler-3][kafka.server.OffsetManager] Removed 1 stale offsets in 1 milliseconds.
> {quote}
> Parameters are ignored and default values are used instead.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-2159) offsets.topic.segment.bytes and offsets.topic.retention.minutes are ignored
[ https://issues.apache.org/jira/browse/KAFKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2159: - Component/s: offset manager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2156) Possibility to plug in custom MetricRegistry
[ https://issues.apache.org/jira/browse/KAFKA-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526932#comment-14526932 ] Neha Narkhede commented on KAFKA-2156: -- [~sandris] KM = Kafka Metrics (org.apache.kafka.common.metrics) :) Worth considering supporting this directly on Kafka Metrics. Possibility to plug in custom MetricRegistry Key: KAFKA-2156 URL: https://issues.apache.org/jira/browse/KAFKA-2156 Project: Kafka Issue Type: Improvement Components: producer Affects Versions: 0.8.1.2 Reporter: Andras Sereny Assignee: Jun Rao The trait KafkaMetricsGroup refers to Metrics.defaultRegistry() throughout. It would be nice to be able to inject any MetricsRegistry instead of the default one. (My use case is that I'd like to channel Kafka metrics into our application's metrics system, for which I'd need custom implementations of com.yammer.metrics.core.Metric.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
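A minimal sketch of what the requested injection could look like, against the yammer metrics 2.x API that KafkaMetricsGroup builds on. The overridable metricsRegistry member is hypothetical, not an existing Kafka extension point:
{code}
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName, MetricsRegistry}

trait InjectableMetricsGroup {
  // Hypothetical extension point: today KafkaMetricsGroup calls
  // Metrics.defaultRegistry() directly; making the registry an overridable
  // member would let applications supply their own.
  protected def metricsRegistry: MetricsRegistry = Metrics.defaultRegistry()

  def newGauge[T](name: MetricName, gauge: Gauge[T]): Gauge[T] =
    metricsRegistry.newGauge(name, gauge)
}

// An application could then channel Kafka metrics into its own system:
object MyMetricsGroup extends InjectableMetricsGroup {
  override protected val metricsRegistry: MetricsRegistry = new MetricsRegistry()
}
{code}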
[jira] [Comment Edited] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526945#comment-14526945 ] Neha Narkhede edited comment on KAFKA-1387 at 5/4/15 5:58 PM: -- When this happens, there isn't a way to get out of this without killing the broker. Marking it as a blocker. was (Author: nehanarkhede): When this happens, there isn't a way to get out of this without killing the broker. Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time - Key: KAFKA-1387 URL: https://issues.apache.org/jira/browse/KAFKA-1387 Project: Kafka Issue Type: Bug Affects Versions: 0.8.1.1 Reporter: Fedor Korotkiy Priority: Blocker Labels: newbie, patch Attachments: kafka-1387.patch Kafka broker re-registers itself in zookeeper every time the handleNewSession() callback is invoked. https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala Now imagine the following sequence of events.
1) Zookeeper session reestablishes. The handleNewSession() callback is queued by the zkClient, but not invoked yet.
2) Zookeeper session reestablishes again, queueing the callback a second time.
3) The first callback is invoked, creating the /broker/[id] ephemeral path.
4) The second callback is invoked and tries to create the /broker/[id] path using the createEphemeralPathExpectConflictHandleZKBug() function. But the path already exists, so createEphemeralPathExpectConflictHandleZKBug() gets stuck in an infinite loop.
The controller election code seems to have the same issue. I'm able to reproduce this issue on the 0.8.1 branch from GitHub using the following configs.
# zookeeper
tickTime=10
dataDir=/tmp/zk/
clientPort=2101
maxClientCnxns=0
# kafka
broker.id=1
log.dir=/tmp/kafka
zookeeper.connect=localhost:2101
zookeeper.connection.timeout.ms=100
zookeeper.session.timeout.ms=100
Just start kafka and zookeeper and then pause zookeeper several times using Ctrl-Z. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
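For illustration, a sketch of one way out of the loop (hypothetical, written against the raw ZooKeeper client rather than zkclient): on a conflict, check whether the existing ephemeral node belongs to the current session before retrying:
{code}
import org.apache.zookeeper.{CreateMode, KeeperException, ZooDefs, ZooKeeper}
import org.apache.zookeeper.data.Stat

def createEphemeralChecked(zk: ZooKeeper, path: String, data: Array[Byte]): Unit = {
  try {
    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
  } catch {
    case _: KeeperException.NodeExistsException =>
      val stat = new Stat()
      zk.getData(path, false, stat)
      if (stat.getEphemeralOwner != zk.getSessionId)
        throw new IllegalStateException(s"$path is owned by another session")
      // Otherwise the node was created by an earlier callback of this same
      // session (the double handleNewSession() case above): nothing to do.
  }
}
{code}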
[jira] [Updated] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1387: - Priority: Blocker (was: Major) When this happens, there isn't a way to get out of this without killing the broker. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1387: - Labels: newbie patch zkclient-problems (was: newbie patch) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2157) kafka-console-consumer.sh: Mismatch in CLI usage docs vs. Scala Option parsing
[ https://issues.apache.org/jira/browse/KAFKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526972#comment-14526972 ] Neha Narkhede commented on KAFKA-2157: -- [~tvaughan77] topics is correct. Patch is appreciated :) kafka-console-consumer.sh: Mismatch in CLI usage docs vs. Scala Option parsing Key: KAFKA-2157 URL: https://issues.apache.org/jira/browse/KAFKA-2157 Project: Kafka Issue Type: Bug Components: core Affects Versions: 0.8.1.1 Reporter: Thomas Vaughan Priority: Minor I built kafka-0.8.2.1 from source and noticed there's a mismatch between the command line usage help and the actual arguments expected by the ConsumerPerformance$ConsumerPerfConfig class. On the command line, if you run bin/kafka-console-consumer.sh with no arguments, it claims you need these required fields:
{code}
--broker-list hostname:port,.., REQUIRED: broker info (the list of hostname:port broker host and port for bootstrap.
...
--topics topic1,topic2.. REQUIRED: The comma separated list of topics to produce to
{code}
Supplying the script with those flags causes joptsimple.OptionException.unrecognizedOption exceptions, because what's really needed are the \--zookeeper and \--topic (singular, not plural) options. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KAFKA-2155) Add option to control ZK root for kafka.tools.ConsumerOffsetChecker
[ https://issues.apache.org/jira/browse/KAFKA-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede resolved KAFKA-2155. -- Resolution: Not A Problem Add option to control ZK root for kafka.tools.ConsumerOffsetChecker --- Key: KAFKA-2155 URL: https://issues.apache.org/jira/browse/KAFKA-2155 Project: Kafka Issue Type: Improvement Components: admin Affects Versions: 0.8.1.2 Environment: Hortonworks 2.2.4 with Kafka 0.8.1.2.2 Reporter: Kjell Tore Fossbakk Priority: Minor Labels: features Hello. We need an option on kafka.tools.ConsumerOffsetChecker that allows control of the ZK root, which is at the moment hardcoded to /consumers/. A new option --zkroot would simply replace the /consumers value, which currently defaultsTo(/consumers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2155) Add option to control ZK root for kafka.tools.ConsumerOffsetChecker
[ https://issues.apache.org/jira/browse/KAFKA-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526978#comment-14526978 ] Neha Narkhede commented on KAFKA-2155: -- You can just specify the zk root as part of the ZK connection string. {code} ./bin/kafka-consumer-offset-checker.sh --zookeeper localhost:2181/namespace {code} Add option to control ZK root for kafka.tools.ConsumerOffsetChecker --- Key: KAFKA-2155 URL: https://issues.apache.org/jira/browse/KAFKA-2155 Project: Kafka Issue Type: Improvement Components: admin Affects Versions: 0.8.1.2 Environment: Hortonworks 2.2.4 with Kafka 0.8.1.2.2 Reporter: Kjell Tore Fossbakk Priority: Minor Labels: features Hello. We need to add an option to kafka.tools.ConsumerOffsetChecker which will allow the control of ZK root. It is at the moment hardcoded to /consumers/. A new option --zkroot would simply replace the contents of /consumers, which defaultsTo(/consumers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1843) Metadata fetch/refresh in new producer should handle all node connection states gracefully
[ https://issues.apache.org/jira/browse/KAFKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1843: - Labels: newbie++ (was: ) Metadata fetch/refresh in new producer should handle all node connection states gracefully -- Key: KAFKA-1843 URL: https://issues.apache.org/jira/browse/KAFKA-1843 Project: Kafka Issue Type: Bug Components: clients, producer Affects Versions: 0.8.2.0 Reporter: Ewen Cheslack-Postava Priority: Blocker Labels: newbie++ KAFKA-1642 resolved some issues with the handling of broker connection states to avoid high CPU usage, but made the minimal fix rather than the ideal one. The code for handling the metadata fetch is difficult to get right because it has to handle a lot of possible connectivity states and failure modes across all the known nodes. It also needs to correctly integrate with the surrounding event loop, providing correct poll() timeouts to both avoid busy looping and make sure it wakes up and tries new nodes in the face of both connection and request failures. A patch here should address a few issues:
1. Make sure connection timeouts, as implemented in KAFKA-1842, are cleanly integrated. This mostly means that when a connecting node is selected to fetch metadata from, the code notices that and sets the next timeout based on the connection timeout rather than some other backoff.
2. Rethink the logic and naming of NetworkClient.leastLoadedNode. That method actually takes into account a) the current connectivity of each node, b) whether the node had a recent connection failure, and c) the load in terms of in-flight requests. It also needs to ensure that different clients don't use the same ordering across multiple calls (which is already addressed in the current code by nodeIndexOffset) and that we always eventually try all nodes in the face of connection failures (which isn't currently handled by leastLoadedNode and probably cannot be without tracking additional state). This method also has to work for new consumer use cases even though it is currently only used by the new producer's metadata fetch. Finally, it has to properly handle when other code calls initiateConnect(), since the normal path for sending messages also initiates connections. We can already say that there is an order of preference given a single call (as follows), but making this work across multiple calls when some initial choices fail to connect or return metadata *and* connection states may be changing is much more difficult.
* Connected, zero in flight requests - the request can be sent immediately
* Connecting node - it will hopefully be connected very soon and by definition has no in flight requests
* Disconnected - same reasoning as for a connecting node
* Connected, one or more in flight requests - we consider any number of in-flight requests as a big enough backlog to delay the request a lot. We could use an approach that better accounts for the number of in-flight requests rather than just turning it into a boolean variable, but that probably introduces more complexity than it is worth.
3. The most difficult case to handle so far has been when leastLoadedNode returns a disconnected node to maybeUpdateMetadata as its best option. Properly handling the two resulting cases (initiateConnect fails immediately vs. taking some time to possibly establish the connection) is tricky.
4. Consider optimizing for the failure cases. The most common cases are when you already have an active connection and can immediately get the metadata, or when you need to establish a connection but the connection and metadata request/response happen very quickly. These cases occur infrequently enough (metadata refresh defaults to every 5 min) that establishing an extra connection isn't a big deal as long as it's eventually cleaned up. The edge cases, like network partitions where some subset of nodes become unreachable for a long period, are harder to reason about, but we should be sure we will always be able to gracefully recover from them. KAFKA-1642 enumerated the possible outcomes of a single call to maybeUpdateMetadata. A good fix for this would consider all of those outcomes for repeated calls to maybeUpdateMetadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
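The preference order in point 2 can be made concrete with a small sketch (illustrative only; NetworkClient's real implementation must also handle backoff, nodeIndexOffset rotation, and states changing across calls):
{code}
case class NodeState(connected: Boolean, connecting: Boolean, inFlight: Int)

// Lower score = more preferred, mirroring the bullet list above.
def preference(s: NodeState): Int = s match {
  case NodeState(true, _, 0)      => 0 // connected and idle: send immediately
  case NodeState(_, true, _)      => 1 // connecting: hopefully ready soon
  case NodeState(false, false, _) => 2 // disconnected: worth a connect attempt
  case _                          => 3 // connected with a backlog of requests
}

def leastLoadedNode(nodes: Seq[(Int, NodeState)]): Option[Int] =
  nodes.sortBy { case (_, s) => preference(s) }.headOption.map(_._1)
{code}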
[jira] [Updated] (KAFKA-1843) Metadata fetch/refresh in new producer should handle all node connection states gracefully
[ https://issues.apache.org/jira/browse/KAFKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1843: - Priority: Blocker (was: Major) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2139) Add a separate controller message queue with higher priority on broker side
[ https://issues.apache.org/jira/browse/KAFKA-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527048#comment-14527048 ] Neha Narkhede commented on KAFKA-2139: -- [~becket_qin] Left some comments regarding the design on the wiki. Add a separate controller message queue with higher priority on broker side --- Key: KAFKA-2139 URL: https://issues.apache.org/jira/browse/KAFKA-2139 Project: Kafka Issue Type: Improvement Reporter: Jiangjie Qin Assignee: Jiangjie Qin This ticket is supposed to work together with KAFKA-2029. There are two issues with the current controller-to-broker messages. 1. On the controller side the messages are sent without synchronization. 2. On the broker side the controller messages share the same queue as client messages. The problem is that brokers process the controller messages for the same partition at different times, and the variation can be big. This causes unnecessary data loss and prolongs preferred leader election / controlled shutdown / partition reassignment, etc. KAFKA-2029 was trying to add a boundary between messages for different partitions. For example, before leader migration for the previous partition finishes, the leader migration for the next partition won't begin. This ticket is trying to let the broker process controller messages faster. The idea is to have a separate queue to hold controller messages: if there are controller messages, the KafkaApis thread will first take care of those, otherwise it will process messages from clients. These two tickets are not the ultimate solution to the current controller problems, but mitigate them with minor code changes. Moving forward, we still need to think about rewriting the controller in a cleaner way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
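A sketch of the queueing idea (names and structure are illustrative, not the eventual implementation): request handlers drain a dedicated controller queue before taking client requests:
{code}
import java.util.concurrent.LinkedBlockingQueue

class PrioritizedRequestChannel[R] {
  private val controllerQueue = new LinkedBlockingQueue[R]()
  private val clientQueue     = new LinkedBlockingQueue[R]()

  def sendControllerRequest(r: R): Unit = controllerQueue.put(r)
  def sendClientRequest(r: R): Unit     = clientQueue.put(r)

  // Controller messages win whenever any are pending. A real implementation
  // would also need to wake up from the client take() when a controller
  // request arrives; this only illustrates the priority.
  def receive(): R = {
    val pending = controllerQueue.poll()
    if (pending != null) pending else clientQueue.take()
  }
}
{code}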
[jira] [Updated] (KAFKA-1843) Metadata fetch/refresh in new producer should handle all node connection states gracefully
[ https://issues.apache.org/jira/browse/KAFKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1843: - Labels: (was: newbie++) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KAFKA-1886) SimpleConsumer swallowing ClosedByInterruptException
[ https://issues.apache.org/jira/browse/KAFKA-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede resolved KAFKA-1886. -- Resolution: Fixed Thanks, pushed to trunk SimpleConsumer swallowing ClosedByInterruptException Key: KAFKA-1886 URL: https://issues.apache.org/jira/browse/KAFKA-1886 Project: Kafka Issue Type: Bug Components: producer Reporter: Aditya A Auradkar Assignee: Aditya Auradkar Attachments: KAFKA-1886.patch, KAFKA-1886.patch, KAFKA-1886_2015-02-02_13:57:23.patch, KAFKA-1886_2015-04-28_10:27:39.patch This issue was originally reported by a Samza developer. I've included an exchange of mine with Chris Riccomini. I'm trying to reproduce the problem on my dev setup. From: criccomi Hey all, Samza's BrokerProxy [1] threads appear to be wedging randomly when we try to interrupt its fetcher thread. I noticed that SimpleConsumer.scala catches Throwable in its sendRequest method [2]. I'm wondering: if blockingChannel.send/receive throws a ClosedByInterruptException when the thread is interrupted, what happens? It looks like sendRequest will catch the exception (which I think clears the thread's interrupted flag), and then retries the send. If the send succeeds on the retry, I think that the ClosedByInterruptException exception is effectively swallowed, and the BrokerProxy will continue fetching messages as though its thread was never interrupted. Am I misunderstanding how things work? Cheers, Chris [1] https://github.com/apache/incubator-samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/BrokerProxy.scala#L126 [2] https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/consumer/SimpleConsumer.scala#L75 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
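The hazard and the fix can be sketched as follows (an assumed shape, not SimpleConsumer's exact code): handle ClosedByInterruptException before the blanket Throwable catch and restore the interrupt flag:
{code}
import java.nio.channels.ClosedByInterruptException

def sendRequest[T](attempt: => T)(reconnectAndRetry: => T): T =
  try attempt
  catch {
    case e: ClosedByInterruptException =>
      Thread.currentThread().interrupt() // preserve the interruption for callers
      throw e
    case _: Throwable =>
      reconnectAndRetry // previous behavior: reconnect and retry the send
  }
{code}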
[jira] [Updated] (KAFKA-2158) Close all fetchers in AbstractFetcherManager without blocking
[ https://issues.apache.org/jira/browse/KAFKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2158: - Reviewer: Jun Rao Close all fetchers in AbstractFetcherManager without blocking - Key: KAFKA-2158 URL: https://issues.apache.org/jira/browse/KAFKA-2158 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 0.8.2.0 Reporter: Jiasheng Wang Attachments: KAFKA-2158.patch
{code}
def closeAllFetchers() {
  mapLock synchronized {
    for ((_, fetcher) <- fetcherThreadMap) {
      fetcher.shutdown()
    }
    fetcherThreadMap.clear()
  }
}
{code}
closeAllFetchers() in AbstractFetcherManager.scala is time consuming because each fetcher's shutdown() call blocks until awaitShutdown() returns. As a result it slows down the restart of the kafka service. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
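A sketch of the suggested improvement, assuming shutdown() can be split into the two phases that ShutdownableThread exposes (initiateShutdown()/awaitShutdown()), and reusing the mapLock/fetcherThreadMap members from the snippet above: signal all fetchers first so the waits overlap instead of serializing:
{code}
def closeAllFetchers() {
  mapLock synchronized {
    for ((_, fetcher) <- fetcherThreadMap)
      fetcher.initiateShutdown()   // non-blocking: just flag the thread to stop
    for ((_, fetcher) <- fetcherThreadMap)
      fetcher.awaitShutdown()      // each wait now overlaps with the others
    fetcherThreadMap.clear()
  }
}
{code}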
[jira] [Updated] (KAFKA-2153) kafka-patch-review tool uploads a patch even if it is empty
[ https://issues.apache.org/jira/browse/KAFKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2153: - Resolution: Fixed Status: Resolved (was: Patch Available) Thanks, pushed to trunk. kafka-patch-review tool uploads a patch even if it is empty --- Key: KAFKA-2153 URL: https://issues.apache.org/jira/browse/KAFKA-2153 Project: Kafka Issue Type: Bug Reporter: Ashish K Singh Assignee: Ashish K Singh Attachments: KAFKA-2153.patch The kafka-patch-review tool is great and a big help. However, sometimes one forgets to commit the changes made and runs this tool. The tool ends up uploading an empty patch. It would be nice to catch this and notify the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-313) Add JSON/CSV output and looping options to ConsumerGroupCommand
[ https://issues.apache.org/jira/browse/KAFKA-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527075#comment-14527075 ] Neha Narkhede commented on KAFKA-313: - [~singhashish] I'm sorry for the delay. Not sure how this slipped through the list that [~junrao] is maintaining for patches under review. We really need our review-nag script :) This just crossed my mind. Since the last time you uploaded this patch, we started the KIP review process for all user-facing changes. My guess is that changes to user-facing tools are included in that. Would you mind starting a quick KIP discussion on this? It should be a quick vote. I can help merge your patch right after (we might need a rebase). Add JSON/CSV output and looping options to ConsumerGroupCommand --- Key: KAFKA-313 URL: https://issues.apache.org/jira/browse/KAFKA-313 Project: Kafka Issue Type: Improvement Reporter: Dave DeMaagd Assignee: Ashish K Singh Priority: Minor Labels: newbie, patch Fix For: 0.8.3 Attachments: KAFKA-313-2012032200.diff, KAFKA-313.1.patch, KAFKA-313.patch, KAFKA-313_2015-02-23_18:11:32.patch Adds:
* '--loop N' - causes the program to loop forever, sleeping for up to N seconds between loops (loop time minus collection time, unless that's less than 0, at which point it will just run again immediately)
* '--asjson' - display as a JSON string instead of the more human readable output format.
Neither option depends on the other (you can loop in the human readable output, or do a single-shot execution with JSON output). Existing behavior/output is maintained if neither of the above is used. Diff attached. Impacted files: core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332)
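The --loop timing described above amounts to the following (an illustrative sketch, not the patch's code):
{code}
// Rerun forever, sleeping for the remainder of the interval after each
// collection; if the collection took longer than the interval, run again
// immediately.
def loopForever(intervalMs: Long)(collect: () => Unit): Unit =
  while (true) {
    val start = System.currentTimeMillis()
    collect()
    val elapsed = System.currentTimeMillis() - start
    Thread.sleep(math.max(0L, intervalMs - elapsed))
  }
{code}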
[jira] [Updated] (KAFKA-1621) Standardize --messages option in perf scripts
[ https://issues.apache.org/jira/browse/KAFKA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1621: - Resolution: Fixed Assignee: Rekha Joshi Status: Resolved (was: Patch Available) Merged PR #58 Standardize --messages option in perf scripts - Key: KAFKA-1621 URL: https://issues.apache.org/jira/browse/KAFKA-1621 Project: Kafka Issue Type: Bug Affects Versions: 0.8.1.1 Reporter: Jay Kreps Assignee: Rekha Joshi Labels: newbie This option is specified in PerfConfig and is used by the producer, consumer and simple consumer perf commands. The docstring on the argument does not list it as required but the producer performance test requires it--others don't. We should standardize this so that either all the commands require the option and it is marked as required in the docstring or none of them list it as required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1367) Broker topic metadata not kept in sync with ZooKeeper
[ https://issues.apache.org/jira/browse/KAFKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516298#comment-14516298 ] Neha Narkhede commented on KAFKA-1367: -- [~jjkoshy] That would work too but looks like [~abiletskyi] is suggesting that it is not included as part of KIP-4. Maybe we can have whoever picks this JIRA discuss this change as part of a separate KIP? Broker topic metadata not kept in sync with ZooKeeper - Key: KAFKA-1367 URL: https://issues.apache.org/jira/browse/KAFKA-1367 Project: Kafka Issue Type: Bug Affects Versions: 0.8.0, 0.8.1 Reporter: Ryan Berdeen Assignee: Ashish K Singh Labels: newbie++ Fix For: 0.8.3 Attachments: KAFKA-1367.txt When a broker is restarted, the topic metadata responses from the brokers will be incorrect (different from ZooKeeper) until a preferred replica leader election. In the metadata, it looks like leaders are correctly removed from the ISR when a broker disappears, but followers are not. Then, when a broker reappears, the ISR is never updated. I used a variation of the Vagrant setup created by Joe Stein to reproduce this with latest from the 0.8.1 branch: https://github.com/also/kafka/commit/dba36a503a5e22ea039df0f9852560b4fb1e067c -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1054) Eliminate Compilation Warnings for 0.8 Final Release
[ https://issues.apache.org/jira/browse/KAFKA-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1054: - Resolution: Fixed Status: Resolved (was: Patch Available) Thanks. Merged the github PR. Eliminate Compilation Warnings for 0.8 Final Release Key: KAFKA-1054 URL: https://issues.apache.org/jira/browse/KAFKA-1054 Project: Kafka Issue Type: Improvement Reporter: Guozhang Wang Assignee: Ismael Juma Labels: newbie Fix For: 0.9.0 Attachments: KAFKA-1054-20150426-V1.patch, KAFKA-1054-20150426-V2.patch, KAFKA-1054.patch, KAFKA-1054_Mar_10_2015.patch Currently we have a total of 38 warnings for source code compilation of 0.8:
1) 3 from unchecked type patterns
2) 6 from unchecked conversions
3) 29 from deprecated Hadoop API functions
It would be better to fix these before the final release of 0.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1748) Decouple system test cluster resources definition from service definitions
[ https://issues.apache.org/jira/browse/KAFKA-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513111#comment-14513111 ] Neha Narkhede commented on KAFKA-1748: -- [~ewencp] With the ducktape work, I guess we can close this? Decouple system test cluster resources definition from service definitions -- Key: KAFKA-1748 URL: https://issues.apache.org/jira/browse/KAFKA-1748 Project: Kafka Issue Type: Bug Affects Versions: 0.8.1.1 Reporter: Ewen Cheslack-Postava Assignee: Ewen Cheslack-Postava Attachments: KAFKA-1748.patch, KAFKA-1748_2014-11-03_12:04:18.patch, KAFKA-1748_2014-11-14_14:54:17.patch Currently the system tests use JSON files that specify the set of services for each test and where they should run (i.e. hostname). These currently assume that you already have SSH keys setup, use the same username on the host running the tests and the test cluster, don't require any additional ssh/scp/rsync flags, and assume you'll always have a fixed set of compute resources (or that you'll spend a lot of time editing config files). While we don't want a whole cluster resource manager in the system tests, a bit more flexibility would make it easier to, e.g., run tests against a local vagrant cluster or on dynamically allocated EC2 instances. We can separate out the basic resource spec (i.e. json specifying how to access machines) from the service definition (i.e. a broker should run with settings x, y, z). Restricting to a very simple set of mappings (i.e. map services to hosts with round robin, optionally restricting to no reuse of hosts) should keep things simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2034) sourceCompatibility not set in Kafka build.gradle
[ https://issues.apache.org/jira/browse/KAFKA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2034: - Resolution: Fixed Reviewer: Ewen Cheslack-Postava Assignee: Ismael Juma Status: Resolved (was: Patch Available) Thanks for the patch. Pushed to trunk. sourceCompatibility not set in Kafka build.gradle - Key: KAFKA-2034 URL: https://issues.apache.org/jira/browse/KAFKA-2034 Project: Kafka Issue Type: Bug Components: build Affects Versions: 0.8.2.1 Reporter: Derek Bassett Assignee: Ismael Juma Priority: Minor Labels: newbie Attachments: KAFKA-2034.patch Original Estimate: 4h Remaining Estimate: 4h The Kafka build.gradle does not explicitly set the sourceCompatibility version. As a result, Kafka built with Java 1.8 incorrectly emits class files targeting the wrong version, and Java 1.8-only features could be merged into Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1351) String.format is very expensive in Scala
[ https://issues.apache.org/jira/browse/KAFKA-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513096#comment-14513096 ] Neha Narkhede commented on KAFKA-1351: -- Not sure that this is still a concern. I was hoping whoever picks this up can do a quick microbenchmark to see if the suggestion in the description is really worth a change or not. String.format is very expensive in Scala Key: KAFKA-1351 URL: https://issues.apache.org/jira/browse/KAFKA-1351 Project: Kafka Issue Type: Bug Components: core Affects Versions: 0.7.2, 0.8.0, 0.8.1 Reporter: Neha Narkhede Labels: newbie Fix For: 0.8.3 Attachments: KAFKA-1351.patch, KAFKA-1351_2014-04-07_18:02:18.patch, KAFKA-1351_2014-04-09_15:40:11.patch As found in KAFKA-1350, logging is causing significant overhead in the performance of a Kafka server. There are several info statements that use String.format which is particularly expensive. We should investigate adding our own version of String.format that merely uses string concatenation under the covers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
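A quick microbenchmark along the lines requested above might look like this (a rough sketch; JMH with proper warmup would be the rigorous way to settle it):
{code}
object FormatBench extends App {
  // Time one million invocations of the given operation.
  def time(label: String)(op: => Unit): Unit = {
    val start = System.nanoTime()
    var i = 0
    while (i < 1000000) { op; i += 1 }
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms for 1M calls")
  }

  val topic = "my-topic"
  val partition = 7
  time("String.format")("Fetching offset for [%s,%d]".format(topic, partition))
  time("concatenation")("Fetching offset for [" + topic + "," + partition + "]")
}
{code}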
[jira] [Resolved] (KAFKA-1293) Mirror maker housecleaning
[ https://issues.apache.org/jira/browse/KAFKA-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede resolved KAFKA-1293. -- Resolution: Fixed Closing based on [~mwarhaftig]'s latest comment. Mirror maker housecleaning -- Key: KAFKA-1293 URL: https://issues.apache.org/jira/browse/KAFKA-1293 Project: Kafka Issue Type: Bug Components: tools Affects Versions: 0.8.1 Reporter: Jay Kreps Priority: Minor Labels: usability Attachments: KAFKA-1293.patch Mirror maker uses its own convention for command-line arguments, e.g. --num.producers, whereas everything else follows the unix convention like --num-producers. This is annoying because when running different tools you have to constantly remember the quirks of whoever wrote that tool. Mirror maker should also have a top-level wrapper script in bin/ to make tab completion work and so you don't have to remember the fully qualified class name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KAFKA-1277) Keep the summary/description when updating the RB with kafka-patch-review
[ https://issues.apache.org/jira/browse/KAFKA-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede resolved KAFKA-1277. -- Resolution: Incomplete Closing due to inactivity. Keep the summary/description when updating the RB with kafka-patch-review - Key: KAFKA-1277 URL: https://issues.apache.org/jira/browse/KAFKA-1277 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Assignee: Manikumar Reddy Attachments: KAFKA-1277.patch, KAFKA-1277.patch, KAFKA-1277.patch, KAFKA-1277_2014-10-04_16:39:56.patch, KAFKA-1277_2014-10-04_16:51:20.patch, KAFKA-1277_2014-10-04_16:57:30.patch, KAFKA-1277_2014-10-04_17:00:37.patch, KAFKA-1277_2014-10-04_17:01:43.patch, KAFKA-1277_2014-10-04_17:03:08.patch, KAFKA-1277_2014-10-04_17:09:02.patch, KAFKA-1277_2014-10-05_11:04:33.patch, KAFKA-1277_2014-10-05_11:09:08.patch, KAFKA-1277_2014-10-05_11:10:50.patch, KAFKA-1277_2014-10-05_11:18:17.patch Today the kafka-patch-review tool will always use a default title and description if they are not specified, even when updating an existing RB. It would be better to leave the current title/description as is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2140) Improve code readability
[ https://issues.apache.org/jira/browse/KAFKA-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513094#comment-14513094 ] Neha Narkhede commented on KAFKA-2140: -- +1 Improve code readability Key: KAFKA-2140 URL: https://issues.apache.org/jira/browse/KAFKA-2140 Project: Kafka Issue Type: Improvement Reporter: Ismael Juma Assignee: Ismael Juma Priority: Minor Attachments: KAFKA-2140.patch There are a number of places where code could be written in a more readable and idiomatic form. It's easier to explain with a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1737) Document required ZkSerializer for ZkClient used with AdminUtils
[ https://issues.apache.org/jira/browse/KAFKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513312#comment-14513312 ] Neha Narkhede commented on KAFKA-1737: -- [~vivekpm] Makes sense. Would you like to upload what you have? Document required ZkSerializer for ZkClient used with AdminUtils Key: KAFKA-1737 URL: https://issues.apache.org/jira/browse/KAFKA-1737 Project: Kafka Issue Type: Improvement Components: tools Affects Versions: 0.8.1.1 Reporter: Stevo Slavic Assignee: Vivek Madani Priority: Minor Attachments: KAFKA-1737.patch {{ZkClient}} instances passed to {{AdminUtils}} calls must have {{kafka.utils.ZKStringSerializer}} set as {{ZkSerializer}}. Otherwise commands executed via {{AdminUtils}} may not be seen/recognizable to broker, producer or consumer. E.g. producer (with auto topic creation turned off) will not be able to send messages to a topic created via {{AdminUtils}}, it will throw {{UnknownTopicOrPartitionException}}. Please consider at least documenting this requirement in {{AdminUtils}} scaladoc. For more info see [related discussion on Kafka user mailing list|http://mail-archives.apache.org/mod_mbox/kafka-users/201410.mbox/%3CCAAUywg-oihNiXuQRYeS%3D8Z3ymsmEHo6ghLs%3Dru4nbm%2BdHVz6TA%40mail.gmail.com%3E]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
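In code, the requirement amounts to the following usage sketch against the 0.8.x API (topic name and counts are illustrative):
{code}
import java.util.Properties
import kafka.admin.AdminUtils
import kafka.utils.ZKStringSerializer
import org.I0Itec.zkclient.ZkClient

// Without the ZKStringSerializer argument, data written through AdminUtils
// uses zkclient's default Java serialization and brokers, producers, and
// consumers will not recognize it.
val zkClient = new ZkClient("localhost:2181", 30000, 30000, ZKStringSerializer)
AdminUtils.createTopic(zkClient, "my-topic", 4, 2, new Properties())
zkClient.close()
{code}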
[jira] [Updated] (KAFKA-2080) quick cleanup of producer performance scripts
[ https://issues.apache.org/jira/browse/KAFKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2080: - Component/s: tools quick cleanup of producer performance scripts - Key: KAFKA-2080 URL: https://issues.apache.org/jira/browse/KAFKA-2080 Project: Kafka Issue Type: Bug Components: tools Reporter: Gwen Shapira Assignee: Gwen Shapira Labels: newbie We have two producer performance tools at the moment: one at o.a.k.client.tools and one at kafka.tools bin/kafka-producer-perf-test.sh is calling the kafka.tools one. org.apache.kafka.clients.tools.ProducerPerformance has --messages listed as optional (with default) while leaving the parameter out results in an error. Cleanup will include: * Removing the kafka.tools performance tool * Changing the shellscript to use new tool * Fix the misleading documentation for --messages * Adding both performance tools to the kafka docs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2041) Add ability to specify a KeyClass for KafkaLog4jAppender
[ https://issues.apache.org/jira/browse/KAFKA-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2041: - Status: Patch Available (was: Open) Add ability to specify a KeyClass for KafkaLog4jAppender Key: KAFKA-2041 URL: https://issues.apache.org/jira/browse/KAFKA-2041 Project: Kafka Issue Type: Improvement Components: producer Reporter: Benoy Antony Assignee: Jun Rao Attachments: KAFKA-2041.patch, kafka-2041-001.patch, kafka-2041-002.patch, kafka-2041-003.patch KafkaLog4jAppender is the Log4j Appender to publish messages to Kafka. Since there is no key or explicit partition number, the messages are sent to random partitions. In some cases, it is possible to derive a key from the message itself. So it may be beneficial to enable KafkaLog4jAppender to accept KeyClass which will provide a key for a given message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
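One possible shape for the proposal (names are illustrative, not the patch's actual interface): a pluggable keyer that the appender could instantiate from a log4j property:
{code}
trait MessageKeyer {
  def keyFor(message: String): Array[Byte]
}

// Example policy: key each log line on its first whitespace-delimited token,
// so related messages land in the same partition.
class FirstTokenKeyer extends MessageKeyer {
  override def keyFor(message: String): Array[Byte] =
    message.split("\\s+", 2)(0).getBytes("UTF-8")
}
{code}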
[jira] [Updated] (KAFKA-2003) Add upgrade tests
[ https://issues.apache.org/jira/browse/KAFKA-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2003: - Component/s: system tests Add upgrade tests - Key: KAFKA-2003 URL: https://issues.apache.org/jira/browse/KAFKA-2003 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Gwen Shapira Assignee: Ashish K Singh To test protocol changes, compatibility and upgrade process, we need a good way to test different versions of the product together and to test end-to-end upgrade process. For example, for 0.8.2 to 0.8.3 test we want to check: * Can we start a cluster with a mix of 0.8.2 and 0.8.3 brokers? * Can a cluster of 0.8.3 brokers bump the protocol level one broker at a time? * Can 0.8.2 clients run against a cluster of 0.8.3 brokers? There are probably more questions. But an automated framework that can test those and report results will be a good start. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KAFKA-2108) Node deleted all data and re-sync from replicas after attempted upgrade from 0.8.1.1 to 0.8.2.0
[ https://issues.apache.org/jira/browse/KAFKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede resolved KAFKA-2108. -- Resolution: Incomplete Closing due to inactivity. [~slimunholyone], [~becket_qin] feel free to reopen if you have new findings. Node deleted all data and re-sync from replicas after attempted upgrade from 0.8.1.1 to 0.8.2.0 --- Key: KAFKA-2108 URL: https://issues.apache.org/jira/browse/KAFKA-2108 Project: Kafka Issue Type: Bug Affects Versions: 0.8.1.1 Reporter: Thunder Stumpges Per [email thread|http://mail-archives.apache.org/mod_mbox/kafka-users/201504.mbox/%3CCAA%2BBczTUBqg1-tpcUjwfZgZYZyOXC-Myuhd_2EaGkeKWkrCVUQ%40mail.gmail.com%3E] in the user group. We ran into an issue in an attempt to perform a rolling upgrade from 0.8.1.1 to 0.8.2.0 (we should have had 0.8.2.1 but got the wrong binaries accidentally). In shutting down the first node, it failed a clean controlled shutdown due to one corrupt topic. The message in server.log was:
{noformat}
[2015-03-31 10:21:46,250] INFO [Kafka Server 6], Remaining partitions to move: [__samza_checkpoint_ver_1_for_usersessions_1,0] (kafka.server.KafkaServer)
[2015-03-31 10:21:46,250] INFO [Kafka Server 6], Error code from controller: 0 (kafka.server.KafkaServer)
{noformat}
And the related message in state-change.log:
{noformat}
[2015-03-31 10:21:42,622] TRACE Controller 6 epoch 23 started leader election for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] (state.change.logger)
[2015-03-31 10:21:42,623] ERROR Controller 6 epoch 23 encountered error while electing leader for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] due to: LeaderAndIsr information doesn't exist for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] in OnlinePartition state. (state.change.logger)
[2015-03-31 10:21:42,623] TRACE Controller 6 epoch 23 received response correlationId 2360 for a request sent to broker id:8,host:xxx,port:9092 (state.change.logger)
[2015-03-31 10:21:42,623] ERROR Controller 6 epoch 23 initiated state change for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] from OnlinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] due to: LeaderAndIsr information doesn't exist for partition [__samza_checkpoint_ver_1_for_usersessions_1,0] in OnlinePartition state.
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:360)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:187)
at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:125)
at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:124)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:86)
at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:124)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:257)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:253)
at scala.Option.foreach(Option.scala:197)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply$mcV$sp(KafkaController.scala:253)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
at kafka.utils.Utils$.inLock(Utils.scala:538)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:252)
at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:249)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:130)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
at kafka.controller.KafkaController.shutdownBroker(KafkaController.scala:249)
at kafka.server.KafkaApis.handleControlledShutdownRequest(KafkaApis.scala:264)
at kafka.server.KafkaApis.handle(KafkaApis.scala:192)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
at java.lang.Thread.run(Thread.java:744)
Caused
[jira] [Commented] (KAFKA-2101) Metric metadata-age is reset on a failed update
[ https://issues.apache.org/jira/browse/KAFKA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513318#comment-14513318 ] Neha Narkhede commented on KAFKA-2101: -- [~timbrooks] To clarify your question - we shouldn't update the metric on failed attempts. Metric metadata-age is reset on a failed update --- Key: KAFKA-2101 URL: https://issues.apache.org/jira/browse/KAFKA-2101 Project: Kafka Issue Type: Bug Reporter: Tim Brooks In org.apache.kafka.clients.Metadata there is a lastUpdate() method that returns the time the metadata was last updated. This is only called by the metadata-age metric. However, lastRefreshMs is updated on a failed update (when the MetadataResponse has no valid nodes). This is confusing since the metric's name suggests that it is a true reflection of the age of the current metadata, but the age might be reset by a failed update. Additionally, lastRefreshMs is not reset on a failed update due to no node being available. This seems slightly inconsistent, since one failure condition resets the metric but another does not, especially since both failure conditions do trigger the backoff (for the next attempt). I have not implemented a patch yet, because I am unsure what the expected behavior is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
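Per the comment above, the expected behavior can be sketched as follows (illustrative only, not the Metadata class itself):
{code}
class MetadataAgeTracker(startMs: Long) {
  private var lastSuccessfulRefreshMs: Long = startMs

  // Only a successful update moves the refresh timestamp, so metadata-age
  // always reflects the age of the metadata actually in use.
  def recordUpdate(success: Boolean, nowMs: Long): Unit =
    if (success) lastSuccessfulRefreshMs = nowMs

  def ageMs(nowMs: Long): Long = nowMs - lastSuccessfulRefreshMs
}
{code}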
[jira] [Updated] (KAFKA-2100) Client Error doesn't preserve or display original server error code when it is an unknown code
[ https://issues.apache.org/jira/browse/KAFKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2100: - Component/s: clients Client Error doesn't preserve or display original server error code when it is an unknown code -- Key: KAFKA-2100 URL: https://issues.apache.org/jira/browse/KAFKA-2100 Project: Kafka Issue Type: Bug Components: clients Reporter: Gwen Shapira Assignee: Gwen Shapira Labels: newbie When the java client receives an unfamiliar error code, it translates it into UNKNOWN(-1, new UnknownServerException(The server experienced an unexpected error when processing the request)). This completely loses the original code, which makes troubleshooting from the client impossible. It would be better to preserve the original code and write it to the log when logging the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
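A sketch of the suggested fix (hypothetical names; the real change would live in the client's error-code mapping): keep the raw code in the message instead of collapsing every unknown code to a generic UNKNOWN error:
{code}
// Illustrative subset of the known code -> error mapping.
val knownErrors: Map[Short, String] = Map(
  0.toShort -> "NONE",
  1.toShort -> "OFFSET_OUT_OF_RANGE" // ...and so on for the known codes
)

def describe(code: Short): String =
  knownErrors.getOrElse(code,
    s"The server experienced an unexpected error (unknown error code $code)")
{code}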
[jira] [Commented] (KAFKA-1621) Standardize --messages option in perf scripts
[ https://issues.apache.org/jira/browse/KAFKA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513324#comment-14513324 ] Neha Narkhede commented on KAFKA-1621: -- [~rekhajoshm] Sorry for the delay; reviewed your PR. However, could you send it against trunk instead of 0.8.2? Standardize --messages option in perf scripts - Key: KAFKA-1621 URL: https://issues.apache.org/jira/browse/KAFKA-1621 Project: Kafka Issue Type: Bug Affects Versions: 0.8.1.1 Reporter: Jay Kreps Labels: newbie This option is specified in PerfConfig and is used by the producer, consumer and simple consumer perf commands. The docstring on the argument does not list it as required, but the producer performance test requires it, while the others don't. We should standardize this so that either all the commands require the option and it is marked as required in the docstring, or none of them list it as required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
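For reference, a minimal sketch of marking --messages as required with joptsimple (the option-parsing library the tools use), so the docstring and the behavior agree across commands; the option description is illustrative.
{code}
import joptsimple.OptionParser;
import joptsimple.OptionSet;

public class PerfOptions {
    public static void main(String[] args) {
        OptionParser parser = new OptionParser();
        // required() makes parse() throw when --messages is missing,
        // so the behavior matches a docstring that lists it as required.
        parser.accepts("messages", "The number of messages to send or consume")
              .withRequiredArg()
              .required()
              .ofType(Long.class);
        OptionSet options = parser.parse(args);
        long messages = (Long) options.valueOf("messages");
        System.out.println("messages = " + messages);
    }
}
{code}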
[jira] [Updated] (KAFKA-1904) run sanity failed test
[ https://issues.apache.org/jira/browse/KAFKA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1904: - Component/s: system tests run sanity failed test -- Key: KAFKA-1904 URL: https://issues.apache.org/jira/browse/KAFKA-1904 Project: Kafka Issue Type: Bug Components: system tests Reporter: Joe Stein Priority: Blocker Fix For: 0.8.3 Attachments: run_sanity.log.gz _test_case_name : testcase_1 _test_class_name : ReplicaBasicTest arg : bounce_broker : true arg : broker_type : leader arg : message_producing_free_time_sec : 15 arg : num_iteration : 2 arg : num_messages_to_produce_per_producer_call : 50 arg : num_partition : 2 arg : replica_factor : 3 arg : sleep_seconds_between_producer_calls : 1 validation_status : Test completed : FAILED -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2080) quick cleanup of producer performance scripts
[ https://issues.apache.org/jira/browse/KAFKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2080: - Labels: newbie (was: ) quick cleanup of producer performance scripts - Key: KAFKA-2080 URL: https://issues.apache.org/jira/browse/KAFKA-2080 Project: Kafka Issue Type: Bug Components: tools Reporter: Gwen Shapira Assignee: Gwen Shapira Labels: newbie We have two producer performance tools at the moment: one at o.a.k.client.tools and one at kafka.tools. bin/kafka-producer-perf-test.sh is calling the kafka.tools one. org.apache.kafka.clients.tools.ProducerPerformance has --messages listed as optional (with a default), while leaving the parameter out results in an error. Cleanup will include: * Removing the kafka.tools performance tool * Changing the shell script to use the new tool * Fixing the misleading documentation for --messages * Adding both performance tools to the kafka docs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1858) Make ServerShutdownTest a bit less flaky
[ https://issues.apache.org/jira/browse/KAFKA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1858: - Reviewer: Jun Rao Status: Patch Available (was: Open) Make ServerShutdownTest a bit less flaky Key: KAFKA-1858 URL: https://issues.apache.org/jira/browse/KAFKA-1858 Project: Kafka Issue Type: Bug Reporter: Gwen Shapira Attachments: KAFKA-1858.patch ServerShutdownTest currently: * Starts a KafkaServer * Does stuff * Stops the server * Counts if there are any live kafka threads This is fine on its own. But when running in a test suite (i.e. gradle test), the test is very sensitive to every other test freeing all of its resources. If you start a server in a previous test and forget to close it, ServerShutdownTest will find threads from the previous test and fail. This makes for a flaky test that is pretty challenging to troubleshoot. I suggest counting the threads at the beginning and end of each test in the class, and only failing if the number at the end is greater than the number at the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
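A minimal sketch of the before/after count proposed in the description; the thread-name filter is an assumption and the helper is illustrative, not the actual test code.
{code}
import java.util.Set;

public class ThreadLeakCheck {
    private static long kafkaThreadCount() {
        Set<Thread> threads = Thread.getAllStackTraces().keySet();
        // Count only broker threads; matching on the name is an assumption.
        return threads.stream()
                      .filter(t -> t.getName().contains("kafka"))
                      .count();
    }

    public static void assertNoLeakedThreads(Runnable testBody) {
        long before = kafkaThreadCount();
        testBody.run();
        long after = kafkaThreadCount();
        // Fail only if this test leaked threads, ignoring anything
        // left behind by earlier tests in the suite.
        if (after > before)
            throw new AssertionError("Leaked " + (after - before) + " Kafka threads");
    }
}
{code}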
[jira] [Updated] (KAFKA-2014) Chaos Monkey / Failure Inducer for Kafka
[ https://issues.apache.org/jira/browse/KAFKA-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2014: - Component/s: system tests Chaos Monkey / Failure Inducer for Kafka Key: KAFKA-2014 URL: https://issues.apache.org/jira/browse/KAFKA-2014 Project: Kafka Issue Type: Task Components: system tests Reporter: Mayuresh Gharat Assignee: Mayuresh Gharat Implement a Chaos Monkey for kafka, that will help us catch any shortcomings in the test environment before going to production. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2103) kafka.producer.AsyncProducerTest failure.
[ https://issues.apache.org/jira/browse/KAFKA-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513316#comment-14513316 ] Neha Narkhede commented on KAFKA-2103: -- [~becket_qin] Is this still an issue? If so, can we make this a sub-task of the umbrella JIRA for unit test failures that [~guozhang] created? kafka.producer.AsyncProducerTest failure. - Key: KAFKA-2103 URL: https://issues.apache.org/jira/browse/KAFKA-2103 Project: Kafka Issue Type: Bug Reporter: Jiangjie Qin I saw this test consistently failing on trunk. The recent changes are KAFKA-2099, KAFKA-1926, KAFKA-1809. kafka.producer.AsyncProducerTest testNoBroker FAILED org.scalatest.junit.JUnitTestFailedError: Should fail with FailedToSendMessageException at org.scalatest.junit.AssertionsForJUnit$class.newAssertionFailedException(AssertionsForJUnit.scala:101) at org.scalatest.junit.JUnit3Suite.newAssertionFailedException(JUnit3Suite.scala:149) at org.scalatest.Assertions$class.fail(Assertions.scala:711) at org.scalatest.junit.JUnit3Suite.fail(JUnit3Suite.scala:149) at kafka.producer.AsyncProducerTest.testNoBroker(AsyncProducerTest.scala:300) kafka.producer.AsyncProducerTest testIncompatibleEncoder PASSED kafka.producer.AsyncProducerTest testRandomPartitioner PASSED kafka.producer.AsyncProducerTest testFailedSendRetryLogic FAILED kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries. at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:91) at kafka.producer.AsyncProducerTest.testFailedSendRetryLogic(AsyncProducerTest.scala:415) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2100) Client Error doesn't preserve or display original server error code when it is an unknown code
[ https://issues.apache.org/jira/browse/KAFKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2100: - Labels: newbie (was: ) Client Error doesn't preserve or display original server error code when it is an unknown code -- Key: KAFKA-2100 URL: https://issues.apache.org/jira/browse/KAFKA-2100 Project: Kafka Issue Type: Bug Components: clients Reporter: Gwen Shapira Assignee: Gwen Shapira Labels: newbie When the java client receives an unfamiliar error code, it translates it into UNKNOWN(-1, new UnknownServerException("The server experienced an unexpected error when processing the request")) This completely loses the original code, which makes troubleshooting from the client impossible. It will be better to preserve the original code and write it to the log when logging the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1758) corrupt recovery file prevents startup
[ https://issues.apache.org/jira/browse/KAFKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1758: - Reviewer: Neha Narkhede corrupt recovery file prevents startup -- Key: KAFKA-1758 URL: https://issues.apache.org/jira/browse/KAFKA-1758 Project: Kafka Issue Type: Bug Components: log Reporter: Jason Rosenberg Assignee: Manikumar Reddy Labels: newbie Fix For: 0.9.0 Attachments: KAFKA-1758.patch Hi, We recently had a kafka node go down suddenly. When it came back up, it apparently had a corrupt recovery file, and refused to startup: {code} 2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up KafkaServer java.lang.NumberFormatException: For input string: ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:481) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76) at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106) at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at kafka.log.LogManager.loadLogs(LogManager.scala:105) at kafka.log.LogManager.init(LogManager.scala:57) at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275) at kafka.server.KafkaServer.startup(KafkaServer.scala:72) {code} And the app is under a monitor (so it was repeatedly restarting and failing with this error for several minutes before we got to it)… We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it then restarted cleanly (but of course re-synced all it’s data from replicas, so we had no data loss). Anyway, I’m wondering if that’s the expected behavior? Or should it not declare it corrupt and then proceed automatically to an unclean restart? Should this NumberFormatException be handled a bit more gracefully? We saved the corrupt file if it’s worth inspecting (although I doubt it will be useful!)…. The corrupt files appeared to be all zeroes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
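A hedged sketch of the more graceful handling the reporter asks about: treat an unreadable recovery checkpoint as absent, forcing a full recovery instead of refusing to start. The format handling is simplified and the names are illustrative, not the real kafka.server.OffsetCheckpoint code.
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoveryCheckpoint {
    // Returns an empty map when the checkpoint is corrupt, so the broker
    // can fall back to a full recovery instead of failing startup.
    public static Map<String, Long> readOrEmpty(Path file) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        List<String> lines = Files.readAllLines(file);
        try {
            for (String line : lines.subList(2, lines.size())) { // skip version and count lines
                String[] parts = line.trim().split("\\s+");      // "topic partition offset"
                offsets.put(parts[0] + "-" + parts[1], Long.parseLong(parts[2]));
            }
        } catch (RuntimeException e) { // NumberFormatException, truncated lines, ...
            System.err.println("Corrupt recovery checkpoint " + file + ", ignoring: " + e);
            return new HashMap<>();
        }
        return offsets;
    }
}
{code}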
[jira] [Commented] (KAFKA-1758) corrupt recovery file prevents startup
[ https://issues.apache.org/jira/browse/KAFKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513186#comment-14513186 ] Neha Narkhede commented on KAFKA-1758: -- [~omkreddy] I took a quick look and left a few review comments. Should be able to merge once you fix those. corrupt recovery file prevents startup -- Key: KAFKA-1758 URL: https://issues.apache.org/jira/browse/KAFKA-1758 Project: Kafka Issue Type: Bug Components: log Reporter: Jason Rosenberg Assignee: Manikumar Reddy Labels: newbie Fix For: 0.9.0 Attachments: KAFKA-1758.patch Hi, We recently had a kafka node go down suddenly. When it came back up, it apparently had a corrupt recovery file, and refused to startup: {code} 2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up KafkaServer java.lang.NumberFormatException: For input string: ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:481) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76) at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106) at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at kafka.log.LogManager.loadLogs(LogManager.scala:105) at kafka.log.LogManager.init(LogManager.scala:57) at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275) at kafka.server.KafkaServer.startup(KafkaServer.scala:72) {code} And the app is under a monitor (so it was repeatedly restarting and failing with this error for several minutes before we got to it)… We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it then restarted cleanly (but of course re-synced all it’s data from replicas, so we had no data loss). Anyway, I’m wondering if that’s the expected behavior? Or should it not declare it corrupt and then proceed automatically to an unclean restart? Should this NumberFormatException be handled a bit more gracefully? We saved the corrupt file if it’s worth inspecting (although I doubt it will be useful!)…. The corrupt files appeared to be all zeroes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2150) FetcherThread backoff need to grab lock before wait on condition.
[ https://issues.apache.org/jira/browse/KAFKA-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2150: - Reviewer: Guozhang Wang [~guozhang] This was introduced in KAFKA-1461, so assigning to you for review since you reviewed that one :) FetcherThread backoff need to grab lock before wait on condition. - Key: KAFKA-2150 URL: https://issues.apache.org/jira/browse/KAFKA-2150 Project: Kafka Issue Type: Bug Reporter: Jiangjie Qin Assignee: Sriharsha Chintalapani Labels: newbie++ Attachments: KAFKA-2150.patch, KAFKA-2150_2015-04-25_13:14:05.patch, KAFKA-2150_2015-04-25_13:18:35.patch, KAFKA-2150_2015-04-25_13:35:36.patch Saw the following error: kafka.api.ProducerBounceTest testBrokerFailure STANDARD_OUT [2015-04-25 00:40:43,997] ERROR [ReplicaFetcherThread-0-0], Error due to (kafka.server.ReplicaFetcherThread:103) java.lang.IllegalMonitorStateException at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:127) at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1239) at java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1668) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2107) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) [2015-04-25 00:40:47,064] ERROR [ReplicaFetcherThread-0-1], Error due to (kafka.server.ReplicaFetcherThread:103) java.lang.IllegalMonitorStateException at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:127) at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1239) at java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1668) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2107) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) We should grab the lock before waiting on the condition. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
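The failure above is the standard java.util.concurrent contract: Condition.await() may only be called while holding the associated lock, otherwise it throws IllegalMonitorStateException. A minimal sketch of the corrected backoff pattern (the names are illustrative):
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class BackoffExample {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition partitionsAdded = lock.newCondition();

    // Wait up to backoffMs for work, holding the lock as await() requires.
    public void backoff(long backoffMs) throws InterruptedException {
        lock.lock();
        try {
            partitionsAdded.await(backoffMs, TimeUnit.MILLISECONDS);
        } finally {
            lock.unlock(); // always release, even if await() is interrupted
        }
    }
}
{code}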
[jira] [Commented] (KAFKA-1367) Broker topic metadata not kept in sync with ZooKeeper
[ https://issues.apache.org/jira/browse/KAFKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513265#comment-14513265 ] Neha Narkhede commented on KAFKA-1367: -- [~jkreps] Yup, issue still exists and the solution I still recommend is to have the controller register watches and know the latest ISR for all partitions. This change isn't big if someone wants to take a stab. Broker topic metadata not kept in sync with ZooKeeper - Key: KAFKA-1367 URL: https://issues.apache.org/jira/browse/KAFKA-1367 Project: Kafka Issue Type: Bug Affects Versions: 0.8.0, 0.8.1 Reporter: Ryan Berdeen Labels: newbie++ Fix For: 0.8.3 Attachments: KAFKA-1367.txt When a broker is restarted, the topic metadata responses from the brokers will be incorrect (different from ZooKeeper) until a preferred replica leader election. In the metadata, it looks like leaders are correctly removed from the ISR when a broker disappears, but followers are not. Then, when a broker reappears, the ISR is never updated. I used a variation of the Vagrant setup created by Joe Stein to reproduce this with latest from the 0.8.1 branch: https://github.com/also/kafka/commit/dba36a503a5e22ea039df0f9852560b4fb1e067c -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2027) kafka never notifies the zookeeper client when a partition is moved due to an auto-rebalance (when auto.leader.rebalance.enable=true)
[ https://issues.apache.org/jira/browse/KAFKA-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513151#comment-14513151 ] Neha Narkhede commented on KAFKA-2027: -- Can you describe the end user behavior that you are looking for and the reason behind it? kafka never notifies the zookeeper client when a partition is moved due to an auto-rebalance (when auto.leader.rebalance.enable=true) --- Key: KAFKA-2027 URL: https://issues.apache.org/jira/browse/KAFKA-2027 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 0.8.1.1 Environment: Kafka 0.8.1.1, Node.js Mac OS Reporter: Sampath Reddy Lambu Assignee: Neha Narkhede Priority: Blocker I would like to report an issue when auto.leader.rebalance.enable=true. Kafka never sends an event/notification to its zookeeper client after a preferred election completes. This works fine with a manual rebalance from the CLI (kafka-preferred-replica-election.sh). Initially I thought this issue was with Kafka-Node, but it's not. An event should be emitted from zookeeper if any partition moved during preferred election. I'm working with kafka_2.9.2-0.8.1.1 (brokers) and Kafka-Node (Node.JS). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2105) NullPointerException in client on MetadataRequest
[ https://issues.apache.org/jira/browse/KAFKA-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2105: - Status: Patch Available (was: Open) NullPointerException in client on MetadataRequest - Key: KAFKA-2105 URL: https://issues.apache.org/jira/browse/KAFKA-2105 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 0.8.2.1 Reporter: Roger Hoover Priority: Minor Attachments: guard-from-null.patch With the new producer, if you accidentally call KafkaProducer.partitionsFor(null), it will cause the IO thread to throw an NPE. Uncaught error in kafka producer I/O thread: java.lang.NullPointerException at org.apache.kafka.common.utils.Utils.utf8Length(Utils.java:174) at org.apache.kafka.common.protocol.types.Type$5.sizeOf(Type.java:176) at org.apache.kafka.common.protocol.types.ArrayOf.sizeOf(ArrayOf.java:55) at org.apache.kafka.common.protocol.types.Schema.sizeOf(Schema.java:81) at org.apache.kafka.common.protocol.types.Struct.sizeOf(Struct.java:218) at org.apache.kafka.common.requests.RequestSend.serialize(RequestSend.java:35) at org.apache.kafka.common.requests.RequestSend.<init>(RequestSend.java:29) at org.apache.kafka.clients.NetworkClient.metadataRequest(NetworkClient.java:369) at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:391) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:188) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
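A hedged sketch of the kind of guard the attached patch suggests: validate the topic on the caller's thread so the failure surfaces in partitionsFor() itself rather than in the I/O thread. The names are illustrative, not the actual patch.
{code}
import java.util.Collections;
import java.util.List;
import java.util.Objects;

public class PartitionsForGuard {
    public List<Integer> partitionsFor(String topic) {
        // Fail fast on the calling thread with a clear message instead of
        // letting a null topic reach the sender's serialization path.
        Objects.requireNonNull(topic, "topic cannot be null");
        return fetchPartitionMetadata(topic);
    }

    private List<Integer> fetchPartitionMetadata(String topic) {
        return Collections.singletonList(0); // placeholder for the real metadata fetch
    }
}
{code}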
[jira] [Commented] (KAFKA-2029) Improving controlled shutdown for rolling updates
[ https://issues.apache.org/jira/browse/KAFKA-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513138#comment-14513138 ] Neha Narkhede commented on KAFKA-2029: -- [~jkreps] That's somewhat right. When the controller was simple and we didn't have the new non-blocking NetworkClient, the single-queue-based blocking I/O wasn't a bad place to start. Now, it is time to refactor. Though I haven't given this a lot of thought, I'm not sure if a single-threaded design will work. It *might* be possible to end up with a single-threaded controller, though I can't be entirely sure. Basically, we have to think about situations when something has to happen in the callback, where that callback executes, and whether that callback execution should block sending other commands. [~dmitrybugaychenko] Thanks for looking into this. Some of the problems you listed above should be resolved with the unlimited controller-to-broker queue change. For other changes, I highly recommend we look at the new network client and propose a design based on that. I also think we should avoid patching the current controller and be really, really careful about testing. We have found that any change to the controller introduces subtle bugs causing serious instability. Let's start with a design doc and agree on that. I can help review it. Improving controlled shutdown for rolling updates - Key: KAFKA-2029 URL: https://issues.apache.org/jira/browse/KAFKA-2029 Project: Kafka Issue Type: Improvement Components: controller Affects Versions: 0.8.1.1 Reporter: Dmitry Bugaychenko Assignee: Neha Narkhede Priority: Critical Attachments: KAFKA-2029.patch, KAFKA-2029.patch Controlled shutdown as implemented currently can cause numerous problems: deadlocks, local and global data loss, partitions without a leader, etc. In some cases the only way to restore the cluster is to stop it completely using kill -9 and start again. Note 1: I'm aware of KAFKA-1305, but the proposed workaround to increase the queue size makes things much worse (see discussion there). Note 2: The problems described here can occur in any setup, but they are extremely painful in a setup with large brokers (36 disks, 1000+ assigned partitions per broker in our case). Note 3: These improvements are actually workarounds and it is worth considering a global refactoring of the controller (make it single-threaded, or even get rid of it in favour of ZK leader elections for partitions). The problems and improvements are: # Controlled shutdown takes a long time (10+ minutes): the broker sends multiple shutdown requests, finally considers the shutdown failed and proceeds to an unclean shutdown, while the controller gets stuck for a while (holding a lock while waiting for free space in the controller-to-broker queue). After the broker starts back up it receives follower requests and erases its highwatermarks (with a message that the replica does not exist - the controller hadn't yet sent a request with the replica assignment), then when the controller starts replicas on the broker it deletes all local data (due to the missing highwatermarks). Furthermore, the controller starts processing the pending shutdown request and stops replicas on the broker, leaving it in a non-functional state. A solution to the problem might be to increase the time the broker waits for the controller's response to the shutdown request, but this timeout is taken from controller.socket.timeout.ms, which is global for all broker-controller communication, and setting it to 30 minutes is dangerous.
*Proposed solution: introduce a dedicated config parameter for this timeout with a high default*. # If a broker goes down during controlled shutdown and does not come back, the controller gets stuck in a deadlock (one thread owns the lock and tries to add a message to the dead broker's queue, the send thread is in an infinite loop trying to retry the message to the dead broker, and the broker failure handler is waiting for the lock). There are numerous partitions without a leader and the only way out is to kill -9 the controller. *Proposed solution: add a timeout for adding messages to the broker's queue*. ControllerChannelManager.sendRequest:
{code}
def sendRequest(brokerId: Int, request: RequestOrResponse, callback: (RequestOrResponse) => Unit = null) {
  brokerLock synchronized {
    val stateInfoOpt = brokerStateInfo.get(brokerId)
    stateInfoOpt match {
      case Some(stateInfo) =>
        // ODKL Patch: prevent infinite hang on trying to send message to a dead broker.
        // TODO: Move timeout to config
        if (!stateInfo.messageQueue.offer((request, callback), 10, TimeUnit.SECONDS)) {
          error("Timed out trying to send message to broker " + brokerId.toString)
          // Do not throw, as it
{code}
[jira] [Updated] (KAFKA-2029) Improving controlled shutdown for rolling updates
[ https://issues.apache.org/jira/browse/KAFKA-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2029: - Reviewer: Neha Narkhede Improving controlled shutdown for rolling updates - Key: KAFKA-2029 URL: https://issues.apache.org/jira/browse/KAFKA-2029 Project: Kafka Issue Type: Improvement Components: controller Affects Versions: 0.8.1.1 Reporter: Dmitry Bugaychenko Assignee: Neha Narkhede Priority: Critical Attachments: KAFKA-2029.patch, KAFKA-2029.patch Controlled shutdown as implemented currently can cause numerous problems: deadlocks, local and global data loss, partitions without a leader, etc. In some cases the only way to restore the cluster is to stop it completely using kill -9 and start again. Note 1: I'm aware of KAFKA-1305, but the proposed workaround to increase the queue size makes things much worse (see discussion there). Note 2: The problems described here can occur in any setup, but they are extremely painful in a setup with large brokers (36 disks, 1000+ assigned partitions per broker in our case). Note 3: These improvements are actually workarounds and it is worth considering a global refactoring of the controller (make it single-threaded, or even get rid of it in favour of ZK leader elections for partitions). The problems and improvements are: # Controlled shutdown takes a long time (10+ minutes): the broker sends multiple shutdown requests, finally considers the shutdown failed and proceeds to an unclean shutdown, while the controller gets stuck for a while (holding a lock while waiting for free space in the controller-to-broker queue). After the broker starts back up it receives follower requests and erases its highwatermarks (with a message that the replica does not exist - the controller hadn't yet sent a request with the replica assignment), then when the controller starts replicas on the broker it deletes all local data (due to the missing highwatermarks). Furthermore, the controller starts processing the pending shutdown request and stops replicas on the broker, leaving it in a non-functional state. A solution to the problem might be to increase the time the broker waits for the controller's response to the shutdown request, but this timeout is taken from controller.socket.timeout.ms, which is global for all broker-controller communication, and setting it to 30 minutes is dangerous. *Proposed solution: introduce a dedicated config parameter for this timeout with a high default*. # If a broker goes down during controlled shutdown and does not come back, the controller gets stuck in a deadlock (one thread owns the lock and tries to add a message to the dead broker's queue, the send thread is in an infinite loop trying to retry the message to the dead broker, and the broker failure handler is waiting for the lock). There are numerous partitions without a leader and the only way out is to kill -9 the controller. *Proposed solution: add a timeout for adding messages to the broker's queue*. ControllerChannelManager.sendRequest:
{code}
def sendRequest(brokerId: Int, request: RequestOrResponse, callback: (RequestOrResponse) => Unit = null) {
  brokerLock synchronized {
    val stateInfoOpt = brokerStateInfo.get(brokerId)
    stateInfoOpt match {
      case Some(stateInfo) =>
        // ODKL Patch: prevent infinite hang on trying to send message to a dead broker.
        // TODO: Move timeout to config
        if (!stateInfo.messageQueue.offer((request, callback), 10, TimeUnit.SECONDS)) {
          error("Timed out trying to send message to broker " + brokerId.toString)
          // Do not throw, as it brings controller into completely non-functional state
          // (Controller to broker state change requests batch is not empty while creating a new one)
          //throw new IllegalStateException("Timed out trying to send message to broker " + brokerId.toString)
        }
      case None =>
        warn("Not sending request %s to broker %d, since it is offline.".format(request, brokerId))
    }
  }
}
{code}
# When the broker that is the controller starts shutting down while auto leader rebalance is running, it deadlocks in the end (the shutdown thread owns the lock and waits for the rebalance thread to exit, and the rebalance thread waits for the lock). *Proposed solution: use a bounded wait in the rebalance thread*. KafkaController.scala:
{code}
// ODKL Patch to prevent deadlocks in shutdown.
/**
 * Execute the given function inside the lock
 */
def inLockIfRunning[T](lock: ReentrantLock)(fun: => T): T = {
  if (isRunning || lock.isHeldByCurrentThread) {
    // TODO: Configure timeout.
    if (!lock.tryLock(10, TimeUnit.SECONDS)) {
      throw new IllegalStateException("Failed to acquire controller lock in 10 seconds.");
    }
    try {
{code}
[jira] [Commented] (KAFKA-2140) Improve code readability
[ https://issues.apache.org/jira/browse/KAFKA-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513202#comment-14513202 ] Neha Narkhede commented on KAFKA-2140: -- Fixed and pushed. Improve code readability Key: KAFKA-2140 URL: https://issues.apache.org/jira/browse/KAFKA-2140 Project: Kafka Issue Type: Improvement Reporter: Ismael Juma Assignee: Ismael Juma Priority: Minor Attachments: KAFKA-2140-fix.patch, KAFKA-2140.patch There are a number of places where code could be written in a more readable and idiomatic form. It's easier to explain with a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1887) controller error message on shutting the last broker
[ https://issues.apache.org/jira/browse/KAFKA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513159#comment-14513159 ] Neha Narkhede commented on KAFKA-1887: -- [~sriharsha] The patch is waiting for a comment. If you can squeeze that in, I'll help you merge this change. controller error message on shutting the last broker Key: KAFKA-1887 URL: https://issues.apache.org/jira/browse/KAFKA-1887 Project: Kafka Issue Type: Bug Components: core Reporter: Jun Rao Assignee: Sriharsha Chintalapani Priority: Minor Fix For: 0.8.3 Attachments: KAFKA-1887.patch, KAFKA-1887_2015-02-21_01:12:25.patch We always see the following error in the state-change log when shutting down the last broker. [2015-01-20 13:21:04,036] ERROR Controller 0 epoch 3 initiated state change for partition [test,0] from OfflinePartition to OnlinePartition failed (state.change.logger) kafka.common.NoReplicaOnlineException: No replica for partition [test,0] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)] at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:75) at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:357) at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:206) at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:120) at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:117) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:117) at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:446) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359) at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357) at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357) at kafka.utils.Utils$.inLock(Utils.scala:535) at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356) at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1940) Initial checkout and build failing
[ https://issues.apache.org/jira/browse/KAFKA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1940: - Status: Patch Available (was: Open) Initial checkout and build failing -- Key: KAFKA-1940 URL: https://issues.apache.org/jira/browse/KAFKA-1940 Project: Kafka Issue Type: Bug Components: build Affects Versions: 0.8.1.2 Environment: Groovy: 1.8.6 Ant: Apache Ant(TM) version 1.9.2 compiled on July 8 2013 Ivy: 2.2.0 JVM: 1.8.0_25 (Oracle Corporation 25.25-b02) OS: Windows 7 6.1 amd64 Reporter: Martin Lemanski Labels: build Attachments: zinc-upgrade.patch When performing `gradle wrapper` and `gradlew build` as a new developer, I get an exception: {code} C:\development\git\kafka> gradlew build --stacktrace ... FAILURE: Build failed with an exception. * What went wrong: Execution failed for task ':core:compileScala'. com.typesafe.zinc.Setup.create(Lcom/typesafe/zinc/ScalaLocation;Lcom/typesafe/zinc/SbtJars;Ljava/io/File;)Lcom/typesafe/zinc/Setup; {code} Details: https://gist.github.com/mlem/ddff83cc8a25b040c157 Current Commit: {code} C:\development\git\kafka> git rev-parse --verify HEAD 71602de0bbf7727f498a812033027f6cbfe34eb8 {code} I am evaluating kafka for my company and wanted to run some tests with it, but couldn't due to this error. I know gradle can be tricky and it's not easy to set everything up correctly, but this kind of bug turns possible committers/users off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-873) Consider replacing zkclient with curator (with zkclient-bridge)
[ https://issues.apache.org/jira/browse/KAFKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503169#comment-14503169 ] Neha Narkhede commented on KAFKA-873: - I'm not so sure yet that moving to Curator is a good idea, at least not until we do a full analysis of the current zkclient problems and how Curator fixes those. Agreed that zkclient is not very well supported, but anytime we have found a serious bug, they have accepted the patch and released it. But my understanding is that the upside of Curator is that it includes a set of recipes for common operations that people use ZooKeeper for. Let me elaborate on what I think is the problem with zkclient. It wraps the zookeeper client APIs with the purpose of making it easy to perform common ZooKeeper operations. However, this limits the user to the behavior dictated by the wrapper, irrespective of how the underlying zookeeper library behaves. An example of this is the indefinite retries during a ZooKeeper disconnect. You may not want to retry indefinitely and might want to quit the operation after a timeout. Then there are various bugs introduced due to the zkclient wrapper design. For example, we have seen weird bugs due to the fact that zkclient saves the list of triggered watches in an internal queue and invokes the configured user callback in a background thread. The problems we've seen with zkclient will not be fixed with another wrapper (Curator). It looks like it will be better for us to just write a simple zookeeper client utility inside Kafka itself. If you look at zkclient, it is a pretty simple wrapper over the zookeeper client APIs. So this may not be a huge undertaking and will be a better long-term solution. Consider replacing zkclient with curator (with zkclient-bridge) --- Key: KAFKA-873 URL: https://issues.apache.org/jira/browse/KAFKA-873 Project: Kafka Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Scott Clasen Assignee: Grant Henke If zkclient was replaced with curator and curator-x-zkclient-bridge, it would initially be a drop-in replacement https://github.com/Netflix/curator/wiki/ZKClient-Bridge With the addition of a few more props to ZkConfig, and a bit of code, this would open up the possibility of using ACLs in zookeeper (which aren't supported directly by zkclient), as well as integrating with netflix exhibitor for those of us using that. Looks like KafkaZookeeperClient needs some love anyhow... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
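As a minimal illustration of the "quit after a timeout" behavior described above, a bounded-retry read against the raw org.apache.zookeeper client; the retry policy and backoff are assumptions, not Kafka code:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class BoundedZkRead {
    // Retry a read a fixed number of times, then give up, instead of
    // retrying indefinitely the way the zkclient wrapper does.
    public static byte[] readData(ZooKeeper zk, String path, int maxRetries)
            throws KeeperException, InterruptedException {
        KeeperException last = null;
        for (int i = 0; i <= maxRetries; i++) {
            try {
                return zk.getData(path, false, null);
            } catch (KeeperException.ConnectionLossException e) {
                last = e;
                Thread.sleep(100L * (i + 1)); // simple linear backoff
            }
        }
        throw last; // the caller decides what a failed read means
    }
}
{code}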
[jira] [Updated] (KAFKA-2057) DelayedOperationTest.testRequestExpiry transient failure
[ https://issues.apache.org/jira/browse/KAFKA-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2057: - Resolution: Fixed Reviewer: Neha Narkhede Status: Resolved (was: Patch Available) Thanks for the patch. Pushed to trunk. DelayedOperationTest.testRequestExpiry transient failure Key: KAFKA-2057 URL: https://issues.apache.org/jira/browse/KAFKA-2057 Project: Kafka Issue Type: Sub-task Reporter: Guozhang Wang Assignee: Rajini Sivaram Labels: newbie Attachments: KAFKA-2057.patch {code} kafka.server.DelayedOperationTest testRequestExpiry FAILED junit.framework.AssertionFailedError: Time for expiration 19 should at least 20 at junit.framework.Assert.fail(Assert.java:47) at junit.framework.Assert.assertTrue(Assert.java:20) at kafka.server.DelayedOperationTest.testRequestExpiry(DelayedOperationTest.scala:68) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2087) TopicConfigManager javadoc references incorrect paths
[ https://issues.apache.org/jira/browse/KAFKA-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2087: - Reviewer: Neha Narkhede TopicConfigManager javadoc references incorrect paths - Key: KAFKA-2087 URL: https://issues.apache.org/jira/browse/KAFKA-2087 Project: Kafka Issue Type: Bug Reporter: Aditya Auradkar Assignee: Aditya Auradkar Priority: Trivial Attachments: KAFKA-2087.patch The TopicConfigManager docs refer to znodes in /brokers/topics/topic_name/config which is incorrect. Fix javadoc -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2087) TopicConfigManager javadoc references incorrect paths
[ https://issues.apache.org/jira/browse/KAFKA-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2087: - Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to trunk. Thanks! TopicConfigManager javadoc references incorrect paths - Key: KAFKA-2087 URL: https://issues.apache.org/jira/browse/KAFKA-2087 Project: Kafka Issue Type: Bug Reporter: Aditya Auradkar Assignee: Aditya Auradkar Priority: Trivial Attachments: KAFKA-2087.patch The TopicConfigManager docs refer to znodes in /brokers/topics/topic_name/config which is incorrect. Fix javadoc -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1961) Looks like its possible to delete _consumer_offsets topic
[ https://issues.apache.org/jira/browse/KAFKA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394856#comment-14394856 ] Neha Narkhede commented on KAFKA-1961: -- +1. Checking it in... Looks like its possible to delete _consumer_offsets topic - Key: KAFKA-1961 URL: https://issues.apache.org/jira/browse/KAFKA-1961 Project: Kafka Issue Type: Bug Affects Versions: 0.8.2.0 Reporter: Gwen Shapira Assignee: Gwen Shapira Labels: newbie Attachments: KAFKA-1961-6.patch, KAFKA-1961.3.patch, KAFKA-1961.4.patch Noticed that kafka-topics.sh --delete can successfully delete internal topics (__consumer_offsets). I'm pretty sure we want to prevent that, to avoid users shooting themselves in the foot. Topic admin command should check for internal topics, just like ReplicaManager does and not let users delete them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1961) Looks like its possible to delete _consumer_offsets topic
[ https://issues.apache.org/jira/browse/KAFKA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1961: - Resolution: Fixed Assignee: Ted Malaska (was: Gwen Shapira) Status: Resolved (was: Patch Available) Thanks for the patch, Ted! Pushed to trunk. Looks like its possible to delete _consumer_offsets topic - Key: KAFKA-1961 URL: https://issues.apache.org/jira/browse/KAFKA-1961 Project: Kafka Issue Type: Bug Affects Versions: 0.8.2.0 Reporter: Gwen Shapira Assignee: Ted Malaska Labels: newbie Attachments: KAFKA-1961-6.patch, KAFKA-1961.3.patch, KAFKA-1961.4.patch Noticed that kafka-topics.sh --delete can successfully delete internal topics (__consumer_offsets). I'm pretty sure we want to prevent that, to avoid users shooting themselves in the foot. Topic admin command should check for internal topics, just like ReplicaManager does and not let users delete them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
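A hedged sketch of the guard described in the issue: reject deletion of internal topics in the admin command before any state is touched. The internal-topic set and command shape are illustrative:
{code}
import java.util.Collections;
import java.util.Set;

public class DeleteTopicCommand {
    // __consumer_offsets is the internal topic in question; treating the
    // set as extensible is an assumption.
    private static final Set<String> INTERNAL_TOPICS =
        Collections.singleton("__consumer_offsets");

    public static void deleteTopic(String topic) {
        if (INTERNAL_TOPICS.contains(topic))
            throw new IllegalArgumentException(
                "Topic " + topic + " is a Kafka internal topic and cannot be deleted");
        // ... proceed with writing the delete marker to /admin/delete_topics ...
    }
}
{code}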
[jira] [Comment Edited] (KAFKA-1293) Mirror maker housecleaning
[ https://issues.apache.org/jira/browse/KAFKA-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386039#comment-14386039 ] Neha Narkhede edited comment on KAFKA-1293 at 3/30/15 12:15 AM: cc [~junrao] who can help with the access to the wiki. was (Author: nehanarkhede): cc [~junrao] Mirror maker housecleaning -- Key: KAFKA-1293 URL: https://issues.apache.org/jira/browse/KAFKA-1293 Project: Kafka Issue Type: Bug Components: tools Affects Versions: 0.8.1 Reporter: Jay Kreps Priority: Minor Labels: usability Attachments: KAFKA-1293.patch Mirror maker uses its own convention for command-line arguments, e.g. --num.producers, where everywhere else follows the unix convention like --num-producers. This is annoying because when running different tools you have to constantly remember the quirks of the person who wrote each tool. Mirror maker should also have a top-level wrapper script in bin/ to make tab completion work and so you don't have to remember the fully qualified class name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1293) Mirror maker housecleaning
[ https://issues.apache.org/jira/browse/KAFKA-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386039#comment-14386039 ] Neha Narkhede commented on KAFKA-1293: -- cc [~junrao] Mirror maker housecleaning -- Key: KAFKA-1293 URL: https://issues.apache.org/jira/browse/KAFKA-1293 Project: Kafka Issue Type: Bug Components: tools Affects Versions: 0.8.1 Reporter: Jay Kreps Priority: Minor Labels: usability Attachments: KAFKA-1293.patch Mirror maker uses its own convention for command-line arguments, e.g. --num.producers, where everywhere else follows the unix convention like --num-producers. This is annoying because when running different tools you have to constantly remember the quirks of the person who wrote each tool. Mirror maker should also have a top-level wrapper script in bin/ to make tab completion work and so you don't have to remember the fully qualified class name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1994) Evaluate performance effect of chroot check on Topic creation
[ https://issues.apache.org/jira/browse/KAFKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1994: - Reviewer: Jun Rao Evaluate performance effect of chroot check on Topic creation - Key: KAFKA-1994 URL: https://issues.apache.org/jira/browse/KAFKA-1994 Project: Kafka Issue Type: Improvement Reporter: Ashish K Singh Assignee: Ashish K Singh Attachments: KAFKA-1994.patch, KAFKA-1994_2015-03-03_18:19:45.patch KAFKA-1664 adds a check for chroot while creating a node in ZK. ZkPath checks if the namespace exists before trying to create a path in ZK. This raises a concern that checking the namespace for each path creation might be unnecessary and can potentially make creations expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
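A small sketch of one way to avoid paying the namespace check on every creation, as the concern suggests: check the chroot once and cache the result. This illustrates the trade-off, not the actual ZkPath code:
{code}
import java.util.concurrent.atomic.AtomicBoolean;

public class ZkPathCache {
    private final AtomicBoolean namespaceChecked = new AtomicBoolean(false);

    public void createPersistentPath(String path) {
        // Only verify the chroot/namespace on the first creation;
        // later creations skip the extra round trip.
        if (namespaceChecked.compareAndSet(false, true))
            checkNamespaceExists();
        doCreate(path);
    }

    private void checkNamespaceExists() { /* exists() call on the chroot */ }
    private void doCreate(String path)  { /* actual ZooKeeper create */ }
}
{code}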
[jira] [Commented] (KAFKA-1215) Rack-Aware replica assignment option
[ https://issues.apache.org/jira/browse/KAFKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386037#comment-14386037 ] Neha Narkhede commented on KAFKA-1215: -- [~allenxwang] This was inactive for a while, but I think it will be good to wait until KAFKA-1792 is done to propose a solution for rack-awareness. Rack-Aware replica assignment option Key: KAFKA-1215 URL: https://issues.apache.org/jira/browse/KAFKA-1215 Project: Kafka Issue Type: Improvement Components: replication Affects Versions: 0.8.0 Reporter: Joris Van Remoortere Assignee: Jun Rao Fix For: 0.9.0 Attachments: rack_aware_replica_assignment_v1.patch, rack_aware_replica_assignment_v2.patch Adding a rack-id to kafka config. This rack-id can be used during replica assignment by using the max-rack-replication argument in the admin scripts (create topic, etc.). By default the original replication assignment algorithm is used because max-rack-replication defaults to -1. max-rack-replication -1 is not honored if you are doing manual replica assignment (preferred). If this looks good I can add some test cases specific to the rack-aware assignment. I can also port this to trunk. We are currently running 0.8.0 in production and need this, so I wrote the patch against that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2046) Delete topic still doesn't work
[ https://issues.apache.org/jira/browse/KAFKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385996#comment-14385996 ] Neha Narkhede commented on KAFKA-2046: -- [~onurkaraman] Shouldn't controller.message.queue.size be infinite? It seems that the moment there is some backlog of state changes on the controller, it will deadlock, causing bad things to happen. Delete topic still doesn't work --- Key: KAFKA-2046 URL: https://issues.apache.org/jira/browse/KAFKA-2046 Project: Kafka Issue Type: Bug Reporter: Clark Haskins Assignee: Onur Karaman I just attempted to delete a 128 partition topic with all inbound producers stopped. The result was as follows: The /admin/delete_topics znode was empty; the topic under /brokers/topics was removed; the Kafka topics command showed that the topic was removed. However, the data on disk on each broker was not deleted and the topic has not yet been re-created by starting up the inbound mirror maker. Let me know what additional information is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-527) Compression support does numerous byte copies
[ https://issues.apache.org/jira/browse/KAFKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385999#comment-14385999 ] Neha Narkhede commented on KAFKA-527: - [~guozhang] +1. Will be great to re-run the test after your patch. Compression support does numerous byte copies - Key: KAFKA-527 URL: https://issues.apache.org/jira/browse/KAFKA-527 Project: Kafka Issue Type: Bug Components: compression Reporter: Jay Kreps Assignee: Yasuhiro Matsuda Priority: Critical Attachments: KAFKA-527.message-copy.history, KAFKA-527.patch, KAFKA-527_2015-03-16_15:19:29.patch, KAFKA-527_2015-03-19_21:32:24.patch, KAFKA-527_2015-03-25_12:08:00.patch, KAFKA-527_2015-03-25_13:26:36.patch, java.hprof.no-compression.txt, java.hprof.snappy.text The data path for compressing or decompressing messages is extremely inefficient. We do something like 7 (?) complete copies of the data, often for simple things like adding a 4 byte size to the front. I am not sure how this went by unnoticed. This is likely the root cause of the performance issues we saw in doing bulk recompression of data in mirror maker. The mismatch between the InputStream and OutputStream interfaces and the Message/MessageSet interfaces which are based on byte buffers is the cause of many of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
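A sketch of the kind of buffer-backed stream that removes one of these copies when bridging stream-based compression codecs and ByteBuffer-based message sets; this illustrates the mismatch described in the issue, not the actual patch:
{code}
import java.io.OutputStream;
import java.nio.ByteBuffer;

// An OutputStream that writes straight into a ByteBuffer, so compressed
// bytes land in the final buffer without a byte[] -> ByteBuffer copy.
public class ByteBufferOutputStream extends OutputStream {
    private ByteBuffer buffer;

    public ByteBufferOutputStream(int initialCapacity) {
        this.buffer = ByteBuffer.allocate(initialCapacity);
    }

    @Override
    public void write(int b) {
        ensureRemaining(1);
        buffer.put((byte) b);
    }

    @Override
    public void write(byte[] bytes, int off, int len) {
        ensureRemaining(len);
        buffer.put(bytes, off, len);
    }

    private void ensureRemaining(int needed) {
        if (buffer.remaining() < needed) { // grow by doubling
            ByteBuffer grown = ByteBuffer.allocate(
                Math.max(buffer.capacity() * 2, buffer.position() + needed));
            buffer.flip();
            grown.put(buffer);
            buffer = grown;
        }
    }

    public ByteBuffer buffer() {
        return buffer;
    }
}
{code}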
[jira] [Commented] (KAFKA-2046) Delete topic still doesn't work
[ https://issues.apache.org/jira/browse/KAFKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386021#comment-14386021 ] Neha Narkhede commented on KAFKA-2046: -- [~onurkaraman] Could [~clarkhaskins] reproduce the delete topic issue once you changed the controller.message.queue.size to its default size (infinite)? Delete topic still doesn't work --- Key: KAFKA-2046 URL: https://issues.apache.org/jira/browse/KAFKA-2046 Project: Kafka Issue Type: Bug Reporter: Clark Haskins Assignee: Onur Karaman I just attempted to delete a 128 partition topic with all inbound producers stopped. The result was as follows: The /admin/delete_topics znode was empty; the topic under /brokers/topics was removed; the Kafka topics command showed that the topic was removed. However, the data on disk on each broker was not deleted and the topic has not yet been re-created by starting up the inbound mirror maker. Let me know what additional information is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1983) TestEndToEndLatency can be unreliable after hard kill
[ https://issues.apache.org/jira/browse/KAFKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386028#comment-14386028 ] Neha Narkhede commented on KAFKA-1983: -- cc [~junrao] TestEndToEndLatency can be unreliable after hard kill - Key: KAFKA-1983 URL: https://issues.apache.org/jira/browse/KAFKA-1983 Project: Kafka Issue Type: Improvement Reporter: Jun Rao Assignee: Grayson Chao Labels: newbie If you hard kill TestEndToEndLatency, the committed offset remains the last checkpointed one. However, more messages are now appended after the last checkpointed offset. When restarting TestEndToEndLatency, the consumer resumes from the last checkpointed offset and will report really low latency since it doesn't need to wait for a new message to be produced to read the next message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
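One way to make the restart case reliable, sketched with the newer consumer API as an assumption (the original tool predates it): position the consumer at the log end before timing, so a stale committed offset cannot yield artificially low latencies.
{code}
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LatencyTestSetup {
    // Skip any messages left over from a previous hard-killed run.
    public static void seekPastBacklog(KafkaConsumer<byte[], byte[]> consumer,
                                       String topic) {
        TopicPartition tp = new TopicPartition(topic, 0);
        consumer.assign(Collections.singletonList(tp));
        consumer.seekToEnd(Collections.singletonList(tp));
        consumer.position(tp); // force the seek to resolve before measuring
    }
}
{code}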
[jira] [Updated] (KAFKA-2034) sourceCompatibility not set in Kafka build.gradle
[ https://issues.apache.org/jira/browse/KAFKA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2034: - Labels: newbie (was: ) sourceCompatibility not set in Kafka build.gradle - Key: KAFKA-2034 URL: https://issues.apache.org/jira/browse/KAFKA-2034 Project: Kafka Issue Type: Bug Components: build Affects Versions: 0.8.2.1 Reporter: Derek Bassett Priority: Minor Labels: newbie Original Estimate: 4h Remaining Estimate: 4h The build.gradle does not explicitly set the sourceCompatibility version in build.gradle. When Kafka is built with Java 1.8, this causes the class files to be emitted with the wrong version. It would also allow Java 1.8 features to be merged into Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-2034) sourceCompatibility not set in Kafka build.gradle
[ https://issues.apache.org/jira/browse/KAFKA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-2034: - Component/s: build sourceCompatibility not set in Kafka build.gradle - Key: KAFKA-2034 URL: https://issues.apache.org/jira/browse/KAFKA-2034 Project: Kafka Issue Type: Bug Components: build Affects Versions: 0.8.2.1 Reporter: Derek Bassett Priority: Minor Labels: newbie Original Estimate: 4h Remaining Estimate: 4h The build.gradle does not explicitly set the sourceCompatibility version in build.gradle. When Kafka is built with Java 1.8, this causes the class files to be emitted with the wrong version. It would also allow Java 1.8 features to be merged into Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2046) Delete topic still doesn't work
[ https://issues.apache.org/jira/browse/KAFKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378783#comment-14378783 ] Neha Narkhede commented on KAFKA-2046: -- [~clarkhaskins] As discussed previously, the minimum amount of information that is needed to troubleshoot any issue, not just delete topic, is: 1. Controller logs (possibly at TRACE) 2. Server logs 3. State change log (DEBUG works) Delete topic still doesn't work --- Key: KAFKA-2046 URL: https://issues.apache.org/jira/browse/KAFKA-2046 Project: Kafka Issue Type: Bug Reporter: Clark Haskins Assignee: Sriharsha Chintalapani I just attempted to delete a 128 partition topic with all inbound producers stopped. The result was as follows: The /admin/delete_topics znode was empty; the topic under /brokers/topics was removed; the Kafka topics command showed that the topic was removed. However, the data on disk on each broker was not deleted and the topic has not yet been re-created by starting up the inbound mirror maker. Let me know what additional information is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1928) Move kafka.network over to using the network classes in org.apache.kafka.common.network
[ https://issues.apache.org/jira/browse/KAFKA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1928: - Issue Type: Sub-task (was: Improvement) Parent: KAFKA-1682 Move kafka.network over to using the network classes in org.apache.kafka.common.network --- Key: KAFKA-1928 URL: https://issues.apache.org/jira/browse/KAFKA-1928 Project: Kafka Issue Type: Sub-task Components: security Reporter: Jay Kreps Assignee: Gwen Shapira As part of the common package we introduced a bunch of network related code and abstractions. We should look into replacing a lot of what is in kafka.network with this code. Duplicate classes include things like Receive, Send, etc. It is likely possible to also refactor the SocketServer to make use of Selector, which should significantly simplify its code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KAFKA-1928) Move kafka.network over to using the network classes in org.apache.kafka.common.network
[ https://issues.apache.org/jira/browse/KAFKA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-1928: - Component/s: security Move kafka.network over to using the network classes in org.apache.kafka.common.network --- Key: KAFKA-1928 URL: https://issues.apache.org/jira/browse/KAFKA-1928 Project: Kafka Issue Type: Improvement Components: security Reporter: Jay Kreps Assignee: Gwen Shapira As part of the common package we introduced a bunch of network related code and abstractions. We should look into replacing a lot of what is in kafka.network with this code. Duplicate classes include things like Receive, Send, etc. It is likely possible to also refactor the SocketServer to make use of Selector, which should significantly simplify its code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1546) Automate replica lag tuning
[ https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362581#comment-14362581 ] Neha Narkhede commented on KAFKA-1546: -- [~aauradkar] Thanks for the patch. Overall, the changes look correct. I left a few review comments. And thanks for sharing the test results. Looking forward to the updated patch. Automate replica lag tuning --- Key: KAFKA-1546 URL: https://issues.apache.org/jira/browse/KAFKA-1546 Project: Kafka Issue Type: Improvement Components: replication Affects Versions: 0.8.0, 0.8.1, 0.8.1.1 Reporter: Neha Narkhede Assignee: Aditya Auradkar Labels: newbie++ Fix For: 0.8.3 Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch, KAFKA-1546_2015-03-12_13:42:01.patch Currently, there is no good way to tune the replica lag configs to automatically account for high and low volume topics on the same cluster. For a low-volume topic it will take a very long time to detect a lagging replica, and for a high-volume topic it will have false positives. One approach to making this easier would be to have the configuration be something like replica.lag.max.ms and translate this into a number of messages dynamically based on the throughput of the partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
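A back-of-the-envelope sketch of the time-based translation proposed: derive the per-partition message threshold from observed throughput, so a single replica.lag.max.ms setting adapts to both quiet and busy topics. The names and the rate source are illustrative, not the actual patch:
{code}
public class ReplicaLagPolicy {
    private final long maxLagMs; // e.g. the proposed replica.lag.max.ms

    public ReplicaLagPolicy(long maxLagMs) {
        this.maxLagMs = maxLagMs;
    }

    // A partition producing 1000 msg/s with maxLagMs = 10000 tolerates
    // 10000 messages of lag; a 1 msg/s partition tolerates only 10.
    public long allowedLagMessages(double messagesPerSecond) {
        return Math.max(1L, (long) (messagesPerSecond * maxLagMs / 1000.0));
    }

    public boolean isOutOfSync(long lagMessages, double messagesPerSecond) {
        return lagMessages > allowedLagMessages(messagesPerSecond);
    }
}
{code}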