[jira] [Comment Edited] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387028#comment-17387028 ]

David Mao edited comment on KAFKA-13135 at 7/26/21, 4:18 AM:
-------------------------------------------------------------

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. -If we can maintain correctness without passing the groupLock to DelayedProduce, we can skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.-

https://issues.apache.org/jira/browse/KAFKA-6042 looks relevant before changing any of the locking semantics here. It looks like the DelayedProduce lock was originally added to avoid deadlock.

was (Author: david.mao):

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.

https://issues.apache.org/jira/browse/KAFKA-6042 may be relevant before changing any of the locking semantics here.
> Reduce GroupMetadata lock contention for offset commit requests
> ---------------------------------------------------------------
>
> Key: KAFKA-13135
> URL: https://issues.apache.org/jira/browse/KAFKA-13135
> Project: Kafka
> Issue Type: Improvement
> Reporter: David Mao
> Priority: Major
>
> as suggested by [~lbradstreet], we can look for similar optimizations to https://issues.apache.org/jira/browse/KAFKA-13134 in the offset commit path. It looks like there are some straightforward optimizations possible for the error path.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
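The finer-grained locking idea discussed in the comment above — hold the group lock only while reading group state, and invoke the response callback after releasing it — can be sketched roughly as follows. This is a simplified illustration, not Kafka's actual GroupMetadata code; `GroupCallbackSketch` and its members are hypothetical names:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch: complete a delayed operation without holding the
// group lock while the response callback runs.
class GroupCallbackSketch {
    private final ReentrantLock groupLock = new ReentrantLock();
    private String groupState = "Stable";

    // Coarse version: the callback runs while the lock is held, so other
    // requests for the same group contend for the lock's full duration.
    void completeHoldingLock(Consumer<String> responseCallback) {
        groupLock.lock();
        try {
            responseCallback.accept(groupState); // lock held during callback
        } finally {
            groupLock.unlock();
        }
    }

    // Finer-grained version: copy what the callback needs under the lock,
    // then release the lock before invoking the callback.
    void completeOutsideLock(Consumer<String> responseCallback) {
        String snapshot;
        groupLock.lock();
        try {
            snapshot = groupState; // read group state under the lock
        } finally {
            groupLock.unlock();
        }
        responseCallback.accept(snapshot); // lock no longer held
    }
}
```

As KAFKA-6042 hints, the catch is that the lock may be protecting more than the read — e.g. ordering or deadlock-avoidance guarantees — so correctness of the narrower critical section has to be argued case by case.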
[jira] [Commented] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387032#comment-17387032 ]

Ismael Juma commented on KAFKA-13135:
-------------------------------------

The locking for this code is extremely complex and it's been a source of a lot of bugs, so we should be very careful.
[jira] [Comment Edited] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387028#comment-17387028 ]

David Mao edited comment on KAFKA-13135 at 7/26/21, 3:48 AM:
-------------------------------------------------------------

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.

https://issues.apache.org/jira/browse/KAFKA-6042 may be relevant before changing any of the locking semantics here.

was (Author: david.mao):

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.
[jira] [Comment Edited] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387028#comment-17387028 ]

David Mao edited comment on KAFKA-13135 at 7/26/21, 3:43 AM:
-------------------------------------------------------------

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.

was (Author: david.mao):

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can achieve finer grained locking, and skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.
[jira] [Comment Edited] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387028#comment-17387028 ]

David Mao edited comment on KAFKA-13135 at 7/26/21, 3:42 AM:
-------------------------------------------------------------

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state. If we can maintain correctness without passing the groupLock to DelayedProduce, we can achieve finer grained locking, and skip locking the group when sending back the offset response. I need to take a look at storeGroup to see if the groupLock is necessary there.

was (Author: david.mao):

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state, so we may be able to avoid passing in the groupLock, and achieve some finer grained locking. I need to take a look at storeGroup to see if the groupLock is necessary there.
[jira] [Commented] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
[ https://issues.apache.org/jira/browse/KAFKA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387028#comment-17387028 ]

David Mao commented on KAFKA-13135:
-----------------------------------

Taking a closer look, I think we can also optimize this for the happy path. appendForGroup passes in the groupLock which gets locked during the entire putCacheCallback when completing the DelayedProduce from appending offset messages. We already lock the groupLock inside of the callback when reading group state, so we may be able to avoid passing in the groupLock, and achieve some finer grained locking. I need to take a look at storeGroup to see if the groupLock is necessary there.
[GitHub] [kafka] chia7712 commented on a change in pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
chia7712 commented on a change in pull request #11128:
URL: https://github.com/apache/kafka/pull/11128#discussion_r676262149

## File path: clients/src/main/java/org/apache/kafka/common/requests/MetadataResponse.java

@@ -100,10 +100,7 @@ public int throttleTimeMs() {
     @Override
     public Map<Errors, Integer> errorCounts() {
         Map<Errors, Integer> errorCounts = new HashMap<>();
-        data.topics().forEach(metadata -> {
-            metadata.partitions().forEach(p -> updateErrorCounts(errorCounts, Errors.forCode(p.errorCode())));

Review comment: According to https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/metadata/ZkMetadataCache.scala#L103, it is possible that the partition level has an error while the topic level has none. Hence, removing this line produces an inaccurate count of errors. If we want a correct count of `error` and `NONE`, we have to add the following checks:

1. if the topic has an error, we don't need to loop over its partitions.
2. if the topic has no error, we have to loop over its partitions.

@ijuma WDYT?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
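The two-step check proposed in the review comment (count the topic-level error when one is present, otherwise fall back to counting partition-level errors) could look roughly like this. The `Topic` record and `short` error codes are simplified stand-ins for the real `MetadataResponseData` types, not Kafka's actual API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the proposed errorCounts logic: a topic-level error
// short-circuits the partition loop; otherwise partition errors are counted.
class ErrorCountsSketch {
    static final short NONE = 0;

    // A topic with its own error code and its partitions' error codes.
    record Topic(short errorCode, List<Short> partitionErrorCodes) {}

    static Map<Short, Integer> errorCounts(List<Topic> topics) {
        Map<Short, Integer> counts = new HashMap<>();
        for (Topic t : topics) {
            if (t.errorCode() != NONE) {
                // 1. Topic-level error present: don't loop over partitions.
                counts.merge(t.errorCode(), 1, Integer::sum);
            } else {
                // 2. No topic-level error: count each partition's error
                //    (including NONE) individually.
                for (short p : t.partitionErrorCodes())
                    counts.merge(p, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

This matches the server-side behavior referenced above, where partition-level errors are only populated when the topic itself succeeded.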
[jira] [Updated] (KAFKA-13132) Upgrading to topic IDs in LISR requests has gaps introduced in 3.0
[ https://issues.apache.org/jira/browse/KAFKA-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justine Olshan updated KAFKA-13132:
-----------------------------------

Description:

With the change in 3.0 to how topic IDs are assigned to logs, a bug was inadvertently introduced. Now, topic IDs will only be assigned on the load of the log to a partition in LISR requests. This means we will only assign topic IDs for newly created topics/partitions, on broker startup, or potentially when a partition is reassigned.

In the case of upgrading from an IBP before 2.8, we may have a scenario where we upgrade the controller to IBP 3.0 (or even 2.8) last. (I.e., the controller is on IBP < 2.8 and all other brokers are on the newest IBP.) Upon the last broker upgrading, we will elect a new controller, but its LISR request will not result in topic IDs being assigned to logs of existing topics. They will only be assigned in the cases mentioned above. *Keep in mind, in this scenario, topic IDs will still be assigned in the controller/ZK to all new and pre-existing topics and will show up in metadata.* This means we are not ensured the same guarantees we had in 2.8. *It is just the LISR/partition.metadata part of the code that is affected.*

The problem is two-fold:
1. We ignore LISR requests when the partition leader epoch has not increased (previously we assigned the ID before this check).
2. We only assign the topic ID when we are associating the log with the partition in ReplicaManager for the first time. Though in the scenario described above, we have logs associated with partitions that need to be upgraded.

We should check if the LISR request is resulting in a topic ID addition and add logic for logs already associated with partitions in ReplicaManager.

was:

With the change in 3.0 to how topic IDs are assigned to logs, a bug was inadvertently introduced. Now, topic IDs will only be assigned on the load of the log to a partition in LISR requests.
This means we will only assign topic IDs for newly created topics/partitions, on broker startup, or potentially when a partition is reassigned.

In the case of upgrading from an IBP before 2.8, we may have a scenario where we upgrade the controller to IBP 3.0 last. (I.e., the controller is on IBP < 2.8 and all other brokers are on IBP 3.0.) Upon the last broker upgrading, we will elect a new controller, but its LISR request will not result in topic IDs being assigned to logs. They will only be assigned in the cases mentioned above. Keep in mind, in this scenario, topic IDs will still be newly assigned to all pre-existing topics and will show up in metadata. This means we are not ensured the same guarantees we had in 2.8. *It is just the LISR/partition.metadata part of the code that is affected. The controller and ZooKeeper will still correctly assign topic IDs to new topics upon upgrade. We will also see this reflected in metadata responses.*

The problem is two-fold:
1. We ignore LISR requests when the partition leader epoch has not increased (previously we assigned the ID before this check).
2. We only assign the topic ID when we are associating the log with the partition in ReplicaManager for the first time. Though in the scenario described above, we have logs associated with partitions that need to be upgraded.

We should check if the LISR request is resulting in a topic ID addition and add logic for logs already associated with partitions in ReplicaManager.

> Upgrading to topic IDs in LISR requests has gaps introduced in 3.0
> ------------------------------------------------------------------
>
> Key: KAFKA-13132
> URL: https://issues.apache.org/jira/browse/KAFKA-13132
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Justine Olshan
> Assignee: Justine Olshan
> Priority: Major
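The fix sketched in the description — assign the request's topic ID to a log that is already associated with the partition but has no ID yet — might look roughly like this. Everything here (`TopicIdUpgradeSketch`, `Log`, `maybeAssignTopicId`) is a hypothetical illustration, not Kafka's actual ReplicaManager code:

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical sketch of the proposed fix: when a LeaderAndIsr request
// carries a topic ID and an already-associated log has none (the upgrade
// scenario above), assign the ID instead of only assigning it when the
// log is first loaded.
class TopicIdUpgradeSketch {
    static class Log {
        Optional<UUID> topicId = Optional.empty();
    }

    // Returns true if the request's topic ID was newly assigned to the log.
    static boolean maybeAssignTopicId(Log existingLog, Optional<UUID> requestTopicId) {
        if (requestTopicId.isPresent() && existingLog.topicId.isEmpty()) {
            existingLog.topicId = requestTopicId; // upgrade path: existing log gains an ID
            return true;
        }
        return false; // already assigned, or the request carries no ID (old IBP)
    }
}
```

The real change also has to run this check even when the partition leader epoch has not increased, since that early return is the first half of the problem described above.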
[GitHub] [kafka] chia7712 commented on pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
chia7712 commented on pull request #11128:
URL: https://github.com/apache/kafka/pull/11128#issuecomment-886328806

> Negative performance impact

This is a good reason.

> The count for NONE would be wrong

As the "scoped error" can vary between responses, we need to count NONE case by case. Thanks for the response; I will update the code later.
[GitHub] [kafka] ijuma commented on pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
ijuma commented on pull request #11128:
URL: https://github.com/apache/kafka/pull/11128#issuecomment-886323968

There are two problems:
1. The count for NONE would be wrong
2. Negative performance impact
[GitHub] [kafka] chia7712 commented on pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
chia7712 commented on pull request #11128:
URL: https://github.com/apache/kafka/pull/11128#issuecomment-886321557

> Are there other similar cases in the original PR or is this the only one?

`StopReplicaResponse` (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/requests/StopReplicaResponse.java#L55) has a similar issue. I will take a look at the others.

On another note, I'm thinking about a better code style. As the server side does not add partition errors to the response when the top level gets an error, `errorCounts` should be fine to collect all error codes from all elements. It seems to me the only challenge is that the count of `NONE` may get higher. Could you share the regression you mentioned with me, in case I am overlooking the real problem?
[GitHub] [kafka] ijuma commented on pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
ijuma commented on pull request #11128:
URL: https://github.com/apache/kafka/pull/11128#issuecomment-886314473

Thanks for the PR. Are there other similar cases in the original PR or is this the only one?
[jira] [Created] (KAFKA-13135) Reduce GroupMetadata lock contention for offset commit requests
David Mao created KAFKA-13135:
------------------------------

Summary: Reduce GroupMetadata lock contention for offset commit requests
Key: KAFKA-13135
URL: https://issues.apache.org/jira/browse/KAFKA-13135
Project: Kafka
Issue Type: Improvement
Reporter: David Mao

as suggested by [~lbradstreet], we can look for similar optimizations to https://issues.apache.org/jira/browse/KAFKA-13134 in the offset commit path. It looks like there are some straightforward optimizations possible for the error path.
[GitHub] [kafka] chia7712 commented on a change in pull request #9433: KAFKA-10607: Consistent behaviour for response errorCounts()
chia7712 commented on a change in pull request #9433:
URL: https://github.com/apache/kafka/pull/9433#discussion_r676217594

## File path: clients/src/main/java/org/apache/kafka/common/requests/MetadataResponse.java

@@ -109,8 +109,10 @@ public int throttleTimeMs() {
     @Override
     public Map<Errors, Integer> errorCounts() {
         Map<Errors, Integer> errorCounts = new HashMap<>();
-        data.topics().forEach(metadata ->
-            updateErrorCounts(errorCounts, Errors.forCode(metadata.errorCode())));
+        data.topics().forEach(metadata -> {
+            metadata.partitions().forEach(p -> updateErrorCounts(errorCounts, Errors.forCode(p.errorCode())));

Review comment: https://github.com/apache/kafka/pull/11127
[GitHub] [kafka] chia7712 opened a new pull request #11128: MINOR: remove partition-level error from MetadataResponse#errorCounts
chia7712 opened a new pull request #11128:
URL: https://github.com/apache/kafka/pull/11128

see https://github.com/apache/kafka/pull/9433#discussion_r676083224

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
[GitHub] [kafka] chia7712 commented on a change in pull request #9433: KAFKA-10607: Consistent behaviour for response errorCounts()
chia7712 commented on a change in pull request #9433:
URL: https://github.com/apache/kafka/pull/9433#discussion_r676212229

## File path: clients/src/main/java/org/apache/kafka/common/requests/MetadataResponse.java

@@ -109,8 +109,10 @@ public int throttleTimeMs() {
     @Override
     public Map<Errors, Integer> errorCounts() {
         Map<Errors, Integer> errorCounts = new HashMap<>();
-        data.topics().forEach(metadata ->
-            updateErrorCounts(errorCounts, Errors.forCode(metadata.errorCode())));
+        data.topics().forEach(metadata -> {
+            metadata.partitions().forEach(p -> updateErrorCounts(errorCounts, Errors.forCode(p.errorCode())));

Review comment:

> A metadata request has topics as the "scoped error" (one cannot request a metadata request for a given partition).

I checked the server-side code. You are right. This change should be reverted. I will file a PR to fix it ASAP. This behavior can be broken inadvertently (the server side or client side adding/removing a specifically scoped error); not sure whether we can prevent that effectively (maybe add more tests).
[jira] [Comment Edited] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386931#comment-17386931 ]

Shamsher Singh Rana edited comment on KAFKA-5431 at 7/25/21, 5:33 PM:
----------------------------------------------------------------------

Hi [~mswathi], please update the status of the above issue.

was (Author: rana6627):

Hi Swathi, please update the status of the above issue.

> LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
> -------------------------------------------------------------------------------
>
> Key: KAFKA-5431
> URL: https://issues.apache.org/jira/browse/KAFKA-5431
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.2.1
> Reporter: Carsten Rietz
> Assignee: huxihx
> Priority: Major
> Labels: reliability
> Fix For: 0.11.0.1, 1.0.0
>
> Hey all,
> i have a strange problem with our uat cluster of 3 kafka brokers.
> the __consumer_offsets topic was replicated to two instances and our disks ran full due to a wrong configuration of the log cleaner. We fixed the configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today i increased the replication of the __consumer_offsets topic to 3 and triggered replication to the third cluster via kafka-reassign-partitions.sh. That went well but i get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,18] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,24] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> Which i think are due to the full disk event.
> The log cleaner threads died on these wrong messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner)
> {code}
> Looking at the file i see that some are truncated and some are just empty:
> $ ls -lsh 00594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log
> Sadly i do not have the logs any more from the disk full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and restarting the brokers?
> * Why did kafka not handle the disk full event well? Is this only affecting the cleanup or may we also lose data?
> * Is this maybe caused by the combination of upgrade and disk full?
> And last but not least: Keep up the good work. Kafka is really performing well while being easy to administer and has good documentation!
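The "minimum record overhead (14)" in the errors quoted above comes, as far as I understand it, from the legacy message format's fixed header: CRC (4 bytes) + magic byte (1) + attributes (1) + key length (4) + value length (4) = 14 bytes, so any record shorter than that has to be corrupt (e.g. a log truncated by a full disk). A sketch of such a sanity check, with hypothetical names rather than Kafka's actual classes:

```java
// Sketch of the size sanity check behind the "Record size is less than the
// minimum record overhead (14)" error. The breakdown below assumes the
// legacy (pre-0.11) message format's fixed header layout.
class RecordOverheadSketch {
    static final int CRC_LENGTH = 4;        // record checksum
    static final int MAGIC_LENGTH = 1;      // format version byte
    static final int ATTRIBUTES_LENGTH = 1; // compression codec, etc.
    static final int KEY_SIZE_LENGTH = 4;   // key length field
    static final int VALUE_SIZE_LENGTH = 4; // value length field

    static final int MIN_RECORD_OVERHEAD =
        CRC_LENGTH + MAGIC_LENGTH + ATTRIBUTES_LENGTH + KEY_SIZE_LENGTH + VALUE_SIZE_LENGTH;

    // Reject records too small to even contain the fixed header.
    static void validateRecordSize(int recordSizeBytes) {
        if (recordSizeBytes < MIN_RECORD_OVERHEAD)
            throw new IllegalStateException(
                "Record size is less than the minimum record overhead (" + MIN_RECORD_OVERHEAD + ")");
    }
}
```

This is why zero-length or truncated segment files like the 100M sparse file shown above trip the check on the very first record.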
[jira] [Commented] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386931#comment-17386931 ]

Shamsher Singh Rana commented on KAFKA-5431:
--------------------------------------------

Hi Swathi, please update the status of the above issue.
[GitHub] [kafka] omkreddy merged pull request #11125: MINOR: Update `./gradlew allDepInsight` example in README
omkreddy merged pull request #11125:
URL: https://github.com/apache/kafka/pull/11125
[GitHub] [kafka] splett2 opened a new pull request #11127: KAFKA-13134: Give up group metadata lock before sending heartbeat response
splett2 opened a new pull request #11127:
URL: https://github.com/apache/kafka/pull/11127

### What
Small locking improvement to drop the group metadata lock before invoking the response callback.

### Testing
Relying on existing unit tests since this is a minor change.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
[jira] [Created] (KAFKA-13134) Heartbeat Request high lock contention
David Mao created KAFKA-13134:
------------------------------

Summary: Heartbeat Request high lock contention
Key: KAFKA-13134
URL: https://issues.apache.org/jira/browse/KAFKA-13134
Project: Kafka
Issue Type: Improvement
Components: core
Reporter: David Mao
Assignee: David Mao

On a cluster with high heartbeat rate, a lock profile showed high contention for the GroupMetadata lock. We can significantly reduce this by invoking the response callback outside of the group metadata lock.
[jira] [Created] (KAFKA-13133) Replace EasyMock and PowerMock with Mockito for AbstractHerderTest
YI-CHEN WANG created KAFKA-13133:
---------------------------------

Summary: Replace EasyMock and PowerMock with Mockito for AbstractHerderTest
Key: KAFKA-13133
URL: https://issues.apache.org/jira/browse/KAFKA-13133
Project: Kafka
Issue Type: Sub-task
Reporter: YI-CHEN WANG
Assignee: YI-CHEN WANG
permission request
Hi, I'm interested in contributing to the Kafka project. Jira username: KahnCheny