[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-16710:
---------------------------
    Affects Version/s: 3.8.0

> Continuously `makeFollower` may cause the replica fetcher thread to encounter
> an offset mismatch exception when `processPartitionData`
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-16710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16710
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 2.8.1, 3.8.0
>            Reporter: hudeqi
>            Assignee: hudeqi
>            Priority: Blocker
>         Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png,
>                      企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png,
>                      企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
>
> This case occurs during a reassignment of a partition: 110879, 110880
> (original leader, original follower) ---> 110879, 110880, 110881, 113915
> (the latter two replicas are the new leader and new follower) ---> 110881,
> 113915 (new leader, new follower). The "Offset mismatch" exception occurs on
> the new follower, 113915.
> Analysis shows the exception arises during the reassignment as follows:
> # After the new replicas 110881 and 113915 have fully joined the ISR, the
> controller switches the leader from 110879 to 110881 and sends a new
> `leaderAndIsr` request (leader 110881, ISR 110879, 110880, 110881, 113915)
> to 110881 and 113915.
> # 110881 executes `makeLeader`, and 113915 executes `makeFollower`. After
> the new follower 113915 completes `removeFetcherForPartitions` and
> `addFetcherForPartitions`, it starts fetching from the new leader 110881.
> Because the log end offset of the new leader 110881 (18735600055) is smaller
> than the log end offset of the new follower 113915 (18735600059), the new
> follower adds the partition to `divergingEndOffsets` during
> `processFetchRequest` and then calls `truncateOnFetchResponse` to truncate
> its local log to 18735600055.
> # However, `truncateOnFetchResponse` must acquire `partitionMapLock`. At the
> same time, the new leader 110881 and the new follower 113915 receive another
> `leaderAndIsr` request from the controller (removing the old replicas 110879
> and 110880 from the ISR). The `ReplicaFetcherManager` thread of the new
> follower 113915 runs a second `makeFollower`: it acquires `partitionMapLock`
> first, executes `removeFetcherForPartitions`, then reads the local log end
> offset (18735600059) as the fetch offset, preparing to call
> `addFetcherForPartitions` again to install that offset into
> `partitionStates`.
> # Unfortunately, the follower fetcher thread that was about to truncate the
> local log to 18735600055 then wins the race for `partitionMapLock` and
> completes the truncation, so the log end offset is now 18735600055.
> # Next, the thread running the second `makeFollower` acquires
> `partitionMapLock` and executes `addFetcherForPartitions`, installing the
> now-stale fetch offset (18735600059) into `partitionStates`.
> # As a result, the follower fetcher thread throws the following exception in
> `processPartitionData`: "java.lang.IllegalStateException: Offset mismatch
> for partition aiops-adplatform-interfacelog-191: fetched offset =
> 18735600059, log end offset = 18735600055."
>
> The relevant logs are attached.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
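The interleaving above can be replayed as a small, self-contained simulation. This is a Python sketch with illustrative names only; the real logic is Scala/Java in Kafka's fetcher and replica-manager code, and none of the identifiers below are Kafka's actual API:

```python
class FollowerState:
    """Toy model of the follower's local log and its partitionStates entry."""
    def __init__(self, log_end_offset):
        self.log_end_offset = log_end_offset  # local replica log end offset (LEO)
        self.fetch_offset = log_end_offset    # offset recorded in partitionStates

def snapshot_fetch_offset(state):
    # Step 3: the second makeFollower reads the current LEO as the next fetch offset.
    return state.log_end_offset

def truncate_on_fetch_response(state, leader_leo):
    # Step 4: the fetcher thread truncates the local log to the leader's smaller LEO.
    state.log_end_offset = min(state.log_end_offset, leader_leo)

def add_fetcher_for_partitions(state, fetch_offset):
    # Step 5: the (now stale) snapshot is installed into partitionStates.
    state.fetch_offset = fetch_offset

def process_partition_data(state):
    # Step 6: the consistency check that surfaces as IllegalStateException
    # in the real broker.
    if state.fetch_offset != state.log_end_offset:
        raise RuntimeError(
            f"Offset mismatch: fetched offset = {state.fetch_offset}, "
            f"log end offset = {state.log_end_offset}")

# Replaying the ticket's interleaving:
state = FollowerState(18735600059)
stale = snapshot_fetch_offset(state)            # makeFollower reads 18735600059
truncate_on_fetch_response(state, 18735600055)  # log truncated to 18735600055
add_fetcher_for_partitions(state, stale)        # stale 18735600059 installed
# process_partition_data(state) would now raise the offset-mismatch error.
```

The sketch makes the root cause explicit: the fetch offset is sampled before the lock-protected truncation, so the value installed afterward no longer matches the log.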
[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-16710:
---------------------------
    Attachment: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png
                企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png
                企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-16710:
---------------------------
    Description: (updated)
[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-16710:
---------------------------
    Description: (updated)
[jira] [Created] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
hudeqi created KAFKA-16710:
-------------------------------
             Summary: Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
                 Key: KAFKA-16710
                 URL: https://issues.apache.org/jira/browse/KAFKA-16710
             Project: Kafka
          Issue Type: Bug
          Components: core, replication
    Affects Versions: 2.8.1
            Reporter: hudeqi
            Assignee: hudeqi
[jira] [Updated] (KAFKA-16543) There may be ambiguous deletions in the `cleanupGroupMetadata` when the generation of the group is less than or equal to 0
[ https://issues.apache.org/jira/browse/KAFKA-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi updated KAFKA-16543:
---------------------------
    Description: (updated)

> There may be ambiguous deletions in the `cleanupGroupMetadata` when the
> generation of the group is less than or equal to 0
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-16543
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16543
>             Project: Kafka
>          Issue Type: Bug
>          Components: group-coordinator
>    Affects Versions: 3.6.2
>            Reporter: hudeqi
>            Assignee: hudeqi
>            Priority: Major
>
> In the `cleanupGroupMetadata` method, a tombstone message is written to
> delete the group's MetadataKey only when the group is in the Dead state and
> its generation is greater than 0. The comment explains: "We avoid writing
> the tombstone when the generationId is 0, since this group is only using
> Kafka for offset storage." That is, groups that only use Kafka for offset
> storage should not be deleted. However, there is a case where, for example,
> Flink commits offsets with a generationId of -1. If ApiKeys.DELETE_GROUPS is
> called to delete such a group, Flink's group metadata is never deleted, even
> though the preceding logic has already cleaned up the commit keys by writing
> tombstone messages for `removedOffsets`. The observable result is: the group
> no longer exists (since its offsets have been cleaned up, it cannot be added
> back to the `groupMetadataCache` unless offsets are committed again under
> the same group name), yet the corresponding group metadata still exists in
> __consumer_offsets. Deleting the group therefore does not completely clean
> up its related information.
> The group's state is set to Dead in only three situations:
> 1. The group information is unloaded.
> 2. The group is deleted via ApiKeys.DELETE_GROUPS.
> 3. All offsets of the group have expired or been removed.
> Since the group is already in the Dead state and has been removed from the
> `groupMetadataCache`, why not directly clean up all of the group's
> information, even if it is only used for storing offsets?
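The guard described above can be summarized in a short sketch. This is illustrative Python, not the real code (which is Scala, in the group coordinator's `cleanupGroupMetadata`):

```python
DEAD = "Dead"

def writes_metadata_tombstone(group_state, generation_id):
    """Simplified model of the condition: the metadata-key tombstone is
    written only for Dead groups whose generation is greater than 0."""
    return group_state == DEAD and generation_id > 0

# A consumer that only uses Kafka for offset storage (e.g. Flink committing
# with generationId == -1) never satisfies the condition, so after
# DELETE_GROUPS its offsets are tombstoned but its metadata key is not.
writes_metadata_tombstone(DEAD, 5)    # -> True  (regular group)
writes_metadata_tombstone(DEAD, -1)   # -> False (offsets-only group, leaked)
```

The ticket's point is that this guard is too broad: once the group is Dead and gone from `groupMetadataCache`, skipping the metadata tombstone for generation <= 0 leaves an orphaned metadata record in __consumer_offsets.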
[jira] [Created] (KAFKA-16543) There may be ambiguous deletions in the `cleanupGroupMetadata` when the generation of the group is less than or equal to 0
hudeqi created KAFKA-16543:
-------------------------------
             Summary: There may be ambiguous deletions in the `cleanupGroupMetadata` when the generation of the group is less than or equal to 0
                 Key: KAFKA-16543
                 URL: https://issues.apache.org/jira/browse/KAFKA-16543
             Project: Kafka
          Issue Type: Bug
          Components: group-coordinator
    Affects Versions: 3.6.2
            Reporter: hudeqi
            Assignee: hudeqi
[jira] [Assigned] (KAFKA-15764) Missing tests for transactions
[ https://issues.apache.org/jira/browse/KAFKA-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi reassigned KAFKA-15764:
------------------------------
    Assignee: hudeqi

> Missing tests for transactions
> ------------------------------
>
>                 Key: KAFKA-15764
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15764
>             Project: Kafka
>          Issue Type: Test
>    Affects Versions: 3.6.0
>            Reporter: Divij Vaidya
>            Assignee: hudeqi
>            Priority: Major
>
> As part of https://issues.apache.org/jira/browse/KAFKA-15653 we found a bug
> that was not caught by any of the existing tests in Apache Kafka. However,
> the bug is consistently reproducible with the test suite shared by [~twmb]
> at
> https://issues.apache.org/jira/browse/KAFKA-15653?focusedCommentId=1846&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1846
> As part of this JIRA, we want to add a relevant test to Apache Kafka that
> would have failed. We can look at the test suite the franz-go repository
> runs, specifically the suite executed by "go test -run Txn" (see repro.sh
> attached to the comment linked above).
> With reference to the code in repro.sh, we don't need to run docker compose
> down or up inside the while loop, so it's enough to simply do:
> ```
> while go test -run Txn > FRANZ_FAIL; do :; done
> ```
[jira] [Commented] (KAFKA-15764) Missing tests for transactions
[ https://issues.apache.org/jira/browse/KAFKA-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781288#comment-17781288 ]

hudeqi commented on KAFKA-15764:
--------------------------------
I can pick it up.
[jira] [Commented] (KAFKA-15313) Delete remote log segments partition asynchronously when a partition is deleted.
[ https://issues.apache.org/jira/browse/KAFKA-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781257#comment-17781257 ]

hudeqi commented on KAFKA-15313:
--------------------------------
Hi [~abhijeetkumar], are you still following this issue? If you don't have time, I can take it over. Thanks.

> Delete remote log segments partition asynchronously when a partition is
> deleted.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-15313
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15313
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Satish Duggana
>            Assignee: Abhijeet Kumar
>            Priority: Major
>              Labels: KIP-405
>
> KIP-405 already covers the approach to delete remote log segments
> asynchronously through the controller and RLMM layers.
> Reference: https://github.com/apache/kafka/pull/13947#discussion_r1281675818
[jira] [Assigned] (KAFKA-15331) Handle remote log enabled topic deletion when leader is not available
[ https://issues.apache.org/jira/browse/KAFKA-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi reassigned KAFKA-15331:
------------------------------
    Assignee: hudeqi

> Handle remote log enabled topic deletion when leader is not available
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-15331
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15331
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Kamal Chandraprakash
>            Assignee: hudeqi
>            Priority: Major
>             Fix For: 3.7.0
>
> When a topic gets deleted, there can be a case where all the replicas are
> out of the ISR. This case is not handled; see
> https://github.com/apache/kafka/pull/13947#discussion_r1289331347 for more
> details.
[jira] [Commented] (KAFKA-15432) RLM Stop partitions should not be invoked for non-tiered storage topics
[ https://issues.apache.org/jira/browse/KAFKA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780818#comment-17780818 ]

hudeqi commented on KAFKA-15432:
--------------------------------
I will take it over if no one is following this. Thanks [~phuctran] [~ckamal]

> RLM Stop partitions should not be invoked for non-tiered storage topics
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-15432
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15432
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Kamal Chandraprakash
>            Assignee: hudeqi
>            Priority: Major
>
> When a stop-partition request is sent by the controller, it invokes
> RemoteLogManager#stopPartition even for internal and
> non-tiered-storage-enabled topics. The replica manager should not call this
> method for non-tiered-storage topics.
> See this comment for more details:
> https://github.com/apache/kafka/pull/14329#discussion_r1315675510
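The proposed guard could look roughly like the following sketch. All names here are hypothetical (Python, not the real Java/Scala code in ReplicaManager or RemoteLogManager); it only illustrates the filtering the ticket asks for:

```python
def is_tiered_storage_enabled(topic_config):
    # Hypothetical helper: reads the per-topic remote.storage.enable setting.
    return topic_config.get("remote.storage.enable", "false") == "true"

def stop_partitions(partitions, topic_configs, rlm_stop_partition):
    """Forward stop-partition requests to the remote log manager only for
    tiered-storage topics; skip internal and non-tiered topics."""
    forwarded = []
    for topic, partition in partitions:
        if topic.startswith("__"):  # internal topics, e.g. __consumer_offsets
            continue
        if not is_tiered_storage_enabled(topic_configs.get(topic, {})):
            continue
        rlm_stop_partition(topic, partition)
        forwarded.append((topic, partition))
    return forwarded
```

For example, with configs `{"tiered": {"remote.storage.enable": "true"}, "plain": {}}`, only the `tiered` partitions would reach the remote-log-manager callback.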
[jira] [Assigned] (KAFKA-15432) RLM Stop partitions should not be invoked for non-tiered storage topics
[ https://issues.apache.org/jira/browse/KAFKA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hudeqi reassigned KAFKA-15432:
------------------------------
    Assignee: hudeqi
[jira] [Commented] (KAFKA-15300) Include remotelog size in complete log size and also add local log size and remote log size separately in kafka-log-dirs tool.
[ https://issues.apache.org/jira/browse/KAFKA-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780816#comment-17780816 ]

hudeqi commented on KAFKA-15300:
--------------------------------
Can I take over this issue? [~satish.duggana] Can you describe it in detail?

> Include remotelog size in complete log size and also add local log size and
> remote log size separately in kafka-log-dirs tool.
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-15300
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15300
>             Project: Kafka
>          Issue Type: Task
>          Components: core
>            Reporter: Satish Duggana
>            Priority: Major
>             Fix For: 3.7.0
>
> Include the remote log size in the complete log size, and also report the
> local log size and remote log size separately in the kafka-log-dirs tool.
[jira] [Commented] (KAFKA-15376) Explore options of removing data earlier to the current leader's leader epoch lineage for topics enabled with tiered storage.
[ https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780815#comment-17780815 ] hudeqi commented on KAFKA-15376: Can I take over this issue? [~satish.duggana] > Explore options of removing data earlier to the current leader's leader epoch > lineage for topics enabled with tiered storage. > - > > Key: KAFKA-15376 > URL: https://issues.apache.org/jira/browse/KAFKA-15376 > Project: Kafka > Issue Type: Task > Components: core >Reporter: Satish Duggana >Priority: Major > Fix For: 3.7.0 > > > Followup on the discussion thread: > [https://github.com/apache/kafka/pull/13561#discussion_r1288778006] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in the RemoteLogManager
[ https://issues.apache.org/jira/browse/KAFKA-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780814#comment-17780814 ] hudeqi commented on KAFKA-15038: Hi, [~owen-leung] Are you still following this issue? If you don't have time, I can take over. Thanks. > Use topic id/name mapping from the Metadata cache in the RemoteLogManager > - > > Key: KAFKA-15038 > URL: https://issues.apache.org/jira/browse/KAFKA-15038 > Project: Kafka > Issue Type: Task > Components: core >Reporter: Alexandre Dupriez >Assignee: Owen C.H. Leung >Priority: Minor > Fix For: 3.7.0 > > > Currently, the {{RemoteLogManager}} maintains its own cache of topic name to > topic id > [[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138] > using the information provided during leadership changes, and removing the > mapping upon receiving the notification of partition stopped. > It should be possible to re-use the mapping in a broker's metadata cache, > removing the need for the RLM to build and update a local cache thereby > duplicating the information in the metadata cache. It also allows to preserve > a single source of authority regarding the association between topic names > and ids. > [1] > https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15331) Handle remote log enabled topic deletion when leader is not available
[ https://issues.apache.org/jira/browse/KAFKA-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780812#comment-17780812 ] hudeqi commented on KAFKA-15331: Hi, [~_gargantua_] Are you still following this issue? If you don't have time, I can take over. Thanks. > Handle remote log enabled topic deletion when leader is not available > - > > Key: KAFKA-15331 > URL: https://issues.apache.org/jira/browse/KAFKA-15331 > Project: Kafka > Issue Type: Bug >Reporter: Kamal Chandraprakash >Priority: Major > Fix For: 3.7.0 > > > When a topic gets deleted, then there can be a case where all the replicas > can be out of ISR. This case is not handled, See: > [https://github.com/apache/kafka/pull/13947#discussion_r1289331347] for more > details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-12641) Clear RemoteLogLeaderEpochState entry when it become empty.
[ https://issues.apache.org/jira/browse/KAFKA-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780809#comment-17780809 ] hudeqi commented on KAFKA-12641: Hi, [~abhijeetkumar] Are you still following this issue? If you don't have time, I can take over. Thanks. > Clear RemoteLogLeaderEpochState entry when it become empty. > > > Key: KAFKA-12641 > URL: https://issues.apache.org/jira/browse/KAFKA-12641 > Project: Kafka > Issue Type: Bug >Reporter: Satish Duggana >Assignee: Abhijeet Kumar >Priority: Major > > https://github.com/apache/kafka/pull/10218#discussion_r609895193 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-13355) Shutdown broker eventually when unrecoverable exceptions like IOException encountered in RLMM.
[ https://issues.apache.org/jira/browse/KAFKA-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780810#comment-17780810 ] hudeqi commented on KAFKA-13355: Hi, [~abhijeetkumar] Are you still following this issue? If you don't have time, I can take over. Thanks. > Shutdown broker eventually when unrecoverable exceptions like IOException > encountered in RLMM. > --- > > Key: KAFKA-13355 > URL: https://issues.apache.org/jira/browse/KAFKA-13355 > Project: Kafka > Issue Type: Bug >Reporter: Satish Duggana >Assignee: Abhijeet Kumar >Priority: Major > Labels: tiered-storage > Fix For: 3.7.0 > > > Have mechanism to catch unrecoverable exceptions like IOException from RLMM > and shutdown the broker like it is done in log layer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15671) Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache
[ https://issues.apache.org/jira/browse/KAFKA-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-15671: -- Assignee: hudeqi > Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache > -- > > Key: KAFKA-15671 > URL: https://issues.apache.org/jira/browse/KAFKA-15671 > Project: Kafka > Issue Type: Test > Components: Tiered-Storage >Reporter: Divij Vaidya >Assignee: hudeqi >Priority: Major > > {{context: > [https://github.com/apache/kafka/pull/14483#issuecomment-1775107621] }} > {{Example of failure in trunk (from 23rd Oct)}} > {{[https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe%2FBerlin&tests.container=kafka.log.remote.RemoteIndexCacheTest&tests.test=testCorrectnessForCacheAndIndexFilesWhenResizeCache()] > }} > {{Stack trace: > [https://ge.apache.org/s/ahssoveatyg6k/tests/task/:core:test/details/kafka.log.remote.RemoteIndexCacheTest/testCorrectnessForCacheAndIndexFilesWhenResizeCache()?expanded-stacktrace=WyIwIl0&top-execution=1] > }} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15671) Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache
[ https://issues.apache.org/jira/browse/KAFKA-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779022#comment-17779022 ] hudeqi commented on KAFKA-15671: Hi [~brian030128], I saw that this issue needs a quick fix. I'm quite familiar with this issue and have submitted a PR to run the CI tests. > Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache > -- > > Key: KAFKA-15671 > URL: https://issues.apache.org/jira/browse/KAFKA-15671 > Project: Kafka > Issue Type: Test > Components: Tiered-Storage >Reporter: Divij Vaidya >Assignee: Brian >Priority: Major > > {{context: > [https://github.com/apache/kafka/pull/14483#issuecomment-1775107621] }} > {{Example of failure in trunk (from 23rd Oct)}} > {{[https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe%2FBerlin&tests.container=kafka.log.remote.RemoteIndexCacheTest&tests.test=testCorrectnessForCacheAndIndexFilesWhenResizeCache()] > }} > {{Stack trace: > [https://ge.apache.org/s/ahssoveatyg6k/tests/task/:core:test/details/kafka.log.remote.RemoteIndexCacheTest/testCorrectnessForCacheAndIndexFilesWhenResizeCache()?expanded-stacktrace=WyIwIl0&top-execution=1] > }} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15396) Add a metric indicating the version of the current running kafka server
[ https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15396: --- Affects Version/s: 3.6.0 (was: 3.5.1) > Add a metric indicating the version of the current running kafka server > --- > > Key: KAFKA-15396 > URL: https://issues.apache.org/jira/browse/KAFKA-15396 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.6.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Labels: kip-required > > At present, it is impossible to perceive the Kafka version that the broker is > running from the perspective of metrics. If multiple Kafka versions are > deployed in a cluster due to various reasons, it is difficult for us to > intuitively understand the version distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
[ https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15607: --- Attachment: SeaTalk_IMG_20231019_174340.png > Possible NPE is thrown in MirrorCheckpointTask > -- > > Key: KAFKA-15607 > URL: https://issues.apache.org/jira/browse/KAFKA-15607 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.8.1, 3.6.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_20231019_172945.png, > SeaTalk_IMG_20231019_174340.png > > > In the `syncGroupOffset` method, if > `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of > `latestDownstreamOffset` will throw NPE. This usually occurs in this > situation: a group consumed a topic in the target cluster previously. Later, > the group offset of some partitions was reset to -1, the `OffsetAndMetadata` > of these partitions was null. > It is possible that when reset offsets are performed in the java kafka > client, the reset to -1 will be intercepted. However, there are some other > types of clients such as sarama, which can magically reset the group offset > to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a > defensive measure to avoid NPE is needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
[ https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15607: --- Affects Version/s: 3.7.0 > Possible NPE is thrown in MirrorCheckpointTask > -- > > Key: KAFKA-15607 > URL: https://issues.apache.org/jira/browse/KAFKA-15607 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.8.1, 3.7.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_20231019_172945.png > > > In the `syncGroupOffset` method, if > `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of > `latestDownstreamOffset` will throw NPE. This usually occurs in this > situation: a group consumed a topic in the target cluster previously. Later, > the group offset of some partitions was reset to -1, the `OffsetAndMetadata` > of these partitions was null. > It is possible that when reset offsets are performed in the java kafka > client, the reset to -1 will be intercepted. However, there are some other > types of clients such as sarama, which can magically reset the group offset > to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a > defensive measure to avoid NPE is needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
[ https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15607: --- Affects Version/s: 3.6.0 (was: 3.7.0) > Possible NPE is thrown in MirrorCheckpointTask > -- > > Key: KAFKA-15607 > URL: https://issues.apache.org/jira/browse/KAFKA-15607 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.8.1, 3.6.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_20231019_172945.png > > > In the `syncGroupOffset` method, if > `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of > `latestDownstreamOffset` will throw NPE. This usually occurs in this > situation: a group consumed a topic in the target cluster previously. Later, > the group offset of some partitions was reset to -1, the `OffsetAndMetadata` > of these partitions was null. > It is possible that when reset offsets are performed in the java kafka > client, the reset to -1 will be intercepted. However, there are some other > types of clients such as sarama, which can magically reset the group offset > to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a > defensive measure to avoid NPE is needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
[ https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15607: --- Attachment: SeaTalk_IMG_20231019_172945.png > Possible NPE is thrown in MirrorCheckpointTask > -- > > Key: KAFKA-15607 > URL: https://issues.apache.org/jira/browse/KAFKA-15607 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_20231019_172945.png > > > In the `syncGroupOffset` method, if > `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of > `latestDownstreamOffset` will throw NPE. This usually occurs in this > situation: a group consumed a topic in the target cluster previously. Later, > the group offset of some partitions was reset to -1, the `OffsetAndMetadata` > of these partitions was null. > It is possible that when reset offsets are performed in the java kafka > client, the reset to -1 will be intercepted. However, there are some other > types of clients such as sarama, which can magically reset the group offset > to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a > defensive measure to avoid NPE is needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
[ https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15607: --- Description: In the `syncGroupOffset` method, if `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of `latestDownstreamOffset` will throw NPE. This usually occurs in this situation: a group consumed a topic in the target cluster previously. Later, the group offset of some partitions was reset to -1, the `OffsetAndMetadata` of these partitions was null. It is possible that when reset offsets are performed in the java kafka client, the reset to -1 will be intercepted. However, there are some other types of clients such as sarama, which can magically reset the group offset to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a defensive measure to avoid NPE is needed here. was:In the `syncGroupOffset` method, if `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of `latestDownstreamOffset` will throw NPE. This usually occurs in this situation: a group consumed a topic in the target cluster previously. Later, the group offset of some partitions was reset to -1, the `OffsetAndMetadata` of these partitions was null. > Possible NPE is thrown in MirrorCheckpointTask > -- > > Key: KAFKA-15607 > URL: https://issues.apache.org/jira/browse/KAFKA-15607 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > > In the `syncGroupOffset` method, if > `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of > `latestDownstreamOffset` will throw NPE. This usually occurs in this > situation: a group consumed a topic in the target cluster previously. Later, > the group offset of some partitions was reset to -1, the `OffsetAndMetadata` > of these partitions was null. 
> It is possible that when reset offsets are performed in the java kafka > client, the reset to -1 will be intercepted. However, there are some other > types of clients such as sarama, which can magically reset the group offset > to -1, so MM2 will trigger an NPE exception in this scenario. Therefore, a > defensive measure to avoid NPE is needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
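The defensive measure KAFKA-15607 calls for might look roughly like this minimal sketch. The map shape and the `OffsetAndMetadata` record below are simplified stand-ins for MirrorCheckpointTask's actual types, not MM2's real code:

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointSketch {
    // Hypothetical stand-in for Kafka's OffsetAndMetadata: just holds an offset.
    record OffsetAndMetadata(long offset) {}

    // Returns the downstream offset for a partition, or null when the group
    // offset was reset and no OffsetAndMetadata exists. Without the null
    // check, calling offset() on a null map value would throw an NPE.
    static Long latestDownstreamOffset(Map<String, OffsetAndMetadata> targetConsumerOffset,
                                       String topicPartition) {
        OffsetAndMetadata meta = targetConsumerOffset.get(topicPartition);
        if (meta == null) {
            // Defensive: the offset was reset to -1 (e.g. by a sarama client),
            // so there is simply no downstream offset to compare against.
            return null;
        }
        return meta.offset();
    }

    public static void main(String[] args) {
        Map<String, OffsetAndMetadata> offsets = new HashMap<>();
        offsets.put("topic-0", new OffsetAndMetadata(42L));
        offsets.put("topic-1", null); // partition whose group offset was reset
        System.out.println(latestDownstreamOffset(offsets, "topic-0")); // 42
        System.out.println(latestDownstreamOffset(offsets, "topic-1")); // null
    }
}
```

The caller then skips checkpointing the partition when the helper returns null rather than crashing the task.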
[jira] [Created] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask
hudeqi created KAFKA-15607: -- Summary: Possible NPE is thrown in MirrorCheckpointTask Key: KAFKA-15607 URL: https://issues.apache.org/jira/browse/KAFKA-15607 Project: Kafka Issue Type: Bug Components: mirrormaker Affects Versions: 2.8.1 Reporter: hudeqi Assignee: hudeqi In the `syncGroupOffset` method, if `targetConsumerOffset.get(topicPartition)` gets null, then the calculation of `latestDownstreamOffset` will throw NPE. This usually occurs in this situation: a group consumed a topic in the target cluster previously. Later, the group offset of some partitions was reset to -1, the `OffsetAndMetadata` of these partitions was null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15535) Add documentation of "remote.log.index.file.cache.total.size.bytes" configuration property.
[ https://issues.apache.org/jira/browse/KAFKA-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772948#comment-17772948 ] hudeqi commented on KAFKA-15535: Can this be marked as resolved? The doc for "remote.log.index.file.cache.total.size.bytes" is already available, and I have also double-checked the other public tiered storage configurations; nothing seems to be missing. [~satish.duggana] > Add documentation of "remote.log.index.file.cache.total.size.bytes" > configuration property. > > > Key: KAFKA-15535 > URL: https://issues.apache.org/jira/browse/KAFKA-15535 > Project: Kafka > Issue Type: Task > Components: documentation >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > Labels: tiered-storage > Fix For: 3.7.0 > > > Add documentation of "remote.log.index.file.cache.total.size.bytes" > configuration property. > Please double check all the existing public tiered storage configurations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15536) dynamically resize remoteIndexCache
[ https://issues.apache.org/jira/browse/KAFKA-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15536: --- Affects Version/s: 3.7.0 (was: 3.6.0) > dynamically resize remoteIndexCache > --- > > Key: KAFKA-15536 > URL: https://issues.apache.org/jira/browse/KAFKA-15536 > Project: Kafka > Issue Type: Improvement > Components: Tiered-Storage >Affects Versions: 3.7.0 >Reporter: Luke Chen >Assignee: hudeqi >Priority: Major > > context: > https://github.com/apache/kafka/pull/14243#discussion_r1320630057 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15535) Add documentation of "remote.log.index.file.cache.total.size.bytes" configuration property.
[ https://issues.apache.org/jira/browse/KAFKA-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-15535: -- Assignee: hudeqi > Add documentation of "remote.log.index.file.cache.total.size.bytes" > configuration property. > > > Key: KAFKA-15535 > URL: https://issues.apache.org/jira/browse/KAFKA-15535 > Project: Kafka > Issue Type: Task > Components: documentation >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > Fix For: 3.7.0 > > > Add documentation of "remote.log.index.file.cache.total.size.bytes" > configuration property. > Please double check all the existing public tiered storage configurations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reopened KAFKA-15397: > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763002#comment-17763002 ] hudeqi commented on KAFKA-14912: Thanks, [~showuon] . The PR corresponding to this issue is currently limited based on the "bytes size", please review it if you have time. > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > > Context: We need to make the 1024 value here [1] as dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
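A bytes-bounded, dynamically resizable cache of the kind discussed in KAFKA-14912 and KAFKA-15536 can be sketched with a plain access-ordered LinkedHashMap. This is illustrative only — the real RemoteIndexCache is considerably more involved, and all names below are made up for the sketch:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class ByteBoundedCacheSketch {
    // Access-ordered LRU map that evicts entries once the total tracked
    // byte size exceeds a limit that can be adjusted at runtime.
    static class ByteBoundedCache<K> extends LinkedHashMap<K, byte[]> {
        private long maxBytes;
        private long currentBytes;

        ByteBoundedCache(long maxBytes) {
            super(16, 0.75f, true); // accessOrder=true -> LRU iteration order
            this.maxBytes = maxBytes;
        }

        synchronized void putEntry(K key, byte[] value) {
            byte[] old = super.put(key, value);
            if (old != null) currentBytes -= old.length;
            currentBytes += value.length;
            evictIfNeeded();
        }

        // Dynamic resize: change the byte limit and immediately evict down to it.
        synchronized void resize(long newMaxBytes) {
            this.maxBytes = newMaxBytes;
            evictIfNeeded();
        }

        private void evictIfNeeded() {
            Iterator<Map.Entry<K, byte[]>> it = entrySet().iterator();
            while (currentBytes > maxBytes && it.hasNext()) {
                currentBytes -= it.next().getValue().length; // eldest first
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        ByteBoundedCache<String> cache = new ByteBoundedCache<>(8);
        cache.putEntry("a", new byte[4]);
        cache.putEntry("b", new byte[4]);
        cache.resize(4); // shrink: least-recently-used entry "a" is evicted
        System.out.println(cache.keySet()); // [b]
    }
}
```

The key design point mirrored here is that the limit is expressed in bytes rather than entry count, so resizing the config takes effect immediately by evicting until the cache fits.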
[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761165#comment-17761165 ] hudeqi commented on KAFKA-15397: yes, the analysis in the attachment is based on the heap dump. > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761149#comment-17761149 ] hudeqi commented on KAFKA-15397: I applied the fix from this PR in production and saw that the frequency of full GC was greatly reduced, but it still occurred after a long running time. It was also difficult for me to simulate the client's abnormal sending behavior. Do you have any findings or suggestions? [~showuon] > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760223#comment-17760223 ] hudeqi commented on KAFKA-15397: [~showuon] link: [commit|https://github.com/apache/kafka/commit/b401fdaefbfbd9263deef0ab92b6093eaeb26d9b] . In my own fix I not only applied this PR's handling, but also performed a set-null operation on the leaked collection. > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi resolved KAFKA-15397. Resolution: Resolved > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760204#comment-17760204 ] hudeqi commented on KAFKA-15397: It seems that this commit: "MINOR: Add more validation during KRPC deserialization" can effectively avoid this memory overflow problem, so I will close this issue first. > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15397: --- Affects Version/s: 2.8.1 (was: 3.5.1) > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.8.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15396) Add a metric indicating the version of the current running kafka server
[ https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759514#comment-17759514 ] hudeqi commented on KAFKA-15396: Hi [~divijvaidya] [~jolshan], I have added the KIP-972 link to this jira, thanks. > Add a metric indicating the version of the current running kafka server > --- > > Key: KAFKA-15396 > URL: https://issues.apache.org/jira/browse/KAFKA-15396 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.5.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Labels: kip-required > > At present, it is impossible to perceive the Kafka version that the broker is > running from the perspective of metrics. If multiple Kafka versions are > deployed in a cluster due to various reasons, it is difficult for us to > intuitively understand the version distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15396) Add a metric indicating the version of the current running kafka server
[ https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759492#comment-17759492 ] hudeqi commented on KAFKA-15396: thanks, I will submit here. > Add a metric indicating the version of the current running kafka server > --- > > Key: KAFKA-15396 > URL: https://issues.apache.org/jira/browse/KAFKA-15396 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.5.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Labels: kip-required > > At present, it is impossible to perceive the Kafka version that the broker is > running from the perspective of metrics. If multiple Kafka versions are > deployed in a cluster due to various reasons, it is difficult for us to > intuitively understand the version distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15397: --- Attachment: SeaTalk_IMG_1692796505.png SeaTalk_IMG_1692796533.png SeaTalk_IMG_1692796604.png SeaTalk_IMG_1692796891.png > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 3.5.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, > SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png > > > When the client sends a produce request in an abnormal way and the server > accepts it for deserialization, a "java.lang.IllegalArgumentException" may > occur, which will cause a large number of "TopicProduceData" objects to be > instantiated without being cleaned up. In the end, the entire service of > kafka is OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
[ https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15397: --- Description: When a client sends a malformed produce request and the server accepts it for deserialization, a "java.lang.IllegalArgumentException" may occur, which causes a large number of "TopicProduceData" objects to be instantiated without being cleaned up. Eventually the entire Kafka service goes OOM. > Deserializing produce requests may cause memory leaks when exceptions occur > --- > > Key: KAFKA-15397 > URL: https://issues.apache.org/jira/browse/KAFKA-15397 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 3.5.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Blocker > > When a client sends a malformed produce request and the server accepts it > for deserialization, a "java.lang.IllegalArgumentException" may occur, which > causes a large number of "TopicProduceData" objects to be instantiated > without being cleaned up. Eventually the entire Kafka service goes OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
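The failure mode described above can be illustrated with a small hypothetical sketch (the `TopicData` record and both `parse` methods are stand-ins, not Kafka's actual generated request classes): objects allocated before the exception stay reachable only if they were appended to long-lived state, which is the leaky shape this report describes.

```java
import java.util.ArrayList;
import java.util.List;

class LeakSafeParse {
    // Stand-in for TopicProduceData.
    record TopicData(String name) {}

    // Defensive shape: parse into a local list, so an
    // IllegalArgumentException thrown mid-parse leaves the partially
    // built objects unreachable once the stack frame unwinds.
    static List<TopicData> parse(List<String> rawNames) {
        List<TopicData> out = new ArrayList<>();
        for (String raw : rawNames) {
            if (raw.isEmpty()) {
                throw new IllegalArgumentException("empty topic name");
            }
            out.add(new TopicData(raw));
        }
        return out;
    }

    // Leaky shape: appending into a shared, long-lived buffer means the
    // objects built before the exception stay reachable after it.
    static final List<TopicData> retained = new ArrayList<>();

    static void parseLeaky(List<String> rawNames) {
        for (String raw : rawNames) {
            if (raw.isEmpty()) {
                throw new IllegalArgumentException("empty topic name");
            }
            retained.add(new TopicData(raw)); // survives the exception
        }
    }
}
```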
[jira] [Created] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur
hudeqi created KAFKA-15397: -- Summary: Deserializing produce requests may cause memory leaks when exceptions occur Key: KAFKA-15397 URL: https://issues.apache.org/jira/browse/KAFKA-15397 Project: Kafka Issue Type: Bug Components: clients, core Affects Versions: 3.5.1 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15396) Add a metric indicating the version of the current running kafka server
hudeqi created KAFKA-15396: -- Summary: Add a metric indicating the version of the current running kafka server Key: KAFKA-15396 URL: https://issues.apache.org/jira/browse/KAFKA-15396 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 3.5.1 Reporter: hudeqi Assignee: hudeqi At present, there is no metric that exposes the Kafka version a broker is running. If multiple Kafka versions are deployed in a cluster for various reasons, it is difficult to get an intuitive picture of the version distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756895#comment-17756895 ] hudeqi commented on KAFKA-14912: But how do we measure and get the size of each entry? [~divijvaidya] > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > > Context: We need to make the 1024 value here [1] dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters
[ https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754541#comment-17754541 ] hudeqi commented on KAFKA-15172: This KIP has received little discussion; what should I do next? [~mimaison] > Allow exact mirroring of ACLs between clusters > -- > > Key: KAFKA-15172 > URL: https://issues.apache.org/jira/browse/KAFKA-15172 > Project: Kafka > Issue Type: Task > Components: mirrormaker >Reporter: Mickael Maison >Assignee: hudeqi >Priority: Major > Labels: kip-965 > > When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The > rationale is to prevent other clients from producing to remote topics. > However in disaster recovery scenarios, where the target cluster is not used > and is just a "hot standby", it would be preferable to have exactly the same > ACLs on both clusters to speed up failover. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754089#comment-17754089 ] hudeqi commented on KAFKA-14912: Hi [~divijvaidya], Caffeine does support “a max size by bytes”. If I use the existing configuration "remote.log.index.file.cache.total.size.bytes", it will be troublesome to calculate the size of each entry, but if it is “max size by entry”, it will be easier to implement. How do you think this should be implemented? > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > > Context: We need to make the 1024 value here [1] dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
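For reference, Caffeine expresses a byte-based bound with `Caffeine.newBuilder().maximumWeight(totalBytes).weigher(...)`, where the weigher returns each entry's estimated size. The dependency-free sketch below shows the same idea with a hand-rolled, access-ordered LRU map; `ByteBoundedCache` is purely illustrative and is not Kafka's actual `RemoteIndexCache`.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Sketch of a "max size by bytes" cache: each entry's size is estimated
// by a caller-supplied weigher, and least-recently-used entries are
// evicted until the total estimated size fits the byte budget.
class ByteBoundedCache<K, V> {
    private final long maxBytes;
    private final ToLongFunction<V> weigher; // estimates an entry's size
    private long currentBytes = 0;
    private final LinkedHashMap<K, V> map =
            new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

    ByteBoundedCache(long maxBytes, ToLongFunction<V> weigher) {
        this.maxBytes = maxBytes;
        this.weigher = weigher;
    }

    synchronized void put(K key, V value) {
        V old = map.put(key, value);
        if (old != null) currentBytes -= weigher.applyAsLong(old);
        currentBytes += weigher.applyAsLong(value);
        // Evict least-recently-used entries until under the byte budget.
        Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<K, V> e = it.next();
            if (e.getKey().equals(key)) continue; // keep the newest entry
            currentBytes -= weigher.applyAsLong(e.getValue());
            it.remove();
        }
    }

    synchronized V get(K key) { return map.get(key); }

    synchronized int size() { return map.size(); }
}
```

With Caffeine itself, the weigher would answer the "how to measure the size of the entry" question, e.g. by summing the sizes of an entry's index files.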
[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters
[ https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752398#comment-17752398 ] hudeqi commented on KAFKA-15172: Hi, I have initiated a KIP discussion: [https://lists.apache.org/thread/j17cdrpqxpn924cztdgxxm86qhf89sop] [~mimaison] [~ChrisEgerton] > Allow exact mirroring of ACLs between clusters > -- > > Key: KAFKA-15172 > URL: https://issues.apache.org/jira/browse/KAFKA-15172 > Project: Kafka > Issue Type: Task > Components: mirrormaker >Reporter: Mickael Maison >Assignee: hudeqi >Priority: Major > Labels: kip-965 > > When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The > rationale is to prevent other clients from producing to remote topics. > However in disaster recovery scenarios, where the target cluster is not used > and is just a "hot standby", it would be preferable to have exactly the same > ACLs on both clusters to speed up failover. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-14912: -- Assignee: hudeqi > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > > Context: We need to make the 1024 value here [1] dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752229#comment-17752229 ] hudeqi commented on KAFKA-14912: OK, I'll take a look in the next few days, thanks. > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Priority: Major > > Context: We need to make the 1024 value here [1] dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15172) Allow exact mirroring of ACLs between clusters
[ https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15172: --- Labels: kip-965 (was: needs-kip) > Allow exact mirroring of ACLs between clusters > -- > > Key: KAFKA-15172 > URL: https://issues.apache.org/jira/browse/KAFKA-15172 > Project: Kafka > Issue Type: Task > Components: mirrormaker >Reporter: Mickael Maison >Assignee: hudeqi >Priority: Major > Labels: kip-965 > > When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The > rationale is to prevent other clients from producing to remote topics. > However in disaster recovery scenarios, where the target cluster is not used > and is just a "hot standby", it would be preferable to have exactly the same > ACLs on both clusters to speed up failover. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15129) Clean up all metrics that were forgotten to be closed
[ https://issues.apache.org/jira/browse/KAFKA-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15129: --- Description: In the current Kafka code, there are still many module metrics that are not closed when their components stop, although some cases have been fixed, such as KAFKA-14866 and KAFKA-14868, etc. These metric leaks may lead to potential OOM risks, and in the unit tests and integration tests in the code there are also many `close` calls that do not remove their metrics, which also causes CI test instability. By cleaning up these leaked metrics, these risks can be eliminated and the safety and stability of the code enhanced. Here I will find all the metrics that are forgotten to be closed in the current version and submit subtasks to fix them. was: In the current Kafka code, there are still many module metrics that are not closed when their components stop, although some cases have been fixed, such as KAFKA-14866 and KAFKA-14868, etc. Here I will find all the metrics that are forgotten to be closed in the current version and submit subtasks to fix them. > Clean up all metrics that were forgotten to be closed > - > > Key: KAFKA-15129 > URL: https://issues.apache.org/jira/browse/KAFKA-15129 > Project: Kafka > Issue Type: Improvement > Components: controller, core, log >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > > In the current Kafka code, there are still many module metrics that are > not closed when their components stop, although some cases have been fixed, > such as KAFKA-14866 and KAFKA-14868, etc. > These metric leaks may lead to potential OOM risks, and in the unit tests > and integration tests in the code there are also many `close` calls that do > not remove their metrics, which also causes CI test instability. 
By > cleaning up these leaked metrics, these risks can be eliminated and the > safety and stability of the code enhanced. > Here I will find all the metrics that are forgotten to be closed in the > current version and submit subtasks to fix them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
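The cleanup pattern the ticket argues for can be sketched as below. The `MetricsRegistry` and `FetcherStats` names are hypothetical simplifications of Kafka's metrics groups; the point is that a component remembers every metric it registers and removes them all in `close()`, so that repeatedly creating and stopping the component cannot leak gauges.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Simplified stand-in for a metrics group/registry.
class MetricsRegistry {
    final Map<String, Supplier<Long>> gauges = new HashMap<>();

    void newGauge(String name, Supplier<Long> g) { gauges.put(name, g); }

    void removeMetric(String name) { gauges.remove(name); }
}

// A component that tracks what it registered so close() can undo it.
class FetcherStats implements AutoCloseable {
    private final MetricsRegistry registry;
    private final Set<String> registered = new HashSet<>();

    FetcherStats(MetricsRegistry registry) {
        this.registry = registry;
        register("requests-rate", () -> 0L);
        register("bytes-rate", () -> 0L);
    }

    private void register(String name, Supplier<Long> gauge) {
        registry.newGauge(name, gauge);
        registered.add(name); // remember it for cleanup
    }

    @Override
    public void close() {
        // The step that leaking components forget.
        registered.forEach(registry::removeMetric);
        registered.clear();
    }
}
```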
[jira] [Commented] (KAFKA-15194) Rename local tiered storage segment with offset as prefix for easy navigation
[ https://issues.apache.org/jira/browse/KAFKA-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744148#comment-17744148 ] hudeqi commented on KAFKA-15194: Thanks [~divijvaidya]. I don't know much about "tiered storage" yet, and I have already taken over another issue, so this one might not be completed quickly if I take it over as well. > Rename local tiered storage segment with offset as prefix for easy navigation > - > > Key: KAFKA-15194 > URL: https://issues.apache.org/jira/browse/KAFKA-15194 > Project: Kafka > Issue Type: Task >Reporter: Kamal Chandraprakash >Priority: Minor > Labels: newbie > > In LocalTieredStorage, which is an implementation of RemoteStorageManager, > segments are saved with a random UUID. This makes navigating to a particular > segment harder. To navigate to a given segment by offset, prepend the offset > information to the segment filename. > https://github.com/apache/kafka/pull/13837#discussion_r1258896009 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
[ https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-14467: -- Assignee: hudeqi > Add a test to validate the replica state after processing the > OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state > -- > > Key: KAFKA-14467 > URL: https://issues.apache.org/jira/browse/KAFKA-14467 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Assignee: hudeqi >Priority: Major > > [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072] > > https://github.com/apache/kafka/pull/11390/files#r1050045012 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
[ https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744147#comment-17744147 ] hudeqi commented on KAFKA-14467: Thanks, I've taken it over for now; I'll look at this later. > Add a test to validate the replica state after processing the > OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state > -- > > Key: KAFKA-14467 > URL: https://issues.apache.org/jira/browse/KAFKA-14467 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Priority: Major > > [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072] > > https://github.com/apache/kafka/pull/11390/files#r1050045012 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
[ https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744087#comment-17744087 ] hudeqi commented on KAFKA-14467: Isn't [~showuon] in charge of this issue? > Add a test to validate the replica state after processing the > OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state > -- > > Key: KAFKA-14467 > URL: https://issues.apache.org/jira/browse/KAFKA-14467 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Priority: Major > > [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072] > > https://github.com/apache/kafka/pull/11390/files#r1050045012 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters
[ https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742994#comment-17742994 ] hudeqi commented on KAFKA-15172: Hi [~mimaison] [~ChrisEgerton], I am currently using MM2 to implement disaster recovery between active and standby clusters, and it is already running in production, so I am very interested in this issue. Could it be assigned to me (if that is not suitable, you can always reassign it)? For this problem, I think we should synchronize not only the topic write ACLs but also the group ACLs and even the user SCRAM credentials, right? I can initiate a KIP once this is confirmed, thanks! > Allow exact mirroring of ACLs between clusters > -- > > Key: KAFKA-15172 > URL: https://issues.apache.org/jira/browse/KAFKA-15172 > Project: Kafka > Issue Type: Task > Components: mirrormaker >Reporter: Mickael Maison >Assignee: hudeqi >Priority: Major > Labels: needs-kip > > When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The > rationale is to prevent other clients from producing to remote topics. > However in disaster recovery scenarios, where the target cluster is not used > and is just a "hot standby", it would be preferable to have exactly the same > ACLs on both clusters to speed up failover. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15172) Allow exact mirroring of ACLs between clusters
[ https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-15172: -- Assignee: hudeqi > Allow exact mirroring of ACLs between clusters > -- > > Key: KAFKA-15172 > URL: https://issues.apache.org/jira/browse/KAFKA-15172 > Project: Kafka > Issue Type: Task > Components: mirrormaker >Reporter: Mickael Maison >Assignee: hudeqi >Priority: Major > Labels: needs-kip > > When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The > rationale is to prevent other clients from producing to remote topics. > However in disaster recovery scenarios, where the target cluster is not used > and is just a "hot standby", it would be preferable to have exactly the same > ACLs on both clusters to speed up failover. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.
[ https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742624#comment-17742624 ] hudeqi commented on KAFKA-14912: Hi [~divijvaidya] [~manyanda], I am also interested; if you are busy, I can provide backup support. :D > Introduce a configuration for remote index cache size, preferably a dynamic > config. > --- > > Key: KAFKA-14912 > URL: https://issues.apache.org/jira/browse/KAFKA-14912 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Satish Duggana >Assignee: Manyanda Chitimbo >Priority: Major > > Context: We need to make the 1024 value here [1] dynamically configurable > [1] > https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`
[ https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi resolved KAFKA-15139. Resolution: Fixed > Optimize the performance of `Set.removeAll(List)` in > `MirrorCheckpointConnector` > > > Key: KAFKA-15139 > URL: https://issues.apache.org/jira/browse/KAFKA-15139 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > > This is the Javadoc hint of the `removeAll` method in `Set`: > _This implementation determines which is the smaller of this set and the > specified collection, by invoking the size method on each. If this set has > fewer elements, then the implementation iterates over this set, checking each > element returned by the iterator in turn to see if it is contained in the > specified collection. If it is so contained, it is removed from this set with > the iterator's remove method. If the specified collection has fewer elements, > then the implementation iterates over the specified collection, removing from > this set each element returned by the iterator, using this set's remove > method._ > That said, assume that _M_ is the number of elements in the set and _N_ is > the number of elements in the List. If the type of the specified collection > is `List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ > (because the time complexity of searching in a List is {_}O(N){_}); on the > contrary, if {_}N<M{_}, it will search in the `Set`, so the time complexity > is {_}O(N){_}. > In `MirrorCheckpointConnector`, the `refreshConsumerGroups` method is > repeatedly called in a daemon thread. There are two `removeAll` calls in this > method. Logically, in any round where the number of groups in the source > cluster simply increases or decreases, the two > `removeAll` calls will always hit the _O(MN)_ situation > mentioned above. 
Therefore, it is better to change all the variables here to > the `Set` type to avoid this low performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
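The complexity asymmetry described above can be demonstrated directly: `AbstractSet.removeAll` chooses which side to iterate by comparing sizes, and when it iterates the set it calls `contains(...)` on the argument, which is O(N) per lookup for a `List` but O(1) for a `HashSet`. The sketch below (with hypothetical helper names) shows the one-line conversion that avoids the O(MN) path.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RemoveAllDemo {
    // May hit the O(M*N) path: when the set is not larger than the
    // argument, removeAll iterates the set and calls gone.contains(...),
    // which is a linear scan for a List.
    static Set<Integer> removeViaList(Set<Integer> groups, List<Integer> gone) {
        Set<Integer> copy = new HashSet<>(groups);
        copy.removeAll(gone);
        return copy;
    }

    // The fix: wrap the List in a HashSet so contains(...) is O(1),
    // making removeAll O(M + N) in either direction.
    static Set<Integer> removeViaSet(Set<Integer> groups, List<Integer> gone) {
        Set<Integer> copy = new HashSet<>(groups);
        copy.removeAll(new HashSet<>(gone));
        return copy;
    }
}
```

Both methods compute the same result; only the lookup cost inside `removeAll` differs.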
[jira] [Updated] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`
[ https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15139: --- Description: This is the Javadoc hint of the `removeAll` method in `Set`: _This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method._ That said, assume that _M_ is the number of elements in the set and _N_ is the number of elements in the List. If the type of the specified collection is `List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ (because the time complexity of searching in a List is {_}O(N){_}); on the contrary, if {_}N<M{_}, it will search in the `Set`, so the time complexity is {_}O(N){_}. > Optimize the performance of `Set.removeAll(List)` in > `MirrorCheckpointConnector` > > > Key: KAFKA-15139 > URL: https://issues.apache.org/jira/browse/KAFKA-15139 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > > This is the Javadoc hint of the `removeAll` method in `Set`: > _This implementation determines which is the smaller of this set and the > specified collection, by invoking the size method on each. If this set has > fewer elements, then the implementation iterates over this set, checking each > element returned by the iterator in turn to see if it is contained in the > specified collection. If it is so contained, it is removed from this set with > the iterator's remove method. 
If the specified collection has fewer elements, > then the implementation iterates over the specified collection, removing from > this set each element returned by the iterator, using this set's remove > method._ > That said, assume that _M_ is the number of elements in the set and _N_ is > the number of elements in the List. If the type of the specified collection > is `List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ > (because the time complexity of searching in a List is {_}O(N){_}); on the > contrary, if {_}N<M{_}, it will search in the `Set`, so the time complexity > is {_}O(N){_}. > In `MirrorCheckpointConnector`, the `refreshConsumerGroups` method is > repeatedly called in a daemon thread. There are two `removeAll` calls in this > method. Logically, in any round where the number of groups in the source > cluster simply increases or decreases, the two > `removeAll` calls will always hit the _O(MN)_ situation > mentioned above. Therefore, it is better to change all the variables here to > the `Set` type to avoid this low performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`
[ https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15139: --- Description: This is the Javadoc hint of the `removeAll` method in `Set`: _This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method._ That said, assume that M is the number of elements in the set and N is the number of elements in the List. If the type of the specified collection is `List` and M<=N, then the time complexity of `removeAll` is O(MN) (because the time complexity of searching in a List is O(N)); on the contrary, if N<M, it will search in the `Set`, so the time complexity is O(N). > Optimize the performance of `Set.removeAll(List)` in > `MirrorCheckpointConnector` > > > Key: KAFKA-15139 > URL: https://issues.apache.org/jira/browse/KAFKA-15139 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > > This is the Javadoc hint of the `removeAll` method in `Set`: > _This implementation determines which is the smaller of this set and the > specified collection, by invoking the size method on each. If this set has > fewer elements, then the implementation iterates over this set, checking each > element returned by the iterator in turn to see if it is contained in the > specified collection. If it is so contained, it is removed from this set with > the iterator's remove method. 
If the specified collection has fewer elements, > then the implementation iterates over the specified collection, removing from > this set each element returned by the iterator, using this set's remove > method._ > That said, assume that M is the number of elements in the set and N is the > number of elements in the List. If the type of the specified collection is > `List` and M<=N, then the time complexity of `removeAll` is O(MN) (because > the time complexity of searching in a List is O(N)); on the contrary, if N<M, > it will search in the `Set`, so the time complexity is O(N). > In -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`
hudeqi created KAFKA-15139: -- Summary: Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector` Key: KAFKA-15139 URL: https://issues.apache.org/jira/browse/KAFKA-15139 Project: Kafka Issue Type: Improvement Components: mirrormaker Affects Versions: 3.5.0 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15134) Enrich the prompt reason in CommitFailedException
[ https://issues.apache.org/jira/browse/KAFKA-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15134: --- Attachment: WechatIMG31.jpeg > Enrich the prompt reason in CommitFailedException > - > > Key: KAFKA-15134 > URL: https://issues.apache.org/jira/browse/KAFKA-15134 > Project: Kafka > Issue Type: Improvement > Components: clients >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG31.jpeg > > > Firstly, let me post the exception log of the client running: > _"org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be > completed since the group has already rebalanced and assigned the partitions > to another member. This means that the time between subsequent calls to > poll() was longer than the configured max.poll.interval.ms, which typically > implies that the poll loop is spending too much time message processing. You > can address this either by increasing max.poll.interval.ms or by reducing the > maximum size of batches returned in poll() with max.poll.records."_ > This is the client exception log provided to us by the online business. > First, we confirmed that there is no issue on the server brokers. Based on > the exception message, one might conclude that the settings of > "max.poll.interval.ms" and "max.poll.records" were unreasonable, but the > processing time logged by the business was much shorter than > "max.poll.interval.ms", so we're sure that's not the reason. In > the end, it was found that there is such a case: the business uses a "group > id" to subscribe to some partitions of "topic1" through the simple consumer > mode, and sets "auto.offset.reset=false". When the same "group id" is used > to start a high-level consumer on "topic2", the original service using the > simple consumer throws CommitFailedException. I have reproduced this process. 
> In fact, this is not a bug but a problem with how the client is used; > however, I think the exception message of `CommitFailedException` gives > incomplete and misleading guidance, so I have enriched the message to mention > that such special situations may exist. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15134) Enrich the prompt reason in CommitFailedException
[ https://issues.apache.org/jira/browse/KAFKA-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15134: --- Description: Firstly, let me post the exception log of the client running: _"org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records."_ This is the client exception log provided to us by the online business. First, we confirmed that there is no issue on the server brokers. Then according to the exception prompt information, it may be judged that the settings of "max.poll.interval.ms" and "max.poll.records" may be unreasonable, but the actual processing time of business printing is much shorter than "max.poll.interval.ms", so We're sure that's not the reason. In the end, it was found that there is such a case: the business uses a "group id" to subscribe to some partitions of "topic1" through the simple consumer mode, and set "auto.offset.reset=false". When using the same "group id" name to start a high level consumer on "topic2", the original service using simple consumer throws CommitFailedException. And I have reproduced this process. In fact, this is not a bug, but a problem with the way the client is used, but I think the exception message of `CommitFailedException` may have imperfect and misleading guidance, so I have enriched the message that there may be special situations. 
> Enrich the prompt reason in CommitFailedException > - > > Key: KAFKA-15134 > URL: https://issues.apache.org/jira/browse/KAFKA-15134 > Project: Kafka > Issue Type: Improvement > Components: clients >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major >
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15134) Enrich the prompt reason in CommitFailedException
hudeqi created KAFKA-15134: -- Summary: Enrich the prompt reason in CommitFailedException Key: KAFKA-15134 URL: https://issues.apache.org/jira/browse/KAFKA-15134 Project: Kafka Issue Type: Improvement Components: clients Affects Versions: 3.5.0 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15129) Clean up all metrics that were forgotten to be closed
[ https://issues.apache.org/jira/browse/KAFKA-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15129: --- Description: In the current Kafka code, there are still many module metrics that are never closed when their owning component stops, although some cases have already been fixed, such as KAFKA-14866 and KAFKA-14868. Here I will find all the metrics in the current version that are not closed on shutdown and submit fixes for them as subtasks. > Clean up all metrics that were forgotten to be closed > - > > Key: KAFKA-15129 > URL: https://issues.apache.org/jira/browse/KAFKA-15129 > Project: Kafka > Issue Type: Improvement > Components: controller, core, log >Affects Versions: 3.5.0 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15129) Clean up all metrics that were forgotten to be closed
hudeqi created KAFKA-15129: -- Summary: Clean up all metrics that were forgotten to be closed Key: KAFKA-15129 URL: https://issues.apache.org/jira/browse/KAFKA-15129 Project: Kafka Issue Type: Improvement Components: controller, core, log Affects Versions: 3.5.0 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
[ https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15086: --- Labels: kip-943 (was: ) > The unreasonable segment size setting of the internal topics in MM2 may cause > the worker startup time to be too long > > > Key: KAFKA-15086 > URL: https://issues.apache.org/jira/browse/KAFKA-15086 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Labels: kip-943 > Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg > > > If the config 'segment.bytes' of the MM2-related internal topics (such as offset.storage.topic, config.storage.topic, status.storage.topic) follows the broker default or is set even larger, then when the MM cluster runs many complicated tasks, and especially when the log volume of 'offset.storage.topic' is very large, the restart speed of the MM workers suffers.
> After investigation, the reason is that a consumer must be started at worker startup to read all the data in 'offset.storage.topic'. Although this topic is set to compact, if the segment size is large, such as the default of 1 GB, the topic may hold tens of gigabytes of data that cannot be compacted (because the active segment cannot be cleaned) and must be read from the earliest offset, which takes a long time. In our online environment this topic stored 13 GB of data and it took nearly half an hour to consume it all, so the worker could not start and execute tasks for a long time. > Of course, the number of consumer threads could also be increased, but I think it is easier to reduce the segment size, for example to the default value used by __consumer_offsets: 100 MB. > > The first attached picture shows the log size stored in the internal topic, the second shows the time when reading of 'offset.storage.topic' started, and the third shows the time when reading finished. It took about 23 minutes in total. -- This message was sent by Atlassian Jira (v8.20.10#820010)
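A quick operational sketch of the mitigation suggested above: lower `segment.bytes` on the MM2/Connect offsets topic to the 100 MB default used by __consumer_offsets. This is a sketch only; the topic name and bootstrap server below are placeholders, since the real offsets topic name depends on your MM2/Connect configuration.

```shell
# Shrink the segment size of the MM2 offsets topic so the log cleaner can
# compact everything except one small active segment. Bootstrap server and
# topic name are placeholders; adjust them to your deployment.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name mm2-offsets.source.internal \
  --alter --add-config segment.bytes=104857600
```

Existing large segments become eligible for compaction once they are no longer the active segment, so the benefit appears as the log rolls at the new, smaller size.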
[jira] [Updated] (KAFKA-15119) Support incremental synchronization of topicAcl in MirrorSourceConnector
[ https://issues.apache.org/jira/browse/KAFKA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15119: --- Description: In the "syncTopicAcls" thread of MirrorSourceConnector, the full set of "TopicAclBindings" related to the replicated topics of the source cluster is listed periodically and then applied in full to the target cluster, so a large number of identical "TopicAclBindings" are repeatedly sent through "targetAdminClient". This is redundant. In addition, if too many "TopicAclBindings" are updated at once, the target cluster may take a long time to process the "createAcls" request, which backs up its request queue and further increases the processing delay of other request types. Therefore, "TopicAclBindings" can be handled like the variable "knownConsumerGroups" in MirrorCheckpointConnector: only the incrementally added bindings are synchronized each time, which solves the problems above. > Support incremental synchronization of topicAcl in MirrorSourceConnector > > > Key: KAFKA-15119 > URL: https://issues.apache.org/jira/browse/KAFKA-15119 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
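At its core, the incremental synchronization proposed above is a set difference: remember the bindings already applied and send only the additions. A minimal stand-alone sketch follows; the binding strings are hypothetical placeholders, and real code would compare AclBinding objects before calling the Admin client.

```shell
# Simulate one round of incremental ACL sync: diff the freshly listed
# bindings against the previously known set; only the difference would
# be sent to the target cluster via createAcls.
known=$(mktemp); current=$(mktemp)
printf 'topic1:READ\ntopic1:DESCRIBE\n' | sort > "$known"
printf 'topic1:READ\ntopic1:DESCRIBE\ntopic2:READ\n' | sort > "$current"
added=$(comm -13 "$known" "$current")   # lines present only in "current"
echo "$added"   # → topic2:READ
rm -f "$known" "$current"
```

The same "known set" bookkeeping is what `knownConsumerGroups` does in MirrorCheckpointConnector, which is why the description uses it as the model.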
[jira] [Created] (KAFKA-15119) Support incremental synchronization of topicAcl in MirrorSourceConnector
hudeqi created KAFKA-15119: -- Summary: Support incremental synchronization of topicAcl in MirrorSourceConnector Key: KAFKA-15119 URL: https://issues.apache.org/jira/browse/KAFKA-15119 Project: Kafka Issue Type: Improvement Components: mirrormaker Affects Versions: 3.4.1 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs
[ https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735644#comment-17735644 ] hudeqi commented on KAFKA-15110: Leaving the exception log here so that others can find this problem: [2023-06-21 18:00:59,186] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics) [2023-06-21 18:00:59,188] INFO Broker and topic stats closed (kafka.server.BrokerTopicStats) [2023-06-21 18:00:59,190] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser) [2023-06-21 18:00:59,190] INFO [KafkaServer id=0] shut down completed (kafka.server.KafkaServer) [2023-06-21 18:00:59,190] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$) java.lang.NoSuchMethodError: 'void org.apache.kafka.storage.internals.log.ProducerStateManagerConfig.<init>(int)' at kafka.log.LogManager$.apply(LogManager.scala:1394) at kafka.server.KafkaServer.startup(KafkaServer.scala:279) at kafka.Kafka$.main(Kafka.scala:113) at kafka.Kafka.main(Kafka.scala) [2023-06-21 18:00:59,190] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer) > Wrong version may be run, which will cause to fail to run when there are > multiple version jars under core/build/libs > > > Key: KAFKA-15110 > URL: https://issues.apache.org/jira/browse/KAFKA-15110 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg > > > For example, I build a jar with './gradlew jar' on a 3.5.0 branch and then switch to a 3.6.0 branch and build again. Since "core/build/libs" is git-ignored, jars from both versions end up in this directory. When I then start the Kafka process from the local bin dir, it fails to start because the wrong version is picked up.
> The reason is that kafka-run-class.sh puts every jar in these directories onto the CLASSPATH by default, which is unreasonable behavior. > For details, see the attached screenshots: > Figure 1 shows the abnormal exit caused by the missing ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, while the constructor of ProducerStateManagerConfig in version 3.6.0 takes two parameters. > Figure 2 shows the printed value of CLASSPATH. -- This message was sent by Atlassian Jira (v8.20.10#820010)
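The classpath behavior blamed above is easy to reproduce without Kafka at all: a glob over the build output directory picks up jars from every version ever built there. A simplified, self-contained sketch (directory and jar names are fabricated for illustration; the real script assembles CLASSPATH from several build directories):

```shell
# Emulate kafka-run-class.sh's classpath assembly over a libs directory
# that still contains a jar left over from a previously built branch.
libs=$(mktemp -d)
touch "$libs/kafka_2.13-3.5.0.jar" "$libs/kafka_2.13-3.6.0-SNAPSHOT.jar"
CLASSPATH=""
for jar in "$libs"/*.jar; do        # the glob matches the stale jar too
  CLASSPATH="$CLASSPATH:$jar"
done
echo "$CLASSPATH"   # both versions end up on the classpath; which class
                    # wins is loader-order dependent, hence NoSuchMethodError
```

A fix along these lines would pin the glob to the version being built, e.g. `"$libs"/kafka_2.13-3.6.0*.jar`.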
[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs
[ https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15110: --- Attachment: WechatIMG29.jpeg > Wrong version may be run, which will cause to fail to run when there are > multiple version jars under core/build/libs > > > Key: KAFKA-15110 > URL: https://issues.apache.org/jira/browse/KAFKA-15110 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs
[ https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15110: --- Attachment: WechatIMG28.jpeg > Wrong version may be run, which will cause to fail to run when there are > multiple version jars under core/build/libs > > > Key: KAFKA-15110 > URL: https://issues.apache.org/jira/browse/KAFKA-15110 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs
[ https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15110: --- Description: For example, I build a jar with './gradlew jar' on a 3.5.0 branch and then switch to a 3.6.0 branch and build again. Since "core/build/libs" is git-ignored, jars from both versions end up in this directory. When I then start the Kafka process from the local bin dir, it fails to start because the wrong version is picked up. The reason is that kafka-run-class.sh puts every jar in these directories onto the CLASSPATH by default, which is unreasonable behavior. For details, see the attached screenshots: Figure 1 shows the abnormal exit caused by the missing ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, while the constructor of ProducerStateManagerConfig in version 3.6.0 takes two parameters. Figure 2 shows the printed value of CLASSPATH. > Wrong version may be run, which will cause to fail to run when there are > multiple version jars under core/build/libs > > > Key: KAFKA-15110 > URL: https://issues.apache.org/jira/browse/KAFKA-15110 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs
[ https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15110: --- Summary: Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs (was: Wrong version may be run, which will cause it to fail to run when there are multiple version jars under core/build/libs) > Wrong version may be run, which will cause to fail to run when there are > multiple version jars under core/build/libs > > > Key: KAFKA-15110 > URL: https://issues.apache.org/jira/browse/KAFKA-15110 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15110) Wrong version may be run, which will cause it to fail to run when there are multiple version jars under core/build/libs
hudeqi created KAFKA-15110: -- Summary: Wrong version may be run, which will cause it to fail to run when there are multiple version jars under core/build/libs Key: KAFKA-15110 URL: https://issues.apache.org/jira/browse/KAFKA-15110 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 3.4.1 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
[ https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15086: --- Description: If the config 'segment.bytes' of the MM2-related internal topics (such as offset.storage.topic, config.storage.topic, status.storage.topic) follows the broker default or is set even larger, then when the MM cluster runs many complicated tasks, and especially when the log volume of 'offset.storage.topic' is very large, the restart speed of the MM workers suffers. After investigation, the reason is that a consumer must be started at worker startup to read all the data in 'offset.storage.topic'. Although this topic is set to compact, if the segment size is large, such as the default of 1 GB, the topic may hold tens of gigabytes of data that cannot be compacted (because the active segment cannot be cleaned) and must be read from the earliest offset, which takes a long time. In our online environment this topic stored 13 GB of data and it took nearly half an hour to consume it all, so the worker could not start and execute tasks for a long time. Of course, the number of consumer threads could also be increased, but I think it is easier to reduce the segment size, for example to the default value used by __consumer_offsets: 100 MB. The first attached picture shows the log size stored in the internal topic, the second shows the time when reading of 'offset.storage.topic' started, and the third shows the time when reading finished. It took about 23 minutes in total. > The unreasonable segment size setting of the internal topics in MM2 may cause > the worker startup time to be too long > > > Key: KAFKA-15086 > URL: https://issues.apache.org/jira/browse/KAFKA-15086 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg >
[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
[ https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15086: --- Attachment: WechatIMG366.jpeg > The unreasonable segment size setting of the internal topics in MM2 may cause > the worker startup time to be too long > > > Key: KAFKA-15086 > URL: https://issues.apache.org/jira/browse/KAFKA-15086 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
[ https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15086: --- Attachment: WechatIMG365.jpeg > The unreasonable segment size setting of the internal topics in MM2 may cause > the worker startup time to be too long > > > Key: KAFKA-15086 > URL: https://issues.apache.org/jira/browse/KAFKA-15086 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
[ https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15086: --- Attachment: WechatIMG364.jpeg > The unreasonable segment size setting of the internal topics in MM2 may cause > the worker startup time to be too long > > > Key: KAFKA-15086 > URL: https://issues.apache.org/jira/browse/KAFKA-15086 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Affects Versions: 3.4.1 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long
hudeqi created KAFKA-15086: -- Summary: The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long Key: KAFKA-15086 URL: https://issues.apache.org/jira/browse/KAFKA-15086 Project: Kafka Issue Type: Improvement Components: mirrormaker Affects Versions: 3.4.1 Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead
[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-9613: -- Affects Version/s: 2.6.2 > CorruptRecordException: Found record size 0 smaller than minimum record > overhead > > > Key: KAFKA-9613 > URL: https://issues.apache.org/jira/browse/KAFKA-9613 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.6.2 >Reporter: Amit Khandelwal >Assignee: hudeqi >Priority: Major > > 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] > Error processing fetch with max size 1048576 from consumer on partition > SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, > maxBytes=1048576, currentLeaderEpoch=Optional.empty) > (kafka.server.ReplicaManager) > 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: > Found record size 0 smaller than minimum record overhead (14) in file > /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log. > 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager > brokerId=0] Removed 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager) > 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: > Member > _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f > in group y_011 has failed, removing it from the group > (kafka.coordinator.group.GroupCoordinator) > > [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead
[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-9613: -- Component/s: core > CorruptRecordException: Found record size 0 smaller than minimum record > overhead > > > Key: KAFKA-9613 > URL: https://issues.apache.org/jira/browse/KAFKA-9613 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.6.2 >Reporter: Amit Khandelwal >Assignee: hudeqi >Priority: Major > > 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] > Error processing fetch with max size 1048576 from consumer on partition > SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, > maxBytes=1048576, currentLeaderEpoch=Optional.empty) > (kafka.server.ReplicaManager) > 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: > Found record size 0 smaller than minimum record overhead (14) in file > /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log. > 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager > brokerId=0] Removed 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager) > 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: > Member > _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f > in group y_011 has failed, removing it from the group > (kafka.coordinator.group.GroupCoordinator) > > [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead
[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-9613: - Assignee: hudeqi > CorruptRecordException: Found record size 0 smaller than minimum record > overhead > > > Key: KAFKA-9613 > URL: https://issues.apache.org/jira/browse/KAFKA-9613 > Project: Kafka > Issue Type: Bug >Reporter: Amit Khandelwal >Assignee: hudeqi >Priority: Major > > 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] > Error processing fetch with max size 1048576 from consumer on partition > SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, > maxBytes=1048576, currentLeaderEpoch=Optional.empty) > (kafka.server.ReplicaManager) > 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: > Found record size 0 smaller than minimum record overhead (14) in file > /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log. > 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager > brokerId=0] Removed 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager) > 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: > Member > _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f > in group y_011 has failed, removing it from the group > (kafka.coordinator.group.GroupCoordinator) > > [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-6668) Broker crashes on restart ,got a CorruptRecordException: Record size is smaller than minimum record overhead(14)
[ https://issues.apache.org/jira/browse/KAFKA-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-6668: - Assignee: hudeqi > Broker crashes on restart ,got a CorruptRecordException: Record size is > smaller than minimum record overhead(14) > > > Key: KAFKA-6668 > URL: https://issues.apache.org/jira/browse/KAFKA-6668 > Project: Kafka > Issue Type: Bug > Components: log > Affects Versions: 0.11.0.1 > Environment: Linux version: > 3.10.0-514.26.2.el7.x86_64 (mockbuild@cgslv5.buildsys213) (gcc version 4.8.5 > 20150623 (Red Hat 4.8.5-11)) > docker version: > Client: > Version: 1.12.6 > API version: 1.24 > Go version: go1.7.5 > Git commit: ad3fef854be2172454902950240a9a9778d24345 > Built: Mon Jan 15 22:01:14 2018 > OS/Arch: linux/amd64 > Reporter: little brother ma > Assignee: hudeqi > Priority: Major > > There is a Kafka cluster with one broker running in a Docker container. Because > the disk filled up, the container crashed and was restarted, then crashed > again... > Log when the disk was full: > > {code:java} > [2018-03-14 00:11:40,764] INFO Rolled new log segment for 'oem-debug-log-1' > in 1 ms.
(kafka.log.Log) [2018-03-14 00:11:40,765] ERROR Uncaught exception > in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) > java.io.IOException: I/O error at sun.nio.ch.FileDispatcherImpl.force0(Native > Method) at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) at > sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) at > org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:162) at > kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:377) at > kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at > kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at > kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at > kafka.log.LogSegment.flush(LogSegment.scala:376) at > kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1312) at > kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1311) at > scala.collection.Iterator$class.foreach(Iterator.scala:891) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at > scala.collection.AbstractIterable.foreach(Iterable.scala:54) at > kafka.log.Log.flush(Log.scala:1311) at > kafka.log.Log$$anonfun$roll$1.apply$mcV$sp(Log.scala:1283) at > kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at > java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) [2018-03-14 
00:11:56,514] ERROR > [KafkaApi-7] Error when handling request > {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=oem-debug-log,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576},{partition=1,fetch_offset=131382630,max_bytes=1048576}]}]} > (kafka.server.KafkaApis) > org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller > than minimum record overhead (14). > {code} > > > > Log after the disk-full issue was resolved and Kafka restarted: > > {code:java} > [2018-03-15 23:00:08,998] WARN Found a corrupted index file due to > requirement failed: Corrupt index found, index file > (/kafka/kafka-logs/__consumer_offsets-19/03396188.index) has non-zero size but the last offset > is 3396188 which is no larger than the base offset 3396188.}. deleting > /kafka/kafka-logs/__consumer_offsets-19/03396188.timeindex, > /kafka/kafka-logs/__consumer_offsets-19/03396188.index, and > /kafka/kafka-logs/__consumer_offsets-19/03396188.txnindex and > rebuilding index... (kafka.log.Log) > [2018-03-15 23:00:08,999] INFO Loading producer state from snapshot file > '/kafka/kafka-logs/__consumer_offsets-19/03396188.snapshot' for > partition __consumer_offsets-19 (kafka.log.ProducerStateManager) > [2018-03-15 23:00:09,242] INFO Recovering unflushed segment 3396188 in log > __consumer_offsets-19. (kafka.log.Log) > [2018-03-15 23:00:09,243] INFO Loading producer state from snapshot file > '/kafka/kafka-l
[jira] [Commented] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead
[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731863#comment-17731863 ] hudeqi commented on KAFKA-9613: --- I have also encountered this problem online. Based on the time point and system logs, it is related to disk problems. > CorruptRecordException: Found record size 0 smaller than minimum record > overhead > > > Key: KAFKA-9613 > URL: https://issues.apache.org/jira/browse/KAFKA-9613 > Project: Kafka > Issue Type: Bug >Reporter: Amit Khandelwal >Priority: Major > > 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] > Error processing fetch with max size 1048576 from consumer on partition > SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, > maxBytes=1048576, currentLeaderEpoch=Optional.empty) > (kafka.server.ReplicaManager) > 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: > Found record size 0 smaller than minimum record overhead (14) in file > /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log. > 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager > brokerId=0] Removed 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager) > 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: > Member > _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f > in group y_011 has failed, removing it from the group > (kafka.coordinator.group.GroupCoordinator) > > [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-6668) Broker crashes on restart ,got a CorruptRecordException: Record size is smaller than minimum record overhead(14)
[ https://issues.apache.org/jira/browse/KAFKA-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731862#comment-17731862 ] hudeqi commented on KAFKA-6668: --- I have also encountered this issue online; based on the timing and the system logs, it appears to be related to disk problems. > Broker crashes on restart ,got a CorruptRecordException: Record size is > smaller than minimum record overhead(14) > > > Key: KAFKA-6668 > URL: https://issues.apache.org/jira/browse/KAFKA-6668 > Project: Kafka > Issue Type: Bug > Components: log > Affects Versions: 0.11.0.1 > Environment: Linux version: > 3.10.0-514.26.2.el7.x86_64 (mockbuild@cgslv5.buildsys213) (gcc version 4.8.5 > 20150623 (Red Hat 4.8.5-11)) > docker version: > Client: > Version: 1.12.6 > API version: 1.24 > Go version: go1.7.5 > Git commit: ad3fef854be2172454902950240a9a9778d24345 > Built: Mon Jan 15 22:01:14 2018 > OS/Arch: linux/amd64 > Reporter: little brother ma > Priority: Major > > There is a Kafka cluster with one broker running in a Docker container. Because > the disk filled up, the container crashed and was restarted, then crashed > again... > Log when the disk was full: > > {code:java} > [2018-03-14 00:11:40,764] INFO Rolled new log segment for 'oem-debug-log-1' > in 1 ms.
(kafka.log.Log) [2018-03-14 00:11:40,765] ERROR Uncaught exception > in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) > java.io.IOException: I/O error at sun.nio.ch.FileDispatcherImpl.force0(Native > Method) at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) at > sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) at > org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:162) at > kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:377) at > kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at > kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at > kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at > kafka.log.LogSegment.flush(LogSegment.scala:376) at > kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1312) at > kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1311) at > scala.collection.Iterator$class.foreach(Iterator.scala:891) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at > scala.collection.AbstractIterable.foreach(Iterable.scala:54) at > kafka.log.Log.flush(Log.scala:1311) at > kafka.log.Log$$anonfun$roll$1.apply$mcV$sp(Log.scala:1283) at > kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at > java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) [2018-03-14 
00:11:56,514] ERROR > [KafkaApi-7] Error when handling request > {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=oem-debug-log,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576},{partition=1,fetch_offset=131382630,max_bytes=1048576}]}]} > (kafka.server.KafkaApis) > org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller > than minimum record overhead (14). > {code} > > > > Log after the disk-full issue was resolved and Kafka restarted: > > {code:java} > [2018-03-15 23:00:08,998] WARN Found a corrupted index file due to > requirement failed: Corrupt index found, index file > (/kafka/kafka-logs/__consumer_offsets-19/03396188.index) has non-zero size but the last offset > is 3396188 which is no larger than the base offset 3396188.}. deleting > /kafka/kafka-logs/__consumer_offsets-19/03396188.timeindex, > /kafka/kafka-logs/__consumer_offsets-19/03396188.index, and > /kafka/kafka-logs/__consumer_offsets-19/03396188.txnindex and > rebuilding index... (kafka.log.Log) > [2018-03-15 23:00:08,999] INFO Loading producer state from snapshot file > '/kafka/kafka-logs/__consumer_offsets-19/03396188.snapshot' for > partition __consumer_offsets-19 (kafka.log.ProducerStateManager) > [2018-03-15 23:00:09,242] INFO Recovering unflushed segment 3396188 in log > __consumer_offs
[jira] [Commented] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME
[ https://issues.apache.org/jira/browse/KAFKA-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730100#comment-17730100 ] hudeqi commented on KAFKA-15068: @[~ChrisEgerton] Hello, if you have time, please take a look at this issue. > Incorrect replication Latency may be calculated when the timestamp of the > record is of type CREATE_TIME > --- > > Key: KAFKA-15068 > URL: https://issues.apache.org/jira/browse/KAFKA-15068 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > When MM2 replicates topics between Kafka clusters and the timestamp type of the source records is CREATE_TIME, the timestamp may not be the actual creation time (that is, the append time): the producer may set a timestamp far behind the current time, which often happens in production, and in that case the calculated replication latency is far too large. > Replication latency should reflect the performance of replication and should not be distorted by such abnormal records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME
[ https://issues.apache.org/jira/browse/KAFKA-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-15068: --- Description: When MM2 replicates topics between Kafka clusters and the timestamp type of the source records is CREATE_TIME, the timestamp may not be the actual creation time (that is, the append time): the producer may set a timestamp far behind the current time, which often happens in production, and in that case the calculated replication latency is far too large. Replication latency should reflect the performance of replication and should not be distorted by such abnormal records. > Incorrect replication Latency may be calculated when the timestamp of the > record is of type CREATE_TIME > --- > > Key: KAFKA-15068 > URL: https://issues.apache.org/jira/browse/KAFKA-15068 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > When MM2 replicates topics between Kafka clusters and the timestamp type of the source records is CREATE_TIME, the timestamp may not be the actual creation time (that is, the append time): the producer may set a timestamp far behind the current time, which often happens in production, and in that case the calculated replication latency is far too large. > Replication latency should reflect the performance of replication and should not be distorted by such abnormal records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
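The distortion described in KAFKA-15068 can be sketched in a few lines of plain Python. This is not MM2's code: replication latency is modeled here as arrival time minus record timestamp, and `append_time_hint_ms` is a hypothetical correction input, not an MM2 API.

```python
# Sketch: how a producer-controlled CREATE_TIME timestamp inflates the
# measured replication latency, and what correcting with the true source
# append time (if it were available) would look like.

def replication_latency_ms(arrival_ms, record_timestamp_ms, append_time_hint_ms=None):
    """Naive latency, optionally corrected with a source append-time hint."""
    base = append_time_hint_ms if append_time_hint_ms is not None else record_timestamp_ms
    return arrival_ms - base

# A record whose CREATE_TIME was set one hour in the past by the producer,
# but which was actually appended at `now` and replicated 50 ms later:
now = 1_700_000_000_000
stale = replication_latency_ms(now + 50, now - 3_600_000)   # looks like ~1 hour
real = replication_latency_ms(now + 50, now - 3_600_000, append_time_hint_ms=now)
print(stale, real)  # 3600050 50
```

The metric intended to measure replication performance reports an hour of "latency" even though the record crossed clusters in 50 ms, which is exactly the inflation the issue describes.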
[jira] [Created] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME
hudeqi created KAFKA-15068: -- Summary: Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME Key: KAFKA-15068 URL: https://issues.apache.org/jira/browse/KAFKA-15068 Project: Kafka Issue Type: Improvement Components: mirrormaker Reporter: hudeqi Assignee: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread
[ https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721163#comment-17721163 ] hudeqi commented on KAFKA-14979: [GitHub Pull Request #13692|https://github.com/apache/kafka/pull/13692] has gone stale. > Incorrect lag was calculated when markPartitionsForTruncation in > ReplicaAlterLogDirsThread > -- > > Key: KAFKA-14979 > URL: https://issues.apache.org/jira/browse/KAFKA-14979 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 3.3.2 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > When the partitions of a ReplicaFetcherThread finish truncating, the ReplicaAlterLogDirsThread to which those partitions belong must be marked for truncation. The lag in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If truncationOffset is smaller than the original state.fetchOffset, that lag is no longer correct and needs to be updated: it should be the original lag plus the difference between the original state.fetchOffset and truncationOffset. > If the stale lag is used, ReplicaAlterLogDirsThread may set a wrong lag in fetcherLagStats when executing "processFetchRequest", so it may be more reasonable to recalculate the lag based on the new state.fetchOffset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread
[ https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-14979: --- Description: When the partitions of a ReplicaFetcherThread finish truncating, the ReplicaAlterLogDirsThread to which those partitions belong must be marked for truncation. The lag in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If truncationOffset is smaller than the original state.fetchOffset, that lag is no longer correct and needs to be updated: it should be the original lag plus the difference between the original state.fetchOffset and truncationOffset. If the stale lag is used, ReplicaAlterLogDirsThread may set a wrong lag in fetcherLagStats when executing "processFetchRequest", so it may be more reasonable to recalculate the lag based on the new state.fetchOffset. was: When the partitions of ReplicaFetcherThread finished truncating, the ReplicaAlterLogDirsThread to which these partitions belong needs to be marked truncate. The lag value in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If the truncationOffset is smaller than the original state.fetchOffset, then the original lag value is incorrect and needs to be updated. It should be the original lag value plus the difference between the original state.fetchOffset and truncationOffset. If the original lag value is used incorrectly, then ReplicaAlterLogDirsThread may set the wrong lag value for fetcherLagStats when executing processFetchRequest. So it might be more reasonable to recalculate a lag value based on new state.fetchOffset. > Incorrect lag was calculated when markPartitionsForTruncation in > ReplicaAlterLogDirsThread > -- > > Key: KAFKA-14979 > URL: https://issues.apache.org/jira/browse/KAFKA-14979 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 3.3.2 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > When the partitions of a ReplicaFetcherThread finish truncating, the ReplicaAlterLogDirsThread to which those partitions belong must be marked for truncation. The lag in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If truncationOffset is smaller than the original state.fetchOffset, that lag is no longer correct and needs to be updated: it should be the original lag plus the difference between the original state.fetchOffset and truncationOffset. > If the stale lag is used, ReplicaAlterLogDirsThread may set a wrong lag in fetcherLagStats when executing "processFetchRequest", so it may be more reasonable to recalculate the lag based on the new state.fetchOffset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread
[ https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-14979: --- Description: When the partitions of a ReplicaFetcherThread finish truncating, the ReplicaAlterLogDirsThread to which those partitions belong must be marked for truncation. The lag in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If truncationOffset is smaller than the original state.fetchOffset, that lag is no longer correct and needs to be updated: it should be the original lag plus the difference between the original state.fetchOffset and truncationOffset. If the stale lag is used, ReplicaAlterLogDirsThread may set a wrong lag in fetcherLagStats when executing processFetchRequest, so it may be more reasonable to recalculate the lag based on the new state.fetchOffset. > Incorrect lag was calculated when markPartitionsForTruncation in > ReplicaAlterLogDirsThread > -- > > Key: KAFKA-14979 > URL: https://issues.apache.org/jira/browse/KAFKA-14979 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 3.3.2 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > When the partitions of a ReplicaFetcherThread finish truncating, the ReplicaAlterLogDirsThread to which those partitions belong must be marked for truncation. The lag in the newState (PartitionFetchState) obtained in this process is still the original value (state.lag). If truncationOffset is smaller than the original state.fetchOffset, that lag is no longer correct and needs to be updated: it should be the original lag plus the difference between the original state.fetchOffset and truncationOffset. > If the stale lag is used, ReplicaAlterLogDirsThread may set a wrong lag in fetcherLagStats when executing processFetchRequest, so it may be more reasonable to recalculate the lag based on the new state.fetchOffset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
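The lag correction described in KAFKA-14979 can be sketched as plain Python. The real logic lives in the Scala fetcher-thread code; this is only an illustration of the arithmetic the issue proposes.

```python
# Sketch: when a partition is truncated back below the old fetchOffset,
# the old lag undercounts by the truncated span, so the corrected lag is
# old lag + (old fetchOffset - truncationOffset).

def corrected_lag(old_lag: int, old_fetch_offset: int, truncation_offset: int) -> int:
    """Lag after marking a partition for truncation."""
    if truncation_offset < old_fetch_offset:
        return old_lag + (old_fetch_offset - truncation_offset)
    return old_lag  # no truncation below fetchOffset: lag unchanged

# Leader end offset 1000, old fetchOffset 900 => old lag 100.
# Truncating back to 850 means 50 more offsets must be re-fetched:
print(corrected_lag(old_lag=100, old_fetch_offset=900, truncation_offset=850))  # 150
```

Equivalently, recomputing the lag from the new fetchOffset (leader end offset 1000 minus truncation offset 850) gives the same 150, which is the recalculation the issue suggests.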
[jira] [Assigned] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread
[ https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi reassigned KAFKA-14979: -- Assignee: hudeqi > Incorrect lag was calculated when markPartitionsForTruncation in > ReplicaAlterLogDirsThread > -- > > Key: KAFKA-14979 > URL: https://issues.apache.org/jira/browse/KAFKA-14979 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 3.3.2 >Reporter: hudeqi >Assignee: hudeqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread
hudeqi created KAFKA-14979: -- Summary: Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread Key: KAFKA-14979 URL: https://issues.apache.org/jira/browse/KAFKA-14979 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 3.3.2 Reporter: hudeqi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14907) Add the traffic metric of the partition dimension in BrokerTopicStats
[ https://issues.apache.org/jira/browse/KAFKA-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-14907: --- Labels: KIP-922 (was: ) > Add the traffic metric of the partition dimension in BrokerTopicStats > - > > Key: KAFKA-14907 > URL: https://issues.apache.org/jira/browse/KAFKA-14907 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 3.3.2 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > Labels: KIP-922 > > {color:#172b4d}Currently there are two metrics for measuring topic-level traffic, MessagesInPerSec and BytesInPerSec, but they have two problems:{color} > {color:#172b4d}1. They cannot intuitively reveal traffic skew across a topic's partitions: it is impossible to see which partition carries the most traffic or how traffic is distributed among them. Partition-level metrics can show this.{color} > {color:#172b4d}2. When topic traffic increases suddenly (which can, for example, push some followers out of the ISR, something that can be avoided by appropriately increasing the number of partitions), partition-level metrics help decide whether to expand the partition count.{color} > {color:#172b4d}Overall, I think adding partition-level traffic metrics is very useful, especially for the issue of traffic skew.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14907) Add the traffic metric of the partition dimension in BrokerTopicStats
[ https://issues.apache.org/jira/browse/KAFKA-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hudeqi updated KAFKA-14907: --- Description: {color:#172b4d}Currently there are two metrics for measuring topic-level traffic, MessagesInPerSec and BytesInPerSec, but they have two problems:{color} {color:#172b4d}1. They cannot intuitively reveal traffic skew across a topic's partitions: it is impossible to see which partition carries the most traffic or how traffic is distributed among them. Partition-level metrics can show this.{color} {color:#172b4d}2. When topic traffic increases suddenly (which can, for example, push some followers out of the ISR, something that can be avoided by appropriately increasing the number of partitions), partition-level metrics help decide whether to expand the partition count.{color} {color:#172b4d}Overall, I think adding partition-level traffic metrics is very useful, especially for the issue of traffic skew.{color} was: Currently, there are three indicators for measuring topic dimensions: MessagesInPerSec, BytesInPerSec, and BytesOutPerSec, but there are two problems: 1. It is difficult to intuitively reflect the problem of topic partition traffic inclination through these indicators, and it is impossible to clearly see which partition has the largest traffic and the traffic situation of each partition. 2. For the sudden increase in topic traffic, the indicators of the partition dimension can provide guidance on whether to expand the partition and roughly how many partitions to expand. > Add the traffic metric of the partition dimension in BrokerTopicStats > - > > Key: KAFKA-14907 > URL: https://issues.apache.org/jira/browse/KAFKA-14907 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 3.3.2 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > > {color:#172b4d}Currently there are two metrics for measuring topic-level traffic, MessagesInPerSec and BytesInPerSec, but they have two problems:{color} > {color:#172b4d}1. They cannot intuitively reveal traffic skew across a topic's partitions: it is impossible to see which partition carries the most traffic or how traffic is distributed among them. Partition-level metrics can show this.{color} > {color:#172b4d}2. When topic traffic increases suddenly (which can, for example, push some followers out of the ISR, something that can be avoided by appropriately increasing the number of partitions), partition-level metrics help decide whether to expand the partition count.{color} > {color:#172b4d}Overall, I think adding partition-level traffic metrics is very useful, especially for the issue of traffic skew.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
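A minimal sketch of what partition-level traffic counters could look like. The class and method names below are illustrative, not Kafka's actual BrokerTopicStats API; the point is that keying counters by (topic, partition) instead of topic alone makes traffic skew directly observable.

```python
# Sketch: per-partition traffic meters for spotting skew.
from collections import defaultdict

class PartitionTrafficStats:
    def __init__(self):
        # (topic, partition) -> running message / byte counters
        self.messages_in = defaultdict(int)
        self.bytes_in = defaultdict(int)

    def record_append(self, topic: str, partition: int,
                      num_messages: int, num_bytes: int) -> None:
        key = (topic, partition)
        self.messages_in[key] += num_messages
        self.bytes_in[key] += num_bytes

    def hottest_partition(self, topic: str):
        """Partition of `topic` with the most bytes in; highlights skew."""
        candidates = {k: v for k, v in self.bytes_in.items() if k[0] == topic}
        return max(candidates, key=candidates.get) if candidates else None

stats = PartitionTrafficStats()
stats.record_append("orders", 0, num_messages=10, num_bytes=5_000)
stats.record_append("orders", 1, num_messages=500, num_bytes=900_000)  # skewed
print(stats.hottest_partition("orders"))  # ('orders', 1)
```

A topic-level BytesInPerSec would report 905 KB total and hide the skew; the per-partition view immediately shows partition 1 carrying nearly all of it.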
[jira] [Commented] (KAFKA-14906) Extract the coordinator service log from server log
[ https://issues.apache.org/jira/browse/KAFKA-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712337#comment-17712337 ] hudeqi commented on KAFKA-14906: This change will be reintroduced in version 4.x. > Extract the coordinator service log from server log > --- > > Key: KAFKA-14906 > URL: https://issues.apache.org/jira/browse/KAFKA-14906 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 4.0.0 > Reporter: hudeqi > Assignee: hudeqi > Priority: Major > Fix For: 4.0.0 > > > Currently the coordinator service log and the server log are mixed together. When troubleshooting a coordinator problem, the relevant lines must be filtered out of the server log, which is inconvenient. Therefore the coordinator log should be split into its own file, as is already done for the controller log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
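As a sketch of what the proposal could look like: Kafka's shipped log4j.properties already routes the kafka.controller logger to a dedicated controller.log appender, and a coordinator logger could be split the same way. The appender name, file name, and logger choice below are illustrative assumptions modeled on that existing pattern, not the actual change.

```properties
# Hypothetical log4j.properties fragment: route group-coordinator logging
# (e.g. kafka.coordinator.group.GroupCoordinator) to its own file, mirroring
# how controller.log is configured.
log4j.appender.coordinatorAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.coordinatorAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.coordinatorAppender.File=${kafka.logs.dir}/coordinator.log
log4j.appender.coordinatorAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.coordinatorAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.logger.kafka.coordinator.group=INFO, coordinatorAppender
# Stop coordinator lines from also flowing into server.log:
log4j.additivity.kafka.coordinator.group=false
```

Setting additivity to false is what actually removes the coordinator lines from server.log; without it the messages would be duplicated in both files.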