[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`

2024-08-05 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16710:
---
Affects Version/s: 3.8.0

> Continuously `makeFollower` may cause the replica fetcher thread to encounter 
> an offset mismatch exception when `processPartitionData`
> --
>
> Key: KAFKA-16710
> URL: https://issues.apache.org/jira/browse/KAFKA-16710
> Project: Kafka
>  Issue Type: Bug
>  Components: core, replication
>Affects Versions: 2.8.1, 3.8.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png, 
> 企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png, 
> 企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
>
>
> The scenario where this case occurs is during a reassignment of a partition: 
> 110879, 110880 (original leader, original follower) ---> 110879, 110880, 
> 110881, 113915 (the latter two replicas are the new leader and new follower) 
> ---> 110881, 113915 (new leader, new follower). The "Offset mismatch" 
> exception occurs on the new follower 113915.
> Analysis shows that the exception arises during the reassignment process:
>  # After the new replicas 110881, 113915 have fully joined the ISR, the 
> controller switches the leader from 110879 to 110881 and sends a new 
> `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
> 113915) to 110881, 113915.
>  # At this point, 110881 executes `makeLeader` and 113915 executes 
> `makeFollower`. After the new follower 113915 completes 
> `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts 
> fetching data from the new leader 110881. Because the log end offset of the 
> new leader 110881 (18735600055) is smaller than the log end offset of the 
> new follower 113915 (18735600059), the new follower 113915 adds the partition 
> to `divergingEndOffsets` during `processFetchRequest` and then calls 
> `truncateOnFetchResponse` to truncate the local log to 18735600055.
>  # However, `truncateOnFetchResponse` must first acquire `partitionMapLock`. 
> At the same time, the new leader 110881 and the new follower 113915 receive 
> another `leaderAndIsr` request from the controller (to remove the old 
> replicas 110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of 
> the new follower 113915 executes a second `makeFollower`: it acquires 
> `partitionMapLock` first, executes `removeFetcherForPartitions`, reads the 
> local log end offset (18735600059) as the fetch offset, and prepares to 
> execute `addFetcherForPartitions` again to write that fetch offset 
> (18735600059) into `partitionStates`.
>  # Unfortunately, the follower fetcher thread that was about to truncate the 
> local log to 18735600055 acquires `partitionMapLock` first and completes the 
> truncation, so the log end offset is now 18735600055.
>  # Then the thread that executed the second `makeFollower` acquires 
> `partitionMapLock` and executes `addFetcherForPartitions`, writing the now 
> outdated fetch offset (18735600059) into `partitionStates`.
>  # As a result, the follower fetcher thread throws the following exception in 
> `processPartitionData`: "java.lang.IllegalStateException: Offset mismatch for 
> partition aiops-adplatform-interfacelog-191: fetched offset = 18735600059, 
> log end offset = 18735600055."
>  
> The relevant logs are attached.
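
To make the interleaving above easier to follow, here is a minimal, self-contained 
Java model of the race (not the broker's actual code): `partitionMapLock`, the 
local log end offset, and the fetch offset kept in `partitionStates` are reduced 
to a lock, an `AtomicLong`, and a `volatile` field, and the two threads stand in 
for the second `makeFollower` call and the replica fetcher thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class StaleFetchOffsetRace {
    // Stand-ins: the lock plays the role of partitionMapLock, logEndOffset the
    // follower's local log, and fetchOffsetInPartitionStates the entry that
    // addFetcherForPartitions writes into partitionStates.
    static final ReentrantLock partitionMapLock = new ReentrantLock();
    static final AtomicLong logEndOffset = new AtomicLong(18735600059L);
    static volatile long fetchOffsetInPartitionStates = -1L;

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch leoSampled = new CountDownLatch(1);
        CountDownLatch truncated = new CountDownLatch(1);

        // Second makeFollower: samples the log end offset (18735600059) before the
        // fetcher truncates, then later records that stale value as the fetch offset.
        Thread makeFollower = new Thread(() -> {
            long sampledLeo = logEndOffset.get();   // removeFetcherForPartitions + LEO read
            leoSampled.countDown();
            await(truncated);                       // loses the race for the lock
            partitionMapLock.lock();
            try {
                fetchOffsetInPartitionStates = sampledLeo;  // stale 18735600059
            } finally {
                partitionMapLock.unlock();
            }
        });

        // Replica fetcher thread: truncates the local log to the leader's end offset
        // (18735600055) while holding the lock, as truncateOnFetchResponse would.
        Thread fetcher = new Thread(() -> {
            await(leoSampled);
            partitionMapLock.lock();
            try {
                logEndOffset.set(18735600055L);
            } finally {
                partitionMapLock.unlock();
            }
            truncated.countDown();
        });

        makeFollower.start();
        fetcher.start();
        makeFollower.join();
        fetcher.join();

        // Prints fetched offset = 18735600059, log end offset = 18735600055: the
        // condition that makes processPartitionData throw IllegalStateException.
        System.out.printf("fetched offset = %d, log end offset = %d%n",
                fetchOffsetInPartitionStates, logEndOffset.get());
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```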



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`

2024-05-13 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16710:
---
Attachment: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png
企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png
企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png

> Continuously `makeFollower` may cause the replica fetcher thread to encounter 
> an offset mismatch exception when `processPartitionData`
> --
>
> Key: KAFKA-16710
> URL: https://issues.apache.org/jira/browse/KAFKA-16710
> Project: Kafka
>  Issue Type: Bug
>  Components: core, replication
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png, 
> 企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png, 
> 企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
>
>
> The scenario where this case occurs is during a reassignment of a partition: 
> 110879, 110880 (original leader, original follower) ---> 110879, 110880, 
> 110881, 113915 (the latter two replicas are the new leader and new follower) 
> ---> 110881, 113915 (new leader, new follower). The "Offset mismatch" 
> exception occurs on the new follower 113915.
> Analysis shows that the exception arises during the reassignment process:
>  # After the new replicas 110881, 113915 have fully joined the ISR, the 
> controller switches the leader from 110879 to 110881 and sends a new 
> `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
> 113915) to 110881, 113915.
>  # At this point, 110881 executes `makeLeader` and 113915 executes 
> `makeFollower`. After the new follower 113915 completes 
> `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts 
> fetching data from the new leader 110881. Because the log end offset of the 
> new leader 110881 (18735600055) is smaller than the log end offset of the 
> new follower 113915 (18735600059), the new follower 113915 adds the partition 
> to `divergingEndOffsets` during `processFetchRequest` and then calls 
> `truncateOnFetchResponse` to truncate the local log to 18735600055.
>  # However, `truncateOnFetchResponse` must first acquire `partitionMapLock`. 
> At the same time, the new leader 110881 and the new follower 113915 receive 
> another `leaderAndIsr` request from the controller (to remove the old 
> replicas 110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of 
> the new follower 113915 executes a second `makeFollower`: it acquires 
> `partitionMapLock` first, executes `removeFetcherForPartitions`, reads the 
> local log end offset (18735600059) as the fetch offset, and prepares to 
> execute `addFetcherForPartitions` again to write that fetch offset 
> (18735600059) into `partitionStates`.
>  # Unfortunately, the follower fetcher thread that was about to truncate the 
> local log to 18735600055 acquires `partitionMapLock` first and completes the 
> truncation, so the log end offset is now 18735600055.
>  # Then the thread that executed the second `makeFollower` acquires 
> `partitionMapLock` and executes `addFetcherForPartitions`, writing the now 
> outdated fetch offset (18735600059) into `partitionStates`.
>  # As a result, the follower fetcher thread throws the following exception in 
> `processPartitionData`: "java.lang.IllegalStateException: Offset mismatch for 
> partition aiops-adplatform-interfacelog-191: fetched offset = 18735600059, 
> log end offset = 18735600055."
>  
> The relevant logs are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`

2024-05-13 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16710:
---
Description: 
The scenario where this case occurs is during a reassignment of a partition: 
110879, 110880 (original leader, original follower) ---> 110879, 110880, 
110881, 113915 (the latter two replicas are the new leader and new follower) 
---> 110881, 113915 (new leader, new follower). The "Offset mismatch" exception 
occurs on the new follower 113915.

Analysis shows that the exception arises during the reassignment process:
 # After the new replicas 110881, 113915 have fully joined the ISR, the 
controller switches the leader from 110879 to 110881 and sends a new 
`leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
113915) to 110881, 113915.
 # At this point, 110881 executes `makeLeader` and 113915 executes 
`makeFollower`. After the new follower 113915 completes 
`removeFetcherForPartitions` and `addFetcherForPartitions`, it starts fetching 
data from the new leader 110881. Because the log end offset of the new leader 
110881 (18735600055) is smaller than the log end offset of the new follower 
113915 (18735600059), the new follower 113915 adds the partition to 
`divergingEndOffsets` during `processFetchRequest` and then calls 
`truncateOnFetchResponse` to truncate the local log to 18735600055.
 # However, `truncateOnFetchResponse` must first acquire `partitionMapLock`. At 
the same time, the new leader 110881 and the new follower 113915 receive 
another `leaderAndIsr` request from the controller (to remove the old replicas 
110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of the new 
follower 113915 executes a second `makeFollower`: it acquires 
`partitionMapLock` first, executes `removeFetcherForPartitions`, reads the 
local log end offset (18735600059) as the fetch offset, and prepares to execute 
`addFetcherForPartitions` again to write that fetch offset (18735600059) into 
`partitionStates`.
 # Unfortunately, the follower fetcher thread that was about to truncate the 
local log to 18735600055 acquires `partitionMapLock` first and completes the 
truncation, so the log end offset is now 18735600055.
 # Then the thread that executed the second `makeFollower` acquires 
`partitionMapLock` and executes `addFetcherForPartitions`, writing the now 
outdated fetch offset (18735600059) into `partitionStates`.
 # As a result, the follower fetcher thread throws the following exception in 
`processPartitionData`: "java.lang.IllegalStateException: Offset mismatch for 
partition aiops-adplatform-interfacelog-191: fetched offset = 18735600059, log 
end offset = 18735600055."

The relevant logs are attached.
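
For reference, a simplified sketch of the consistency check that produces this 
error: before appending fetched records, the follower verifies that the fetch 
offset recorded for the partition still matches its local log end offset. The 
class and method names below are illustrative, not the exact broker code.

```java
final class FollowerAppendCheck {
    // Illustrative only: the real check lives in the broker's follower append path.
    static void appendAsFollower(String topicPartition, long fetchOffset, long logEndOffset) {
        if (fetchOffset != logEndOffset) {
            throw new IllegalStateException(String.format(
                "Offset mismatch for partition %s: fetched offset = %d, log end offset = %d",
                topicPartition, fetchOffset, logEndOffset));
        }
        // ... append the fetched records to the local log ...
    }
}
```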

  was:
The scenario where this case occurs is during a reassignment of a partition: 
110879, 110880 (original leader, original follower) ---> 110879, 110880, 
110881, 113915 (the latter two replicas are new leader and new follower) ---> 
110881, 113915 (new leader, new follower). The "Offset mismatch" exception 
occurs on the new follower 113915.

Through analysis, the exception occurs in the reassignment process:
 # After the new replicas 110881, 113915 are fully enqueued into the ISR, the 
controller will switch the leader from 110879 to 110881, and then send a new 
`leaderAndIsr` (leader is 110881, ISR is 110879, 110880, 110881, 113915) to 
110881, 113915.
 # This time, 110881 executes `makeLeader`, and 113915 executes `makeFollower`. 
After the new follower 113915 completes `removeFetcherForPartitions` and 
`addFetcherForPartitions`, it starts fetching data from the new leader 110881, 
but because the log end offset of the new leader 110881 (18735600055) is 
smaller than the log end offset of the new follower 113915 (18735600059), the 
new follower 113915 adds the partition to `divergingEndOffsets` during 
`processFetchRequest` and then executes `truncateOnFetchResponse` to truncate 
the local log to 18735600055.
 # However, unfortunately, `truncateOnFetchResponse` needs to acquire the 
`partitionMapLock` lock, and at the same time, the new leader 110881 and the 
new follower 113915 also receive another `leaderAndIsr` request from the 
controller (to remove the old replicas 110879, 110880 from the ISR), and the 
`ReplicaFetcherManager` thread of the new follower 113915 executes the second 
`makeFollower` to acquire the `partitionMapLock` lock firstly and execute 
`removeFetcherForPartitions`, and then gets the local log end offset 
(18735600059) as the fetch offset, ready to execute `addFetcherForPartitions` 
again to update the fetch offset (18735600059) to the `partitionStates`.
 # But unfortunately, the follower fetcher thread that was ready to truncate 
the local log to 18735600055 firstly obtained the `partitionMapLock` lock and 
completed the truncation, and the log end offset is now 18735600055.

[jira] [Updated] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`

2024-05-13 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16710:
---
Description: 
The scenario where this case occurs is during a reassignment of a partition: 
110879, 110880 (original leader, original follower) ---> 110879, 110880, 
110881, 113915 (the latter two replicas are the new leader and new follower) 
---> 110881, 113915 (new leader, new follower). The "Offset mismatch" exception 
occurs on the new follower 113915.

Analysis shows that the exception arises during the reassignment process:
 # After the new replicas 110881, 113915 have fully joined the ISR, the 
controller switches the leader from 110879 to 110881 and sends a new 
`leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
113915) to 110881, 113915.
 # At this point, 110881 executes `makeLeader` and 113915 executes 
`makeFollower`. After the new follower 113915 completes 
`removeFetcherForPartitions` and `addFetcherForPartitions`, it starts fetching 
data from the new leader 110881. Because the log end offset of the new leader 
110881 (18735600055) is smaller than the log end offset of the new follower 
113915 (18735600059), the new follower 113915 adds the partition to 
`divergingEndOffsets` during `processFetchRequest` and then calls 
`truncateOnFetchResponse` to truncate the local log to 18735600055.
 # However, `truncateOnFetchResponse` must first acquire `partitionMapLock`. At 
the same time, the new leader 110881 and the new follower 113915 receive 
another `leaderAndIsr` request from the controller (to remove the old replicas 
110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of the new 
follower 113915 executes a second `makeFollower`: it acquires 
`partitionMapLock` first, executes `removeFetcherForPartitions`, reads the 
local log end offset (18735600059) as the fetch offset, and prepares to execute 
`addFetcherForPartitions` again to write that fetch offset (18735600059) into 
`partitionStates`.
 # Unfortunately, the follower fetcher thread that was about to truncate the 
local log to 18735600055 acquires `partitionMapLock` first and completes the 
truncation, so the log end offset is now 18735600055.
 # Then the thread that executed the second `makeFollower` acquires 
`partitionMapLock` and executes `addFetcherForPartitions`, writing the now 
outdated fetch offset (18735600059) into `partitionStates`.
 # As a result, the follower fetcher thread throws the following exception in 
`processPartitionData`: "java.lang.IllegalStateException: Offset mismatch for 
partition aiops-adplatform-interfacelog-191: fetched offset = 18735600059, log 
end offset = 18735600055."

> Continuously `makeFollower` may cause the replica fetcher thread to encounter 
> an offset mismatch exception when `processPartitionData`
> --
>
> Key: KAFKA-16710
> URL: https://issues.apache.org/jira/browse/KAFKA-16710
> Project: Kafka
>  Issue Type: Bug
>  Components: core, replication
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
>
> The scenario where this case occurs is during a reassignment of a partition: 
> 110879, 110880 (original leader, original follower) ---> 110879, 110880, 
> 110881, 113915 (the latter two replicas are the new leader and new follower) 
> ---> 110881, 113915 (new leader, new follower). The "Offset mismatch" 
> exception occurs on the new follower 113915.
> Analysis shows that the exception arises during the reassignment process:
>  # After the new replicas 110881, 113915 have fully joined the ISR, the 
> controller switches the leader from 110879 to 110881 and sends a new 
> `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
> 113915) to 110881, 113915.
>  # At this point, 110881 executes `makeLeader` and 113915 executes 
> `makeFollower`. After the new follower 113915 completes 
> `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts 
> fetching data from the new leader 110881. Because the log end offset of the 
> new leader 110881 (18735600055) is smaller than the log end offset of the 
> new follower 113915 (18735600059), the new follower 113915 adds the partition 
> to `divergingEndOffsets` during `processFetchRequest` and then calls 
> `truncateOnFetchResponse` to truncate the local log to 18735600055.
>  # However, `truncateOnFetchResponse` must first acquire `partitionMapLock`. 
> At the same time, the new leader 110881 and the new follower 113915 receive 
> another `leaderAndIsr` request from the controller (to remove the old 
> replicas 110879, 110880 from the ISR)

[jira] [Created] (KAFKA-16710) Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`

2024-05-13 Thread hudeqi (Jira)
hudeqi created KAFKA-16710:
--

 Summary: Continuously `makeFollower` may cause the replica fetcher 
thread to encounter an offset mismatch exception when `processPartitionData`
 Key: KAFKA-16710
 URL: https://issues.apache.org/jira/browse/KAFKA-16710
 Project: Kafka
  Issue Type: Bug
  Components: core, replication
Affects Versions: 2.8.1
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16543) There may be ambiguous deletions in the `cleanupGroupMetadata` when the generation of the group is less than or equal to 0

2024-04-12 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16543:
---
Description: 
In the `cleanupGroupMetadata` method, a tombstone message is written to delete 
the group's MetadataKey only when the group is in the Dead state and the 
generation is greater than 0. The comment indicates: 'We avoid writing the 
tombstone when the generationId is 0, since this group is only using Kafka for 
offset storage.' This means that groups that only use Kafka for offset storage 
should not be deleted. However, there is a situation where, for example, Flink 
commits offsets with a generationId equal to -1. If ApiKeys.DELETE_GROUPS is 
called to delete such a group, Flink's group metadata will never be deleted. 
Yet the logic above has already cleaned up the commit keys by writing tombstone 
messages for `removedOffsets`. The actual manifestation is therefore: the group 
no longer exists (since the offsets have been cleaned up, there is no 
possibility of adding the group back to the `groupMetadataCache` unless offsets 
are committed again with the same group name), but the corresponding group 
metadata information still exists in __consumer_offsets. As a result, deleting 
the group does not completely clean up its related information.

The group's state is set to Dead only in the following three situations:
1. The group information is unloaded
2. The group is deleted by ApiKeys.DELETE_GROUPS
3. All offsets of the group have expired or been removed.

Therefore, since the group is already in the Dead state and has been removed 
from the `groupMetadataCache`, why not directly clean up all of the group's 
information, even if it is only used for storing offsets?
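
For illustration, a tombstone is simply a record with a null value for a given 
key. The sketch below shows, with a plain producer and hypothetical key 
encodings (`groupMetadataKey`, `offsetCommitKey` are made-up stand-ins), what 
"writing tombstones for the offset commit keys and the group metadata key 
regardless of generation" would amount to; the real coordinator appends these 
records internally rather than through a client producer.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

final class GroupCleanupSketch {
    // Hypothetical key encodings standing in for the coordinator's real key schemas.
    static byte[] groupMetadataKey(String groupId) {
        return ("group-metadata:" + groupId).getBytes(StandardCharsets.UTF_8);
    }

    static byte[] offsetCommitKey(String groupId, TopicPartition tp) {
        return ("offset-commit:" + groupId + ":" + tp).getBytes(StandardCharsets.UTF_8);
    }

    /** Write tombstones (records with a null value) for a dead group's committed
     *  offsets and, per the proposal above, for its metadata key as well,
     *  regardless of the group's generation. */
    static void cleanupDeadGroup(Producer<byte[], byte[]> producer, String offsetsTopic,
                                 String groupId, List<TopicPartition> removedOffsets) {
        for (TopicPartition tp : removedOffsets) {
            producer.send(new ProducerRecord<>(offsetsTopic, offsetCommitKey(groupId, tp), null));
        }
        producer.send(new ProducerRecord<>(offsetsTopic, groupMetadataKey(groupId), null));
    }
}
```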

  was:
In the `cleanupGroupMetadata` method, tombstone messages is written to delete 
the group's MetadataKey only when the group is in the Dead state and the 
generation is greater than 0. The comment indicates: 'We avoid writing the 
tombstone when the generationId is 0, since this group is only using Kafka for 
offset storage.' This means that groups that only use Kafka for offset storage 
should not be deleted. However, there is a situation where, for example, Flink 
commit offsets with a generationId equal to -1. If the ApiKeys.DELETE_GROUPS is 
called to delete this group, Flink's group metadata will never be deleted. Yet, 
the logic above has already cleaned up commitKey by writing tombstone messages 
with removedOffsets. Therefore, the actual manifestation is: the group no 
longer exists (since the offsets have been cleaned up, there is no possibility 
of adding the group back to the `groupMetadataCache` unless offsets are 
committed again with the same group name), but the corresponding group metadata 
information still exists in __consumer_offsets. This leads to the problem that 
deleting the group does not completely clean up its related information.

The group's state is set to Dead only in the following three situations:
1. The group information is unloaded
2. The group is deleted by ApiKeys.DELETE_GROUPS
3. All offsets of the group have expired or removed.

Therefore, since the group is already in the Dead state and has been removed 
from the `groupMetadataCache`, why not directly clean up all the information of 
the group? Even if it is only used for storing offsets.


> There may be ambiguous deletions in the `cleanupGroupMetadata` when the 
> generation of the group is less than or equal to 0
> --
>
> Key: KAFKA-16543
> URL: https://issues.apache.org/jira/browse/KAFKA-16543
> Project: Kafka
>  Issue Type: Bug
>  Components: group-coordinator
>Affects Versions: 3.6.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> In the `cleanupGroupMetadata` method, a tombstone message is written to 
> delete the group's MetadataKey only when the group is in the Dead state and 
> the generation is greater than 0. The comment indicates: 'We avoid writing 
> the tombstone when the generationId is 0, since this group is only using 
> Kafka for offset storage.' This means that groups that only use Kafka for 
> offset storage should not be deleted. However, there is a situation where, 
> for example, Flink commits offsets with a generationId equal to -1. If 
> ApiKeys.DELETE_GROUPS is called to delete such a group, Flink's group 
> metadata will never be deleted. Yet the logic above has already cleaned up 
> the commit keys by writing tombstone messages for `removedOffsets`. The 
> actual manifestation is therefore: the group no longer exists (since the 
> offsets have been cleaned up, there is no possibility of adding the group 
> back to the `groupMetadataCache` unless offsets are committed again with the 
> same gr

[jira] [Created] (KAFKA-16543) There may be ambiguous deletions in the `cleanupGroupMetadata` when the generation of the group is less than or equal to 0

2024-04-12 Thread hudeqi (Jira)
hudeqi created KAFKA-16543:
--

 Summary: There may be ambiguous deletions in the 
`cleanupGroupMetadata` when the generation of the group is less than or equal 
to 0
 Key: KAFKA-16543
 URL: https://issues.apache.org/jira/browse/KAFKA-16543
 Project: Kafka
  Issue Type: Bug
  Components: group-coordinator
Affects Versions: 3.6.2
Reporter: hudeqi
Assignee: hudeqi


In the `cleanupGroupMetadata` method, a tombstone message is written to delete 
the group's MetadataKey only when the group is in the Dead state and the 
generation is greater than 0. The comment indicates: 'We avoid writing the 
tombstone when the generationId is 0, since this group is only using Kafka for 
offset storage.' This means that groups that only use Kafka for offset storage 
should not be deleted. However, there is a situation where, for example, Flink 
commits offsets with a generationId equal to -1. If ApiKeys.DELETE_GROUPS is 
called to delete such a group, Flink's group metadata will never be deleted. 
Yet the logic above has already cleaned up the commit keys by writing tombstone 
messages for `removedOffsets`. The actual manifestation is therefore: the group 
no longer exists (since the offsets have been cleaned up, there is no 
possibility of adding the group back to the `groupMetadataCache` unless offsets 
are committed again with the same group name), but the corresponding group 
metadata information still exists in __consumer_offsets. As a result, deleting 
the group does not completely clean up its related information.

The group's state is set to Dead only in the following three situations:
1. The group information is unloaded
2. The group is deleted by ApiKeys.DELETE_GROUPS
3. All offsets of the group have expired or been removed.

Therefore, since the group is already in the Dead state and has been removed 
from the `groupMetadataCache`, why not directly clean up all of the group's 
information, even if it is only used for storing offsets?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15764) Missing tests for transactions

2023-10-31 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15764:
--

Assignee: hudeqi

> Missing tests for transactions
> --
>
> Key: KAFKA-15764
> URL: https://issues.apache.org/jira/browse/KAFKA-15764
> Project: Kafka
>  Issue Type: Test
>Affects Versions: 3.6.0
>Reporter: Divij Vaidya
>Assignee: hudeqi
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/KAFKA-15653 we found a bug 
> which was not caught by any of the existing tests in Apache Kafka.
> However, the bug is consistently reproducible if we use the test suite shared 
> by [~twmb] at 
> https://issues.apache.org/jira/browse/KAFKA-15653?focusedCommentId=1846&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1846
>  
> As part of this JIRA, we want to add a relevant test in Apache Kafka that 
> would have failed. We can look at the test suite that the franz-go repository 
> is running, more specifically the test suite executed by "go test -run Txn" 
> (see repro.sh attached to the comment linked above).
> With reference to the code in repro.sh, we don't need to do docker compose 
> down or up in the while loop. Hence, it's OK to simply do
> ```
> while go test -run Txn > FRANZ_FAIL; do
> done
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15764) Missing tests for transactions

2023-10-31 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781288#comment-17781288
 ] 

hudeqi commented on KAFKA-15764:


I can pick it up.

> Missing tests for transactions
> --
>
> Key: KAFKA-15764
> URL: https://issues.apache.org/jira/browse/KAFKA-15764
> Project: Kafka
>  Issue Type: Test
>Affects Versions: 3.6.0
>Reporter: Divij Vaidya
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/KAFKA-15653 we found a bug 
> which was not caught by any of the existing tests in Apache Kafka.
> However, the bug is consistently reproducible if we use the test suite shared 
> by [~twmb] at 
> https://issues.apache.org/jira/browse/KAFKA-15653?focusedCommentId=1846&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1846
>  
> As part of this JIRA, we want to add a relevant test in Apache Kafka that 
> would have failed. We can look at the test suite that the franz-go repository 
> is running, more specifically the test suite executed by "go test -run Txn" 
> (see repro.sh attached to the comment linked above).
> With reference to the code in repro.sh, we don't need to do docker compose 
> down or up in the while loop. Hence, it's OK to simply do
> ```
> while go test -run Txn > FRANZ_FAIL; do
> done
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15313) Delete remote log segments partition asynchronously when a partition is deleted.

2023-10-31 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781257#comment-17781257
 ] 

hudeqi commented on KAFKA-15313:


Hi, [~abhijeetkumar]   Are you still following this issue? If you don't have 
time, I can take over. Thanks.

> Delete remote log segments partition asynchronously when a partition is 
> deleted. 
> -
>
> Key: KAFKA-15313
> URL: https://issues.apache.org/jira/browse/KAFKA-15313
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Satish Duggana
>Assignee: Abhijeet Kumar
>Priority: Major
>  Labels: KIP-405
>
> KIP-405 already covers the approach to delete remote log segments 
> asynchronously through controller and RLMM layers. 
> reference: https://github.com/apache/kafka/pull/13947#discussion_r1281675818



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15331) Handle remote log enabled topic deletion when leader is not available

2023-10-31 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15331:
--

Assignee: hudeqi

> Handle remote log enabled topic deletion when leader is not available
> -
>
> Key: KAFKA-15331
> URL: https://issues.apache.org/jira/browse/KAFKA-15331
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kamal Chandraprakash
>Assignee: hudeqi
>Priority: Major
> Fix For: 3.7.0
>
>
> When a topic gets deleted, then there can be a case where all the replicas 
> can be out of ISR. This case is not handled, See: 
> [https://github.com/apache/kafka/pull/13947#discussion_r1289331347] for more 
> details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15432) RLM Stop partitions should not be invoked for non-tiered storage topics

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780818#comment-17780818
 ] 

hudeqi commented on KAFKA-15432:


I will take it over if no one is following this. Thanks [~phuctran] 
[~ckamal] 

> RLM Stop partitions should not be invoked for non-tiered storage topics
> ---
>
> Key: KAFKA-15432
> URL: https://issues.apache.org/jira/browse/KAFKA-15432
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Kamal Chandraprakash
>Assignee: hudeqi
>Priority: Major
>
> When a stop partition request is sent by the controller, it invokes 
> RemoteLogManager#stopPartition even for internal and 
> non-tiered-storage-enabled topics. The replica manager should not call this 
> method for non-tiered-storage topics.
> See this 
> [comment|https://github.com/apache/kafka/pull/14329#discussion_r1315675510] 
> for more details.
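
A minimal sketch of the kind of guard being proposed, with simplified stand-ins 
for the ReplicaManager internals: `RemoteLogStopper` and the `isInternal` and 
`remoteStorageEnabled` flags are illustrative names, not the exact broker API.

```java
import java.util.Optional;

final class StopPartitionGuard {
    // Illustrative stand-in for RemoteLogManager; only the method name matches the issue.
    interface RemoteLogStopper {
        void stopPartition(String topic, int partition, boolean delete);
    }

    static void maybeStopRemotePartition(Optional<RemoteLogStopper> rlm,
                                         String topic, int partition, boolean delete,
                                         boolean isInternal, boolean remoteStorageEnabled) {
        // Skip internal topics (e.g. __consumer_offsets) and topics without tiered
        // storage, so stopPartition is only invoked where there is remote data to handle.
        if (isInternal || !remoteStorageEnabled) {
            return;
        }
        rlm.ifPresent(manager -> manager.stopPartition(topic, partition, delete));
    }
}
```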



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15432) RLM Stop partitions should not be invoked for non-tiered storage topics

2023-10-29 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15432:
--

Assignee: hudeqi

> RLM Stop partitions should not be invoked for non-tiered storage topics
> ---
>
> Key: KAFKA-15432
> URL: https://issues.apache.org/jira/browse/KAFKA-15432
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Kamal Chandraprakash
>Assignee: hudeqi
>Priority: Major
>
> When a stop partition request is sent by the controller, it invokes 
> RemoteLogManager#stopPartition even for internal and 
> non-tiered-storage-enabled topics. The replica manager should not call this 
> method for non-tiered-storage topics.
> See this 
> [comment|https://github.com/apache/kafka/pull/14329#discussion_r1315675510] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15300) Include remotelog size in complete log size and also add local log size and remote log size separately in kafka-log-dirs tool.

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780816#comment-17780816
 ] 

hudeqi commented on KAFKA-15300:


Can I take over this issue? [~satish.duggana] Can you describe this issue in 
detail?

> Include remotelog size in complete log size and also add local log size and 
> remote log size separately in kafka-log-dirs tool. 
> ---
>
> Key: KAFKA-15300
> URL: https://issues.apache.org/jira/browse/KAFKA-15300
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
> Fix For: 3.7.0
>
>
> Include remotelog size in complete log size and also add local log size and 
> remote log size separately in kafka-log-dirs tool. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15376) Explore options of removing data earlier to the current leader's leader epoch lineage for topics enabled with tiered storage.

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780815#comment-17780815
 ] 

hudeqi commented on KAFKA-15376:


Can I take over this issue? [~satish.duggana] 

> Explore options of removing data earlier to the current leader's leader epoch 
> lineage for topics enabled with tiered storage.
> -
>
> Key: KAFKA-15376
> URL: https://issues.apache.org/jira/browse/KAFKA-15376
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
> Fix For: 3.7.0
>
>
> Followup on the discussion thread:
> [https://github.com/apache/kafka/pull/13561#discussion_r1288778006]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in the RemoteLogManager

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780814#comment-17780814
 ] 

hudeqi commented on KAFKA-15038:


Hi, [~owen-leung]  Are you still following this issue? If you don't have time, 
I can take over. Thanks.

> Use topic id/name mapping from the Metadata cache in the RemoteLogManager
> -
>
> Key: KAFKA-15038
> URL: https://issues.apache.org/jira/browse/KAFKA-15038
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Alexandre Dupriez
>Assignee: Owen C.H. Leung
>Priority: Minor
> Fix For: 3.7.0
>
>
> Currently, the {{RemoteLogManager}} maintains its own cache of topic name to 
> topic id 
> [[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138]
>  using the information provided during leadership changes, and removing the 
> mapping upon receiving the notification of partition stopped.
> It should be possible to reuse the mapping in the broker's metadata cache, 
> removing the need for the RLM to build and update a local cache that 
> duplicates the information in the metadata cache. It also preserves a single 
> source of authority regarding the association between topic names and ids.
> [1] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138
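
A small sketch of the proposed direction, under the assumption of a hypothetical 
`MetadataView` interface standing in for the broker's metadata cache: the RLM 
would resolve topic ids on demand instead of maintaining its own map.

```java
import java.util.Optional;
import org.apache.kafka.common.Uuid;

final class TopicIdLookupSketch {
    // Hypothetical view over the broker's metadata cache.
    interface MetadataView {
        Optional<Uuid> topicId(String topicName);
    }

    static Uuid resolveTopicId(MetadataView metadata, String topicName) {
        // Single source of authority: no locally duplicated mapping to keep in sync.
        return metadata.topicId(topicName)
                .orElseThrow(() -> new IllegalStateException("Unknown topic: " + topicName));
    }
}
```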



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15331) Handle remote log enabled topic deletion when leader is not available

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780812#comment-17780812
 ] 

hudeqi commented on KAFKA-15331:


Hi, [~_gargantua_]  Are you still following this issue? If you don't have time, 
I can take over. Thanks.

> Handle remote log enabled topic deletion when leader is not available
> -
>
> Key: KAFKA-15331
> URL: https://issues.apache.org/jira/browse/KAFKA-15331
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kamal Chandraprakash
>Priority: Major
> Fix For: 3.7.0
>
>
> When a topic gets deleted, then there can be a case where all the replicas 
> can be out of ISR. This case is not handled, See: 
> [https://github.com/apache/kafka/pull/13947#discussion_r1289331347] for more 
> details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-12641) Clear RemoteLogLeaderEpochState entry when it become empty.

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780809#comment-17780809
 ] 

hudeqi commented on KAFKA-12641:


Hi, [~abhijeetkumar] Are you still following this issue? If you don't have 
time, I can take over. Thanks.

> Clear RemoteLogLeaderEpochState entry when it become empty. 
> 
>
> Key: KAFKA-12641
> URL: https://issues.apache.org/jira/browse/KAFKA-12641
> Project: Kafka
>  Issue Type: Bug
>Reporter: Satish Duggana
>Assignee: Abhijeet Kumar
>Priority: Major
>
> https://github.com/apache/kafka/pull/10218#discussion_r609895193



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13355) Shutdown broker eventually when unrecoverable exceptions like IOException encountered in RLMM.

2023-10-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780810#comment-17780810
 ] 

hudeqi commented on KAFKA-13355:


Hi, [~abhijeetkumar]  Are you still following this issue? If you don't have 
time, I can take over. Thanks.

> Shutdown broker eventually when unrecoverable exceptions like IOException 
> encountered in RLMM. 
> ---
>
> Key: KAFKA-13355
> URL: https://issues.apache.org/jira/browse/KAFKA-13355
> Project: Kafka
>  Issue Type: Bug
>Reporter: Satish Duggana
>Assignee: Abhijeet Kumar
>Priority: Major
>  Labels: tiered-storage
> Fix For: 3.7.0
>
>
> Have a mechanism to catch unrecoverable exceptions like IOException from RLMM 
> and shut down the broker, as is done in the log layer. 
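
A hedged sketch of the requested mechanism, with a hypothetical `RlmmOperation` 
call site: treat an IOException from the remote log metadata manager as 
unrecoverable and halt the broker, mirroring how the log layer reacts to fatal 
disk errors.

```java
import java.io.IOException;
import org.apache.kafka.common.utils.Exit;

final class RlmmFatalErrorSketch {
    // Hypothetical call-site abstraction; the real integration point would be the
    // broker code that invokes the RemoteLogMetadataManager.
    interface RlmmOperation {
        void run() throws IOException;
    }

    static void runOrHalt(RlmmOperation rlmmOperation) {
        try {
            rlmmOperation.run();
        } catch (IOException e) {
            // Treat the error as unrecoverable and stop the broker, similar in spirit
            // to how the log layer handles fatal disk errors.
            System.err.println("Unrecoverable RLMM error, halting broker: " + e);
            Exit.halt(1);
        }
    }
}
```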



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15671) Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache

2023-10-24 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15671:
--

Assignee: hudeqi

> Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache
> --
>
> Key: KAFKA-15671
> URL: https://issues.apache.org/jira/browse/KAFKA-15671
> Project: Kafka
>  Issue Type: Test
>  Components: Tiered-Storage
>Reporter: Divij Vaidya
>Assignee: hudeqi
>Priority: Major
>
> {{context: 
> [https://github.com/apache/kafka/pull/14483#issuecomment-1775107621] }}
> {{Example of failure in trunk (from 23rd Oct)}}
> {{[https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe%2FBerlin&tests.container=kafka.log.remote.RemoteIndexCacheTest&tests.test=testCorrectnessForCacheAndIndexFilesWhenResizeCache()]
>  }}
> {{Stack trace: 
> [https://ge.apache.org/s/ahssoveatyg6k/tests/task/:core:test/details/kafka.log.remote.RemoteIndexCacheTest/testCorrectnessForCacheAndIndexFilesWhenResizeCache()?expanded-stacktrace=WyIwIl0&top-execution=1]
>  }}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15671) Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache

2023-10-24 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779022#comment-17779022
 ] 

hudeqi commented on KAFKA-15671:


Hi [~brian030128], I saw that this issue needs a quick fix. I know this issue 
well and have already submitted a PR to run the CI tests.

> Flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache
> --
>
> Key: KAFKA-15671
> URL: https://issues.apache.org/jira/browse/KAFKA-15671
> Project: Kafka
>  Issue Type: Test
>  Components: Tiered-Storage
>Reporter: Divij Vaidya
>Assignee: Brian
>Priority: Major
>
> {{context: 
> [https://github.com/apache/kafka/pull/14483#issuecomment-1775107621] }}
> {{Example of failure in trunk (from 23rd Oct)}}
> {{[https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe%2FBerlin&tests.container=kafka.log.remote.RemoteIndexCacheTest&tests.test=testCorrectnessForCacheAndIndexFilesWhenResizeCache()]
>  }}
> {{Stack trace: 
> [https://ge.apache.org/s/ahssoveatyg6k/tests/task/:core:test/details/kafka.log.remote.RemoteIndexCacheTest/testCorrectnessForCacheAndIndexFilesWhenResizeCache()?expanded-stacktrace=WyIwIl0&top-execution=1]
>  }}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15396) Add a metric indicating the version of the current running kafka server

2023-10-20 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15396:
---
Affects Version/s: 3.6.0
   (was: 3.5.1)

> Add a metric indicating the version of the current running kafka server
> ---
>
> Key: KAFKA-15396
> URL: https://issues.apache.org/jira/browse/KAFKA-15396
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.6.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-required
>
> At present, there is no metric that exposes the Kafka version a broker is 
> running. If multiple Kafka versions are deployed in a cluster for various 
> reasons, it is difficult to get an intuitive view of the version 
> distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15607:
---
Attachment: SeaTalk_IMG_20231019_174340.png

> Possible NPE is thrown in MirrorCheckpointTask
> --
>
> Key: KAFKA-15607
> URL: https://issues.apache.org/jira/browse/KAFKA-15607
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.8.1, 3.6.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_20231019_172945.png, 
> SeaTalk_IMG_20231019_174340.png
>
>
> In the `syncGroupOffset` method, if 
> `targetConsumerOffset.get(topicPartition)` returns null, the calculation of 
> `latestDownstreamOffset` will throw an NPE. This usually occurs in the 
> following situation: a group previously consumed a topic in the target 
> cluster, and later the group offset of some partitions was reset to -1, so 
> the `OffsetAndMetadata` of these partitions is null.
> When offsets are reset with the Java Kafka client, a reset to -1 is 
> intercepted. However, some other clients, such as sarama, can still reset the 
> group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
> defensive measure to avoid the NPE is needed here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15607:
---
Affects Version/s: 3.7.0

> Possible NPE is thrown in MirrorCheckpointTask
> --
>
> Key: KAFKA-15607
> URL: https://issues.apache.org/jira/browse/KAFKA-15607
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.8.1, 3.7.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_20231019_172945.png
>
>
> In the `syncGroupOffset` method, if 
> `targetConsumerOffset.get(topicPartition)` returns null, the calculation of 
> `latestDownstreamOffset` will throw an NPE. This usually occurs in the 
> following situation: a group previously consumed a topic in the target 
> cluster, and later the group offset of some partitions was reset to -1, so 
> the `OffsetAndMetadata` of these partitions is null.
> When offsets are reset with the Java Kafka client, a reset to -1 is 
> intercepted. However, some other clients, such as sarama, can still reset the 
> group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
> defensive measure to avoid the NPE is needed here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15607:
---
Affects Version/s: 3.6.0
   (was: 3.7.0)

> Possible NPE is thrown in MirrorCheckpointTask
> --
>
> Key: KAFKA-15607
> URL: https://issues.apache.org/jira/browse/KAFKA-15607
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.8.1, 3.6.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_20231019_172945.png
>
>
> In the `syncGroupOffset` method, if 
> `targetConsumerOffset.get(topicPartition)` returns null, the calculation of 
> `latestDownstreamOffset` will throw an NPE. This usually occurs in the 
> following situation: a group previously consumed a topic in the target 
> cluster, and later the group offset of some partitions was reset to -1, so 
> the `OffsetAndMetadata` of these partitions is null.
> When offsets are reset with the Java Kafka client, a reset to -1 is 
> intercepted. However, some other clients, such as sarama, can still reset the 
> group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
> defensive measure to avoid the NPE is needed here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15607:
---
Attachment: SeaTalk_IMG_20231019_172945.png

> Possible NPE is thrown in MirrorCheckpointTask
> --
>
> Key: KAFKA-15607
> URL: https://issues.apache.org/jira/browse/KAFKA-15607
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_20231019_172945.png
>
>
> In the `syncGroupOffset` method, if 
> `targetConsumerOffset.get(topicPartition)` returns null, the calculation of 
> `latestDownstreamOffset` will throw an NPE. This usually occurs in the 
> following situation: a group previously consumed a topic in the target 
> cluster, and later the group offset of some partitions was reset to -1, so 
> the `OffsetAndMetadata` of these partitions is null.
> When offsets are reset with the Java Kafka client, a reset to -1 is 
> intercepted. However, some other clients, such as sarama, can still reset the 
> group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
> defensive measure to avoid the NPE is needed here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15607:
---
Description: 
In the `syncGroupOffset` method, if `targetConsumerOffset.get(topicPartition)` 
returns null, the calculation of `latestDownstreamOffset` will throw an NPE. 
This usually occurs in the following situation: a group previously consumed a 
topic in the target cluster, and later the group offset of some partitions was 
reset to -1, so the `OffsetAndMetadata` of these partitions is null.

When offsets are reset with the Java Kafka client, a reset to -1 is 
intercepted. However, some other clients, such as sarama, can still reset the 
group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
defensive measure to avoid the NPE is needed here.
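
A minimal sketch of the defensive check described above, using illustrative 
names rather than the actual MirrorCheckpointTask code: when the target 
cluster's committed offset for a partition is missing or was reset (its 
`OffsetAndMetadata` is null), skip it instead of dereferencing null.

```java
import java.util.Map;
import java.util.OptionalLong;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

final class CheckpointOffsetSketch {
    // Illustrative method; syncGroupOffset in MirrorCheckpointTask would apply the
    // same null check before using the downstream offset.
    static OptionalLong latestDownstreamOffset(Map<TopicPartition, OffsetAndMetadata> targetConsumerOffsets,
                                               TopicPartition topicPartition) {
        OffsetAndMetadata downstream = targetConsumerOffsets.get(topicPartition);
        if (downstream == null) {
            // e.g. the group offset was reset to -1 by a non-Java client such as sarama
            return OptionalLong.empty();
        }
        return OptionalLong.of(downstream.offset());
    }
}
```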

  was:In the `syncGroupOffset` method, if 
`targetConsumerOffset.get(topicPartition)` gets null, then the calculation of 
`latestDownstreamOffset` will throw NPE. This usually occurs in this situation: 
a group consumed a topic in the target cluster previously. Later, the group 
offset of some partitions was reset to -1, the `OffsetAndMetadata` of these 
partitions was null.


> Possible NPE is thrown in MirrorCheckpointTask
> --
>
> Key: KAFKA-15607
> URL: https://issues.apache.org/jira/browse/KAFKA-15607
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
>
> In the `syncGroupOffset` method, if 
> `targetConsumerOffset.get(topicPartition)` returns null, the calculation of 
> `latestDownstreamOffset` will throw an NPE. This usually occurs in the 
> following situation: a group previously consumed a topic in the target 
> cluster, and later the group offset of some partitions was reset to -1, so 
> the `OffsetAndMetadata` of these partitions is null.
> When offsets are reset with the Java Kafka client, a reset to -1 is 
> intercepted. However, some other clients, such as sarama, can still reset the 
> group offset to -1, so MM2 will trigger an NPE in this scenario. Therefore, a 
> defensive measure to avoid the NPE is needed here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15607) Possible NPE is thrown in MirrorCheckpointTask

2023-10-14 Thread hudeqi (Jira)
hudeqi created KAFKA-15607:
--

 Summary: Possible NPE is thrown in MirrorCheckpointTask
 Key: KAFKA-15607
 URL: https://issues.apache.org/jira/browse/KAFKA-15607
 Project: Kafka
  Issue Type: Bug
  Components: mirrormaker
Affects Versions: 2.8.1
Reporter: hudeqi
Assignee: hudeqi


In the `syncGroupOffset` method, if `targetConsumerOffset.get(topicPartition)` 
returns null, the calculation of `latestDownstreamOffset` will throw an NPE. 
This usually occurs in the following situation: a group previously consumed a 
topic in the target cluster, and later the group offset of some partitions was 
reset to -1, so the `OffsetAndMetadata` of these partitions is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15535) Add documentation of "remote.log.index.file.cache.total.size.bytes" configuration property.

2023-10-08 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772948#comment-17772948
 ] 

hudeqi commented on KAFKA-15535:


Can this be marked as resolved? The documentation for 
"remote.log.index.file.cache.total.size.bytes" is already available, the other 
tiered storage configurations have also been checked, and nothing seems to be 
missing. [~satish.duggana] 

> Add documentation of "remote.log.index.file.cache.total.size.bytes" 
> configuration property. 
> 
>
> Key: KAFKA-15535
> URL: https://issues.apache.org/jira/browse/KAFKA-15535
> Project: Kafka
>  Issue Type: Task
>  Components: documentation
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>  Labels: tiered-storage
> Fix For: 3.7.0
>
>
> Add documentation of "remote.log.index.file.cache.total.size.bytes" 
> configuration property. 
> Please double check all the existing public tiered storage configurations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15536) dynamically resize remoteIndexCache

2023-10-08 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15536:
---
Affects Version/s: 3.7.0
   (was: 3.6.0)

> dynamically resize remoteIndexCache
> ---
>
> Key: KAFKA-15536
> URL: https://issues.apache.org/jira/browse/KAFKA-15536
> Project: Kafka
>  Issue Type: Improvement
>  Components: Tiered-Storage
>Affects Versions: 3.7.0
>Reporter: Luke Chen
>Assignee: hudeqi
>Priority: Major
>
> context:
> https://github.com/apache/kafka/pull/14243#discussion_r1320630057



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15535) Add documentation of "remote.log.index.file.cache.total.size.bytes" configuration property.

2023-10-03 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15535:
--

Assignee: hudeqi

> Add documentation of "remote.log.index.file.cache.total.size.bytes" 
> configuration property. 
> 
>
> Key: KAFKA-15535
> URL: https://issues.apache.org/jira/browse/KAFKA-15535
> Project: Kafka
>  Issue Type: Task
>  Components: documentation
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
> Fix For: 3.7.0
>
>
> Add documentation of "remote.log.index.file.cache.total.size.bytes" 
> configuration property. 
> Please double check all the existing public tiered storage configurations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-09-11 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reopened KAFKA-15397:


> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When the client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. In the end, the entire Kafka service 
> goes OOM.
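
A hypothetical illustration of the cleanup idea discussed later in this thread 
(the "set-null operation on the leaked collection"): if deserialization fails 
partway through, null out the partially built per-topic data so a request 
object that stays referenced does not pin it on the heap. The classes below are 
made up for the sketch and are not the broker's real request types.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class ProduceRequestDataSketch {
    static final class TopicProduceData { /* placeholder for per-topic produce data */ }

    private List<TopicProduceData> topics;

    void parseFrom(ByteBuffer buffer, int topicCount) {
        topics = new ArrayList<>(topicCount);
        try {
            for (int i = 0; i < topicCount; i++) {
                if (!buffer.hasRemaining()) {
                    throw new IllegalArgumentException("Truncated produce request");
                }
                buffer.get();                 // stand-in for decoding real fields
                topics.add(new TopicProduceData());
            }
        } catch (IllegalArgumentException e) {
            topics = null;                    // drop the partially built collection
            throw e;
        }
    }
}
```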



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-09-08 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763002#comment-17763002
 ] 

hudeqi commented on KAFKA-14912:


Thanks, [~showuon]. The PR corresponding to this issue currently limits the 
cache based on byte size; please review it if you have time.

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-31 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761165#comment-17761165
 ] 

hudeqi commented on KAFKA-15397:


Yes, the analysis in the attachments is based on the heap dump.

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-31 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761149#comment-17761149
 ] 

hudeqi commented on KAFKA-15397:


I applied the fix from this PR in production and saw that the frequency of full 
GCs was greatly reduced, but they still occurred after a long running time. It 
was also difficult for me to simulate the client's abnormal sending behavior. 
Do you have any findings or suggestions? [~showuon] 

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760223#comment-17760223
 ] 

hudeqi commented on KAFKA-15397:


[~showuon] 
link: [link title|https://github.com/apache/kafka/commit/b401fdaefbfbd9263deef0ab92b6093eaeb26d9b]. 
In my own fix I not only applied the handling from this PR, but also set the 
leaked collection to null.
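Purely as an illustration of that pattern (the class, field and method names 
below are made up and are not the actual Kafka request code): when 
deserialization throws, the partially built collection is cleared and nulled so 
it cannot be retained.

import java.util.ArrayList;
import java.util.List;

public class LeakFixSketch {
    // Stands in for the collection that accumulated "TopicProduceData" objects.
    private List<Object> topicProduceData = new ArrayList<>();

    void parse(Iterable<byte[]> rawEntries) {
        try {
            for (byte[] raw : rawEntries) {
                topicProduceData.add(deserialize(raw)); // may throw IllegalArgumentException
            }
        } catch (IllegalArgumentException e) {
            // The "set-null" step mentioned above: drop what was built so far.
            topicProduceData.clear();
            topicProduceData = null;
            throw e;
        }
    }

    private Object deserialize(byte[] raw) {
        if (raw.length == 0) {
            throw new IllegalArgumentException("empty entry");
        }
        return new Object();
    }
}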

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-29 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi resolved KAFKA-15397.

Resolution: Resolved

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-29 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760204#comment-17760204
 ] 

hudeqi commented on KAFKA-15397:


It seems that the commit "MINOR: Add more validation during KRPC 
deserialization" can effectively avoid this memory overflow problem, so I am 
closing this issue for now.

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-29 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15397:
---
Affects Version/s: 2.8.1
   (was: 3.5.1)

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.8.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15396) Add a metric indicating the version of the current running kafka server

2023-08-28 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759514#comment-17759514
 ] 

hudeqi commented on KAFKA-15396:


Hi [~divijvaidya] [~jolshan], I have added the KIP-972 link to this Jira, thanks.

> Add a metric indicating the version of the current running kafka server
> ---
>
> Key: KAFKA-15396
> URL: https://issues.apache.org/jira/browse/KAFKA-15396
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.5.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-required
>
> At present there is no metric that exposes which Kafka version a broker is 
> running. If multiple Kafka versions are deployed in a cluster for various 
> reasons, it is difficult for us to get an intuitive view of the version 
> distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15396) Add a metric indicating the version of the current running kafka server

2023-08-28 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759492#comment-17759492
 ] 

hudeqi commented on KAFKA-15396:


Thanks, I will submit it here.

> Add a metric indicating the version of the current running kafka server
> ---
>
> Key: KAFKA-15396
> URL: https://issues.apache.org/jira/browse/KAFKA-15396
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.5.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-required
>
> At present there is no metric that exposes which Kafka version a broker is 
> running. If multiple Kafka versions are deployed in a cluster for various 
> reasons, it is difficult for us to get an intuitive view of the version 
> distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-23 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15397:
---
Attachment: SeaTalk_IMG_1692796505.png
SeaTalk_IMG_1692796533.png
SeaTalk_IMG_1692796604.png
SeaTalk_IMG_1692796891.png

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 3.5.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
> Attachments: SeaTalk_IMG_1692796505.png, SeaTalk_IMG_1692796533.png, 
> SeaTalk_IMG_1692796604.png, SeaTalk_IMG_1692796891.png
>
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-23 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15397:
---
Description: When a client sends a produce request in an abnormal way and the 
server accepts it for deserialization, a "java.lang.IllegalArgumentException" 
may occur, which causes a large number of "TopicProduceData" objects to be 
instantiated without being cleaned up. Eventually the whole Kafka service runs 
out of memory (OOM).

> Deserializing produce requests may cause memory leaks when exceptions occur
> ---
>
> Key: KAFKA-15397
> URL: https://issues.apache.org/jira/browse/KAFKA-15397
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 3.5.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Blocker
>
> When a client sends a produce request in an abnormal way and the server 
> accepts it for deserialization, a "java.lang.IllegalArgumentException" may 
> occur, which causes a large number of "TopicProduceData" objects to be 
> instantiated without being cleaned up. Eventually the whole Kafka service 
> runs out of memory (OOM).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15397) Deserializing produce requests may cause memory leaks when exceptions occur

2023-08-23 Thread hudeqi (Jira)
hudeqi created KAFKA-15397:
--

 Summary: Deserializing produce requests may cause memory leaks 
when exceptions occur
 Key: KAFKA-15397
 URL: https://issues.apache.org/jira/browse/KAFKA-15397
 Project: Kafka
  Issue Type: Bug
  Components: clients, core
Affects Versions: 3.5.1
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15396) Add a metric indicating the version of the current running kafka server

2023-08-23 Thread hudeqi (Jira)
hudeqi created KAFKA-15396:
--

 Summary: Add a metric indicating the version of the current 
running kafka server
 Key: KAFKA-15396
 URL: https://issues.apache.org/jira/browse/KAFKA-15396
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 3.5.1
Reporter: hudeqi
Assignee: hudeqi


At present there is no metric that exposes which Kafka version a broker is 
running. If multiple Kafka versions are deployed in a cluster for various 
reasons, it is difficult for us to get an intuitive view of the version 
distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-08-21 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756895#comment-17756895
 ] 

hudeqi commented on KAFKA-14912:


But how should we measure and obtain the size of each entry? [~divijvaidya] 

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters

2023-08-15 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754541#comment-17754541
 ] 

hudeqi commented on KAFKA-15172:


This KIP has received little discussion; what should I do next? [~mimaison] 

> Allow exact mirroring of ACLs between clusters
> --
>
> Key: KAFKA-15172
> URL: https://issues.apache.org/jira/browse/KAFKA-15172
> Project: Kafka
>  Issue Type: Task
>  Components: mirrormaker
>Reporter: Mickael Maison
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-965
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
> rationale is to prevent other clients from producing to remote topics. 
> However, in disaster recovery scenarios where the target cluster is not used 
> and is just a "hot standby", it would be preferable to have exactly the same 
> ACLs on both clusters to speed up failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-08-14 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754089#comment-17754089
 ] 

hudeqi commented on KAFKA-14912:


Hi [~divijvaidya], Caffeine does support a "max size by bytes". If I use the 
existing configuration "remote.log.index.file.cache.total.size.bytes", it will 
be troublesome to calculate the size of each entry, whereas a "max size by 
entry count" would be easier to implement. How do you think this should be 
implemented?
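For reference, a minimal sketch of the byte-bounded option using Caffeine's 
weigher. The Entry type and the size field here are placeholders, not the real 
RemoteIndexCache entry; the open question in the comment above is only how to 
compute that size per entry, the weigher itself is straightforward.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class ByteBoundedCacheSketch {
    // Placeholder for the cached index entry; only its size in bytes matters here.
    static final class Entry {
        final long sizeInBytes;
        Entry(long sizeInBytes) { this.sizeInBytes = sizeInBytes; }
    }

    public static void main(String[] args) {
        long maxBytes = 1024L * 1024L * 1024L; // e.g. remote.log.index.file.cache.total.size.bytes

        // Bound the cache by total weight (bytes) instead of by entry count.
        Cache<Long, Entry> cache = Caffeine.newBuilder()
                .maximumWeight(maxBytes)
                .weigher((Long key, Entry value) ->
                        (int) Math.min(value.sizeInBytes, Integer.MAX_VALUE))
                .build();

        cache.put(1L, new Entry(64 * 1024));
        System.out.println("estimated entries: " + cache.estimatedSize());
    }
}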

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters

2023-08-09 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752398#comment-17752398
 ] 

hudeqi commented on KAFKA-15172:


Hi, I have initiated a kip discussion: 
[https://lists.apache.org/thread/j17cdrpqxpn924cztdgxxm86qhf89sop] [~mimaison] 
[~ChrisEgerton] 

> Allow exact mirroring of ACLs between clusters
> --
>
> Key: KAFKA-15172
> URL: https://issues.apache.org/jira/browse/KAFKA-15172
> Project: Kafka
>  Issue Type: Task
>  Components: mirrormaker
>Reporter: Mickael Maison
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-965
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
> rationale is to prevent other clients from producing to remote topics. 
> However, in disaster recovery scenarios where the target cluster is not used 
> and is just a "hot standby", it would be preferable to have exactly the same 
> ACLs on both clusters to speed up failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-08-08 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-14912:
--

Assignee: hudeqi

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-08-08 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752229#comment-17752229
 ] 

hudeqi commented on KAFKA-14912:


OK, I'll take a look in the next few days, thanks.

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15172) Allow exact mirroring of ACLs between clusters

2023-08-08 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15172:
---
Labels: kip-965  (was: needs-kip)

> Allow exact mirroring of ACLs between clusters
> --
>
> Key: KAFKA-15172
> URL: https://issues.apache.org/jira/browse/KAFKA-15172
> Project: Kafka
>  Issue Type: Task
>  Components: mirrormaker
>Reporter: Mickael Maison
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-965
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
> rationale is to prevent other clients from producing to remote topics. 
> However, in disaster recovery scenarios where the target cluster is not used 
> and is just a "hot standby", it would be preferable to have exactly the same 
> ACLs on both clusters to speed up failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15129) Clean up all metrics that were forgotten to be closed

2023-07-19 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15129:
---
Description: 
In the current Kafka code there are still many module metrics that are not 
closed when the module stops, although some of them have been fixed, such as 
KAFKA-14866 and KAFKA-14868, etc.

These metric leaks carry a potential OOM risk. In addition, the unit and 
integration tests in the code contain a large number of `close` calls that do 
not remove the metrics, which also causes CI test instability. Cleaning up 
these leaked metrics eliminates these risks and improves the safety and 
stability of the code.

Here I will find all the metrics that are left unclosed in the current version 
and submit subtasks to fix them.

  was:
In the current Kafka code there are still many module metrics that are not 
closed when the module stops, although some of them have been fixed, such as 
KAFKA-14866 and KAFKA-14868, etc.
Here I will find all the metrics that are left unclosed in the current version 
and submit subtasks to fix them.


> Clean up all metrics that were forgotten to be closed
> -
>
> Key: KAFKA-15129
> URL: https://issues.apache.org/jira/browse/KAFKA-15129
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core, log
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> In the current Kafka code there are still many module metrics that are not 
> closed when the module stops, although some of them have been fixed, such as 
> KAFKA-14866 and KAFKA-14868, etc.
> These metric leaks carry a potential OOM risk. In addition, the unit and 
> integration tests in the code contain a large number of `close` calls that do 
> not remove the metrics, which also causes CI test instability. Cleaning up 
> these leaked metrics eliminates these risks and improves the safety and 
> stability of the code.
> Here I will find all the metrics that are left unclosed in the current 
> version and submit subtasks to fix them.
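A minimal sketch of the register-on-start / remove-on-close pairing being fixed 
here, written against the clients-side org.apache.kafka.common.metrics.Metrics 
registry. The broker-side code uses Yammer metrics instead, so the class and 
metric names below are only illustrative of the pattern, not the actual code.

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Measurable;
import org.apache.kafka.common.metrics.Metrics;

public class ComponentWithMetrics implements AutoCloseable {
    private final Metrics metrics;
    private final MetricName queueSize;

    ComponentWithMetrics(Metrics metrics) {
        this.metrics = metrics;
        this.queueSize = metrics.metricName("queue-size", "component-metrics");
        // Registered when the component starts...
        metrics.addMetric(queueSize, (Measurable) (config, now) -> 0.0);
    }

    @Override
    public void close() {
        // ...and removed when it stops. Forgetting this call is exactly the kind
        // of leak described above: the registry keeps the metric (and whatever it
        // references) alive after the component has shut down.
        metrics.removeMetric(queueSize);
    }
}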



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15194) Rename local tiered storage segment with offset as prefix for easy navigation

2023-07-18 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744148#comment-17744148
 ] 

hudeqi commented on KAFKA-15194:


Thanks [~divijvaidya]. I still don't know much about tiered storage yet, and I 
have already taken over another issue, so this one might not be completed 
quickly if I take it over as well.

> Rename local tiered storage segment with offset as prefix for easy navigation
> -
>
> Key: KAFKA-15194
> URL: https://issues.apache.org/jira/browse/KAFKA-15194
> Project: Kafka
>  Issue Type: Task
>Reporter: Kamal Chandraprakash
>Priority: Minor
>  Labels: newbie
>
> In LocalTieredStorage which is an implementation of RemoteStorageManager, 
> segments are saved with random UUID. This makes navigating to a particular 
> segment harder. To navigate a given segment by offset, prepend the offset 
> information to the segment filename.
> https://github.com/apache/kafka/pull/13837#discussion_r1258896009



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state

2023-07-18 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-14467:
--

Assignee: hudeqi

> Add a test to validate the replica state after processing the 
> OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
> --
>
> Key: KAFKA-14467
> URL: https://issues.apache.org/jira/browse/KAFKA-14467
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: hudeqi
>Priority: Major
>
> [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072]
>  
> https://github.com/apache/kafka/pull/11390/files#r1050045012



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state

2023-07-18 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744147#comment-17744147
 ] 

hudeqi commented on KAFKA-14467:


Thanks, I have taken it over for now; I'll look at it later.

> Add a test to validate the replica state after processing the 
> OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
> --
>
> Key: KAFKA-14467
> URL: https://issues.apache.org/jira/browse/KAFKA-14467
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
>
> [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072]
>  
> https://github.com/apache/kafka/pull/11390/files#r1050045012



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14467) Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state

2023-07-18 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744087#comment-17744087
 ] 

hudeqi commented on KAFKA-14467:


Isn't [~showuon] in charge of this issue?

> Add a test to validate the replica state after processing the 
> OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state
> --
>
> Key: KAFKA-14467
> URL: https://issues.apache.org/jira/browse/KAFKA-14467
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Priority: Major
>
> [https://github.com/apache/kafka/pull/11390#pullrequestreview-1210993072]
>  
> https://github.com/apache/kafka/pull/11390/files#r1050045012



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15172) Allow exact mirroring of ACLs between clusters

2023-07-13 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742994#comment-17742994
 ] 

hudeqi commented on KAFKA-15172:


Hi [~mimaison] [~ChrisEgerton], I am currently working on using MM2 for 
disaster recovery between active and standby clusters, and it is already 
applied in our production environment, so I am very interested in this issue. 
Could it be assigned to me? (If that is not suitable, feel free to reassign.)

For this problem, I think we should synchronize not only the topic write ACLs 
but also the group ACLs and even the user SCRAM credentials, right? I can 
initiate a KIP once this is confirmed, thanks!

> Allow exact mirroring of ACLs between clusters
> --
>
> Key: KAFKA-15172
> URL: https://issues.apache.org/jira/browse/KAFKA-15172
> Project: Kafka
>  Issue Type: Task
>  Components: mirrormaker
>Reporter: Mickael Maison
>Assignee: hudeqi
>Priority: Major
>  Labels: needs-kip
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
> rationale is to prevent other clients from producing to remote topics. 
> However, in disaster recovery scenarios where the target cluster is not used 
> and is just a "hot standby", it would be preferable to have exactly the same 
> ACLs on both clusters to speed up failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15172) Allow exact mirroring of ACLs between clusters

2023-07-13 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-15172:
--

Assignee: hudeqi

> Allow exact mirroring of ACLs between clusters
> --
>
> Key: KAFKA-15172
> URL: https://issues.apache.org/jira/browse/KAFKA-15172
> Project: Kafka
>  Issue Type: Task
>  Components: mirrormaker
>Reporter: Mickael Maison
>Assignee: hudeqi
>Priority: Major
>  Labels: needs-kip
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
> rationale is to prevent other clients from producing to remote topics. 
> However, in disaster recovery scenarios where the target cluster is not used 
> and is just a "hot standby", it would be preferable to have exactly the same 
> ACLs on both clusters to speed up failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14912) Introduce a configuration for remote index cache size, preferably a dynamic config.

2023-07-12 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742624#comment-17742624
 ] 

hudeqi commented on KAFKA-14912:


Hi [~divijvaidya] [~manyanda], I am also interested; if you are busy later, I 
can step in as backup support. :D

> Introduce a configuration for remote index cache size, preferably a dynamic 
> config.
> ---
>
> Key: KAFKA-14912
> URL: https://issues.apache.org/jira/browse/KAFKA-14912
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Manyanda Chitimbo
>Priority: Major
>
> Context: We need to make the 1024 value here [1] as dynamically configurable
> [1] 
> https://github.com/apache/kafka/blob/8d24716f27b307da79a819487aefb8dec79b4ca8/storage/src/main/java/org/apache/kafka/storage/internals/log/RemoteIndexCache.java#L119



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`

2023-07-10 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi resolved KAFKA-15139.

Resolution: Fixed

> Optimize the performance of `Set.removeAll(List)` in 
> `MirrorCheckpointConnector`
> 
>
> Key: KAFKA-15139
> URL: https://issues.apache.org/jira/browse/KAFKA-15139
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> This is the hint of `removeAll` method in `Set`:
> _This implementation determines which is the smaller of this set and the 
> specified collection, by invoking the size method on each. If this set has 
> fewer elements, then the implementation iterates over this set, checking each 
> element returned by the iterator in turn to see if it is contained in the 
> specified collection. If it is so contained, it is removed from this set with 
> the iterator's remove method. If the specified collection has fewer elements, 
> then the implementation iterates over the specified collection, removing from 
> this set each element returned by the iterator, using this set's remove 
> method._
> That's said, assume that _M_ is the number of elements in the set and _N_ is 
> the number of elements in the List. If the type of the specified collection 
> is `List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ 
> (because the time complexity of searching in a List is {_}O(N){_}); on the 
> contrary, if {_}N<M{_}, it searches in the `Set`, and the time complexity is 
> {_}O(N){_}.
> In `MirrorCheckpointConnector`, the `refreshConsumerGroups` method is called 
> repeatedly in a daemon thread and contains two `removeAll` calls. When the 
> number of groups in the source cluster simply increases or decreases in a 
> given round, both `removeAll` calls always hit the _O(MN)_ situation 
> mentioned above. Therefore, it is better to change all the variables here to 
> the Set type to avoid this "low performance".
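A minimal, self-contained sketch of the complexity difference described above; 
the class and variable names are illustrative, not the actual 
MirrorCheckpointConnector code. With sizes like these, the List variant should 
be visibly slower, matching the O(MN) analysis.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllDemo {
    public static void main(String[] args) {
        int n = 200_000;
        Set<Integer> knownGroups = new HashSet<>();
        List<Integer> refreshedAsList = new ArrayList<>();
        Set<Integer> refreshedAsSet = new HashSet<>();
        for (int i = 0; i < n; i++) {
            knownGroups.add(i);
            refreshedAsList.add(i + 1); // mostly overlapping, shifted by one
            refreshedAsSet.add(i + 1);
        }

        // Set.removeAll(List): the set is not larger than the argument, so each
        // set element is looked up via List.contains -> O(M*N) overall.
        Set<Integer> copy1 = new HashSet<>(knownGroups);
        long t0 = System.nanoTime();
        copy1.removeAll(refreshedAsList);
        long t1 = System.nanoTime();

        // Set.removeAll(Set): the same lookups are O(1), so the call is roughly O(M).
        Set<Integer> copy2 = new HashSet<>(knownGroups);
        long t2 = System.nanoTime();
        copy2.removeAll(refreshedAsSet);
        long t3 = System.nanoTime();

        System.out.printf("removeAll(List): %d ms, removeAll(Set): %d ms%n",
                (t1 - t0) / 1_000_000, (t3 - t2) / 1_000_000);
    }
}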



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`

2023-07-01 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15139:
---
Description: 
This is the hint of `removeAll` method in `Set`:

_This implementation determines which is the smaller of this set and the 
specified collection, by invoking the size method on each. If this set has 
fewer elements, then the implementation iterates over this set, checking each 
element returned by the iterator in turn to see if it is contained in the 
specified collection. If it is so contained, it is removed from this set with 
the iterator's remove method. If the specified collection has fewer elements, 
then the implementation iterates over the specified collection, removing from 
this set each element returned by the iterator, using this set's remove method._

That's said, assume that _M_ is the number of elements in the set and _N_ is 
the number of elements in the List. If the type of the specified collection is 
`List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ 
(because the time complexity of searching in a List is {_}O(N){_}); on the 
contrary, if {_}N<M{_}, it searches in the `Set`, and the time complexity is 
{_}O(N){_}.

In `MirrorCheckpointConnector`, the `refreshConsumerGroups` method is called 
repeatedly in a daemon thread and contains two `removeAll` calls. When the 
number of groups in the source cluster simply increases or decreases in a given 
round, both `removeAll` calls always hit the _O(MN)_ situation mentioned above. 
Therefore, it is better to change all the variables here to the Set type to 
avoid this "low performance".

> Optimize the performance of `Set.removeAll(List)` in 
> `MirrorCheckpointConnector`
> 
>
> Key: KAFKA-15139
> URL: https://issues.apache.org/jira/browse/KAFKA-15139
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> This is the hint of `removeAll` method in `Set`:
> _This implementation determines which is the smaller of this set and the 
> specified collection, by invoking the size method on each. If this set has 
> fewer elements, then the implementation iterates over this set, checking each 
> element returned by the iterator in turn to see if it is contained in the 
> specified collection. If it is so contained, it is removed from this set with 
> the iterator's remove method. If the specified collection has fewer elements, 
> then the implementation iterates over the specified collection, removing from 
> this set each element returned by the iterator, using this set's remove 
> method._
> That's said, assume that _M_ is the number of elements in the set and _N_ is 
> the number of elements in the List. If the type of the specified collection 
> is `List` and {_}M<=N{_}, then the time complexity of `removeAll` is _O(MN)_ 
> (because the time complexity of searching in a List is {_}O(N){_}); on the 
> contrary, if {_}N<M{_}, it searches in the `Set`, and the time complexity is 
> {_}O(N){_}.
> In `MirrorCheckpointConnector`, the `refreshConsumerGroups` method is called 
> repeatedly in a daemon thread and contains two `removeAll` calls. When the 
> number of groups in the source cluster simply increases or decreases in a 
> given round, both `removeAll` calls always hit the _O(MN)_ situation 
> mentioned above. Therefore, it is better to change all the variables here to 
> the Set type to avoid this "low performance".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`

2023-07-01 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15139:
---
Description: 
This is the hint of `removeAll` method in `Set`:

_This implementation determines which is the smaller of this set and the 
specified collection, by invoking the size method on each. If this set has 
fewer elements, then the implementation iterates over this set, checking each 
element returned by the iterator in turn to see if it is contained in the 
specified collection. If it is so contained, it is removed from this set with 
the iterator's remove method. If the specified collection has fewer elements, 
then the implementation iterates over the specified collection, removing from 
this set each element returned by the iterator, using this set's remove method._

That's said, assume that M is the number of elements in the set and N is the 
number of elements in the List. If the type of the specified collection is 
`List` and M<=N, then the time complexity of `removeAll` is O(MN) (because the 
time complexity of searching in a List is O(N)); on the contrary, if N<M, it 
will search in the `Set`, and the time complexity is O(N).

In 

> Optimize the performance of `Set.removeAll(List)` in 
> `MirrorCheckpointConnector`
> 
>
> Key: KAFKA-15139
> URL: https://issues.apache.org/jira/browse/KAFKA-15139
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> This is the hint of `removeAll` method in `Set`:
> _This implementation determines which is the smaller of this set and the 
> specified collection, by invoking the size method on each. If this set has 
> fewer elements, then the implementation iterates over this set, checking each 
> element returned by the iterator in turn to see if it is contained in the 
> specified collection. If it is so contained, it is removed from this set with 
> the iterator's remove method. If the specified collection has fewer elements, 
> then the implementation iterates over the specified collection, removing from 
> this set each element returned by the iterator, using this set's remove 
> method._
> That's said, assume that M is the number of elements in the set and N is the 
> number of elements in the List. If the type of the specified collection is 
> `List` and M<=N, then the time complexity of `removeAll` is O(MN) (because 
> the time complexity of searching in a List is O(N)); on the contrary, if N<M, 
> it will search in the `Set`, and the time complexity is O(N).
> In 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15139) Optimize the performance of `Set.removeAll(List)` in `MirrorCheckpointConnector`

2023-07-01 Thread hudeqi (Jira)
hudeqi created KAFKA-15139:
--

 Summary: Optimize the performance of `Set.removeAll(List)` in 
`MirrorCheckpointConnector`
 Key: KAFKA-15139
 URL: https://issues.apache.org/jira/browse/KAFKA-15139
 Project: Kafka
  Issue Type: Improvement
  Components: mirrormaker
Affects Versions: 3.5.0
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15134) Enrich the prompt reason in CommitFailedException

2023-06-29 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15134:
---
Attachment: WechatIMG31.jpeg

> Enrich the prompt reason in CommitFailedException
> -
>
> Key: KAFKA-15134
> URL: https://issues.apache.org/jira/browse/KAFKA-15134
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG31.jpeg
>
>
> Firstly, let me post the exception log of the client running:
> _"org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured max.poll.interval.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing max.poll.interval.ms or by reducing the 
> maximum size of batches returned in poll() with max.poll.records."_
> This is the client exception log provided to us by an online business team. 
> First, we confirmed that there is no issue on the server brokers. Based on 
> the exception message, one might conclude that the settings of 
> "max.poll.interval.ms" and "max.poll.records" are unreasonable, but the 
> processing time logged by the business was much shorter than 
> "max.poll.interval.ms", so we were sure that was not the reason. In the end 
> we found the following case: the business uses a "group id" to subscribe to 
> some partitions of "topic1" through the simple consumer mode, with 
> "auto.offset.reset=false" set. When the same "group id" is used to start a 
> high-level consumer on "topic2", the original service using the simple 
> consumer throws CommitFailedException. I have reproduced this process.
> In fact, this is not a bug but a problem with the way the client is used. 
> However, I think the exception message of `CommitFailedException` is 
> incomplete and potentially misleading, so I have enriched the message to 
> point out that such special situations may exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15134) Enrich the prompt reason in CommitFailedException

2023-06-29 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15134:
---
Description: 
Firstly, let me post the exception log of the client running:

_"org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
completed since the group has already rebalanced and assigned the partitions to 
another member. This means that the time between subsequent calls to poll() was 
longer than the configured max.poll.interval.ms, which typically implies that 
the poll loop is spending too much time message processing. You can address 
this either by increasing max.poll.interval.ms or by reducing the maximum size 
of batches returned in poll() with max.poll.records."_

This is the client exception log provided to us by an online business team. 
First, we confirmed that there is no issue on the server brokers. Based on the 
exception message, one might conclude that the settings of 
"max.poll.interval.ms" and "max.poll.records" are unreasonable, but the 
processing time logged by the business was much shorter than 
"max.poll.interval.ms", so we were sure that was not the reason. In the end we 
found the following case: the business uses a "group id" to subscribe to some 
partitions of "topic1" through the simple consumer mode, with 
"auto.offset.reset=false" set. When the same "group id" is used to start a 
high-level consumer on "topic2", the original service using the simple consumer 
throws CommitFailedException. I have reproduced this process.

In fact, this is not a bug but a problem with the way the client is used. 
However, I think the exception message of `CommitFailedException` is incomplete 
and potentially misleading, so I have enriched the message to point out that 
such special situations may exist.
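A minimal sketch that should reproduce the case described above. The broker 
address, topic names and the shared group id are placeholders; auto commit is 
disabled so that the "simple" consumer commits manually, and topic1 is assumed 
to have records so that there are offsets to commit.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class GroupIdClashSketch {
    static Properties props(String groupId) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("group.id", groupId);
        p.put("enable.auto.commit", "false");
        p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return p;
    }

    public static void main(String[] args) {
        // "Simple" consumer: manual assignment on topic1, using the shared group.id.
        KafkaConsumer<String, String> simple = new KafkaConsumer<>(props("shared-group"));
        simple.assign(List.of(new TopicPartition("topic1", 0)));
        simple.poll(Duration.ofSeconds(1)); // consume something so there are offsets to commit

        // High-level consumer: subscribe() on topic2 with the same group.id.
        KafkaConsumer<String, String> highLevel = new KafkaConsumer<>(props("shared-group"));
        highLevel.subscribe(List.of("topic2"));
        highLevel.poll(Duration.ofSeconds(1)); // joins the group and bumps the generation

        simple.poll(Duration.ofSeconds(1));
        simple.commitSync(); // may now fail with CommitFailedException

        simple.close();
        highLevel.close();
    }
}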

> Enrich the prompt reason in CommitFailedException
> -
>
> Key: KAFKA-15134
> URL: https://issues.apache.org/jira/browse/KAFKA-15134
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> Firstly, let me post the exception log of the client running:
> _"org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured max.poll.interval.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing max.poll.interval.ms or by reducing the 
> maximum size of batches returned in poll() with max.poll.records."_
> This is the client exception log provided to us by an online business team. 
> First, we confirmed that there is no issue on the server brokers. Based on 
> the exception message, one might conclude that the settings of 
> "max.poll.interval.ms" and "max.poll.records" are unreasonable, but the 
> processing time logged by the business was much shorter than 
> "max.poll.interval.ms", so we were sure that was not the reason. In the end 
> we found the following case: the business uses a "group id" to subscribe to 
> some partitions of "topic1" through the simple consumer mode, with 
> "auto.offset.reset=false" set. When the same "group id" is used to start a 
> high-level consumer on "topic2", the original service using the simple 
> consumer throws CommitFailedException. I have reproduced this process.
> In fact, this is not a bug but a problem with the way the client is used. 
> However, I think the exception message of `CommitFailedException` is 
> incomplete and potentially misleading, so I have enriched the message to 
> point out that such special situations may exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15134) Enrich the prompt reason in CommitFailedException

2023-06-29 Thread hudeqi (Jira)
hudeqi created KAFKA-15134:
--

 Summary: Enrich the prompt reason in CommitFailedException
 Key: KAFKA-15134
 URL: https://issues.apache.org/jira/browse/KAFKA-15134
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Affects Versions: 3.5.0
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15129) Clean up all metrics that were forgotten to be closed

2023-06-28 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15129:
---
Description: 
In the current Kafka code there are still many module metrics that are not 
closed when the module stops, although some of them have been fixed, such as 
KAFKA-14866 and KAFKA-14868, etc.
Here I will find all the metrics that are left unclosed in the current version 
and submit subtasks to fix them.

> Clean up all metrics that were forgotten to be closed
> -
>
> Key: KAFKA-15129
> URL: https://issues.apache.org/jira/browse/KAFKA-15129
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core, log
>Affects Versions: 3.5.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> In the current Kafka code there are still many module metrics that are not 
> closed when the module stops, although some of them have been fixed, such as 
> KAFKA-14866 and KAFKA-14868, etc.
> Here I will find all the metrics that are left unclosed in the current 
> version and submit subtasks to fix them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15129) Clean up all metrics that were forgotten to be closed

2023-06-28 Thread hudeqi (Jira)
hudeqi created KAFKA-15129:
--

 Summary: Clean up all metrics that were forgotten to be closed
 Key: KAFKA-15129
 URL: https://issues.apache.org/jira/browse/KAFKA-15129
 Project: Kafka
  Issue Type: Improvement
  Components: controller, core, log
Affects Versions: 3.5.0
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-27 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15086:
---
Labels: kip-943  (was: )

> The unreasonable segment size setting of the internal topics in MM2 may cause 
> the worker startup time to be too long
> 
>
> Key: KAFKA-15086
> URL: https://issues.apache.org/jira/browse/KAFKA-15086
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>  Labels: kip-943
> Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg
>
>
> If the config 'segment.bytes' of the MM2-related internal topics (such as 
> offset.storage.topic, config.storage.topic and status.storage.topic) follows 
> the broker default or is set even larger, then when the MM cluster runs many 
> complicated tasks, and especially when the log volume of the topic 
> 'offset.storage.topic' is very large, the restart of the MM workers is slowed 
> down.
> After investigation, the reason is that a consumer needs to be started at 
> startup to read the data of 'offset.storage.topic'. Although this topic is 
> set to compact, if the 'segment size' is set to a large value, such as the 
> default of 1 GB, the topic may hold tens of gigabytes of data that cannot be 
> compacted and has to be read from the earliest offset (because the active 
> segment cannot be cleaned), which consumes a lot of time. In our online 
> environment this topic stored 13 GB of data and it took nearly half an hour 
> for all of it to be consumed, so the worker was unable to start and execute 
> tasks for a long time.
> Of course, the number of consumer threads could also be adjusted, but I think 
> it is easier to reduce the 'segment size', for example by following the 
> default value of __consumer_offsets: 100 MB.
> 
> The first picture in the attachment shows the log size stored in the internal 
> topics, the second shows when reading of 'offset.storage.topic' starts, and 
> the third shows when reading of 'offset.storage.topic' finishes. It took 
> about 23 minutes in total.
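For reference, a sketch of shrinking 'segment.bytes' on an existing offset 
storage topic with the Admin client. The topic name "mm2-offsets" and the 
broker address are placeholders; use whatever offset.storage.topic is actually 
configured to in your deployment.

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ShrinkOffsetTopicSegments {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Placeholder name; substitute the configured offset.storage.topic.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "mm2-offsets");
            AlterConfigOp setSegmentBytes = new AlterConfigOp(
                    new ConfigEntry("segment.bytes", String.valueOf(100 * 1024 * 1024)), // 100 MB, like __consumer_offsets
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, List.of(setSegmentBytes));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}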



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15119) Support incremental synchronization of topicAcl in MirrorSourceConnector

2023-06-25 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15119:
---
Description: 
In the "syncTopicAcls" thread of MirrorSourceConnector, the full set of 
"TopicAclBindings" related to the replicated topics of the source cluster is 
listed periodically and then pushed in full to the target cluster. As a result, 
a large number of identical "TopicAclBindings" are sent repeatedly via 
"targetAdminClient", which is redundant. In addition, if too many 
"TopicAclBindings" are updated at once, the target cluster may take a long time 
to process the "createAcls" request, which causes the request queue of the 
target cluster to back up and in turn increases the processing latency of other 
request types.

Therefore, "TopicAclBinding" can be handled like the variable 
"knownConsumerGroups" in MirrorCheckpointConnector: only the newly added 
"TopicAclBindings" are pushed each time, which solves the problems described 
above.

> Support incremental synchronization of topicAcl in MirrorSourceConnector
> 
>
> Key: KAFKA-15119
> URL: https://issues.apache.org/jira/browse/KAFKA-15119
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> In the "syncTopicAcls" thread of MirrorSourceConnector, the full set of 
> "TopicAclBindings" related to the replicated topics of the source cluster is 
> listed periodically and then pushed in full to the target cluster. As a 
> result, a large number of identical "TopicAclBindings" are sent repeatedly 
> via "targetAdminClient", which is redundant. In addition, if too many 
> "TopicAclBindings" are updated at once, the target cluster may take a long 
> time to process the "createAcls" request, which causes the request queue of 
> the target cluster to back up and in turn increases the processing latency of 
> other request types.
> Therefore, "TopicAclBinding" can be handled like the variable 
> "knownConsumerGroups" in MirrorCheckpointConnector: only the newly added 
> "TopicAclBindings" are pushed each time, which solves the problems described 
> above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15119) Support incremental synchronization of topicAcl in MirrorSourceConnector

2023-06-25 Thread hudeqi (Jira)
hudeqi created KAFKA-15119:
--

 Summary: Support incremental synchronization of topicAcl in 
MirrorSourceConnector
 Key: KAFKA-15119
 URL: https://issues.apache.org/jira/browse/KAFKA-15119
 Project: Kafka
  Issue Type: Improvement
  Components: mirrormaker
Affects Versions: 3.4.1
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735644#comment-17735644
 ] 

hudeqi commented on KAFKA-15110:


Leaving the exception log here so that others can find this problem:

 

[2023-06-21 18:00:59,186] INFO Metrics reporters closed 
(org.apache.kafka.common.metrics.Metrics)
[2023-06-21 18:00:59,188] INFO Broker and topic stats closed 
(kafka.server.BrokerTopicStats)
[2023-06-21 18:00:59,190] INFO App info kafka.server for 0 unregistered 
(org.apache.kafka.common.utils.AppInfoParser)
[2023-06-21 18:00:59,190] INFO [KafkaServer id=0] shut down completed 
(kafka.server.KafkaServer)
[2023-06-21 18:00:59,190] ERROR Exiting Kafka due to fatal exception during 
startup. (kafka.Kafka$)
java.lang.NoSuchMethodError: 'void 
org.apache.kafka.storage.internals.log.ProducerStateManagerConfig.<init>(int)'
        at kafka.log.LogManager$.apply(LogManager.scala:1394)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:279)
        at kafka.Kafka$.main(Kafka.scala:113)
        at kafka.Kafka.main(Kafka.scala)
[2023-06-21 18:00:59,190] INFO [KafkaServer id=0] shutting down 
(kafka.server.KafkaServer)

> Wrong version may be run, which will cause to fail to run when there are 
> multiple version jars under core/build/libs
> 
>
> Key: KAFKA-15110
> URL: https://issues.apache.org/jira/browse/KAFKA-15110
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg
>
>
> For example, when I build a jar through './gradlew jar' on a 3.5.0 branch 
> and then switch to a 3.6.0 branch and build again, there will be two versions 
> of the jars in "core/build/libs", since that directory is git-ignored. When I 
> then start the Kafka process from the local bin dir, it fails to start 
> because the wrong version is picked up.
> The reason is that /kafka-run-class.sh adds all jar packages to the CLASSPATH 
> by default, which is unreasonable behavior.
> For details, see the attached screenshots below:
> Figure 1 shows the abnormal exit caused by the missing 
> ProducerStateManagerConfig(int) constructor, which is defined in version 
> 3.5.0, while the constructor of ProducerStateManagerConfig in version 3.6.0 
> takes two parameters.
> Figure 2 is the printed value of CLASSPATH.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15110:
---
Attachment: WechatIMG29.jpeg

> Wrong version may be run, which will cause to fail to run when there are 
> multiple version jars under core/build/libs
> 
>
> Key: KAFKA-15110
> URL: https://issues.apache.org/jira/browse/KAFKA-15110
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg
>
>
> For example, when I build a jar with './gradlew jar' on a 3.5.0 branch and 
> then switch to a 3.6.0 branch and build again, both versions of the jars end 
> up under "core/build/libs" (the directory is gitignored, so the old jars are 
> not cleaned up). When I then start the Kafka process from the local bin dir, 
> it fails to start because the wrong version is picked up.
> The reason is that kafka-run-class.sh puts every jar in that directory onto 
> CLASSPATH by default, which is unreasonable behavior.
> For details, see the attached screenshots:
> Figure 1 shows the abnormal exit caused by the missing 
> ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, 
> whereas the constructor of ProducerStateManagerConfig in version 3.6.0 takes 
> two parameters.
> Figure 2 shows the printed value of CLASSPATH.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15110:
---
Attachment: WechatIMG28.jpeg

> Wrong version may be run, which will cause to fail to run when there are 
> multiple version jars under core/build/libs
> 
>
> Key: KAFKA-15110
> URL: https://issues.apache.org/jira/browse/KAFKA-15110
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG28.jpeg, WechatIMG29.jpeg
>
>
> For example, when I build a jar with './gradlew jar' on a 3.5.0 branch and 
> then switch to a 3.6.0 branch and build again, both versions of the jars end 
> up under "core/build/libs" (the directory is gitignored, so the old jars are 
> not cleaned up). When I then start the Kafka process from the local bin dir, 
> it fails to start because the wrong version is picked up.
> The reason is that kafka-run-class.sh puts every jar in that directory onto 
> CLASSPATH by default, which is unreasonable behavior.
> For details, see the attached screenshots:
> Figure 1 shows the abnormal exit caused by the missing 
> ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, 
> whereas the constructor of ProducerStateManagerConfig in version 3.6.0 takes 
> two parameters.
> Figure 2 shows the printed value of CLASSPATH.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15110:
---
Description: 
For example, when I build a jar with './gradlew jar' on a 3.5.0 branch and 
then switch to a 3.6.0 branch and build again, both versions of the jars end 
up under "core/build/libs" (the directory is gitignored, so the old jars are 
not cleaned up). When I then start the Kafka process from the local bin dir, 
it fails to start because the wrong version is picked up.
The reason is that kafka-run-class.sh puts every jar in that directory onto 
CLASSPATH by default, which is unreasonable behavior.

For details, see the attached screenshots:

Figure 1 shows the abnormal exit caused by the missing 
ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, 
whereas the constructor of ProducerStateManagerConfig in version 3.6.0 takes 
two parameters.

Figure 2 shows the printed value of CLASSPATH.

> Wrong version may be run, which will cause to fail to run when there are 
> multiple version jars under core/build/libs
> 
>
> Key: KAFKA-15110
> URL: https://issues.apache.org/jira/browse/KAFKA-15110
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> For example, when I build a jar with './gradlew jar' on a 3.5.0 branch and 
> then switch to a 3.6.0 branch and build again, both versions of the jars end 
> up under "core/build/libs" (the directory is gitignored, so the old jars are 
> not cleaned up). When I then start the Kafka process from the local bin dir, 
> it fails to start because the wrong version is picked up.
> The reason is that kafka-run-class.sh puts every jar in that directory onto 
> CLASSPATH by default, which is unreasonable behavior.
> For details, see the attached screenshots:
> Figure 1 shows the abnormal exit caused by the missing 
> ProducerStateManagerConfig(int) constructor, which exists in version 3.5.0, 
> whereas the constructor of ProducerStateManagerConfig in version 3.6.0 takes 
> two parameters.
> Figure 2 shows the printed value of CLASSPATH.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15110) Wrong version may be run, which will cause to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15110:
---
Summary: Wrong version may be run, which will cause to fail to run when 
there are multiple version jars under core/build/libs  (was: Wrong version may 
be run, which will cause it to fail to run when there are multiple version jars 
under core/build/libs)

> Wrong version may be run, which will cause to fail to run when there are 
> multiple version jars under core/build/libs
> 
>
> Key: KAFKA-15110
> URL: https://issues.apache.org/jira/browse/KAFKA-15110
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15110) Wrong version may be run, which will cause it to fail to run when there are multiple version jars under core/build/libs

2023-06-21 Thread hudeqi (Jira)
hudeqi created KAFKA-15110:
--

 Summary: Wrong version may be run, which will cause it to fail to 
run when there are multiple version jars under core/build/libs
 Key: KAFKA-15110
 URL: https://issues.apache.org/jira/browse/KAFKA-15110
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 3.4.1
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-14 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15086:
---
Description: 
If the config 'segment.bytes' of the MM2-related internal topics 
(offset.storage.topic, config.storage.topic, status.storage.topic) follows the 
broker default or is set even larger, then when the MM cluster runs many 
complicated tasks, and especially when the log volume of 
'offset.storage.topic' is very large, the restart speed of the MM workers 
suffers.

After investigation, the reason is that at startup a consumer has to read all 
the data of 'offset.storage.topic'. Although this topic is compacted, if the 
segment size is large, such as the default of 1 GB, the topic may hold tens of 
gigabytes of data that cannot be compacted (the active segment cannot be 
cleaned) and has to be read from the earliest offset, which takes a long time. 
In our online environment this topic stored 13 GB of data and it took nearly 
half an hour to consume it all, so the worker could not start and execute 
tasks for a long time.
Of course, the number of consumer threads could also be increased, but I think 
it is easier to reduce the segment size, for example to the default of 
__consumer_offsets: 100 MB.

The first picture in the attachment is the log size stored in the internal 
topic, the second one is the time when 'offset.storage.topic' starts to be 
read, and the third one is the time when it finishes being read. It took about 
23 minutes in total.

  was:
If the config 'segment.bytes' of the MM2-related internal topics 
(offset.storage.topic, config.storage.topic, status.storage.topic) follows the 
broker default or is set even larger, then when the MM cluster runs many 
complicated tasks, and especially when the log volume of 
'offset.storage.topic' is very large, the restart speed of the MM workers 
suffers.

After investigation, the reason is that at startup a consumer has to read all 
the data of 'offset.storage.topic'. Although this topic is compacted, if the 
segment size is large, such as the default of 1 GB, the topic may hold tens of 
gigabytes of data that cannot be compacted (the active segment cannot be 
cleaned) and has to be read from the earliest offset, which takes a long time. 
In our online environment this topic stored 13 GB of data and it took nearly 
half an hour to consume it all, so the worker could not start and execute 
tasks for a long time.
Of course, the number of consumer threads could also be increased, but I think 
it is easier to reduce the segment size, for example to the default of 
__consumer_offsets: 100 MB.


> The unreasonable segment size setting of the internal topics in MM2 may cause 
> the worker startup time to be too long
> 
>
> Key: KAFKA-15086
> URL: https://issues.apache.org/jira/browse/KAFKA-15086
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg
>
>
> If the config 'segment.bytes' of the MM2-related internal topics 
> (offset.storage.topic, config.storage.topic, status.storage.topic) follows 
> the broker default or is set even larger, then when the MM cluster runs many 
> complicated tasks, and especially when the log volume of 
> 'offset.storage.topic' is very large, the restart speed of the MM workers 
> suffers.
> After investigation, the reason is that at startup a consumer has to read all 
> the data of 'offset.storage.topic'. Although this topic is compacted, if the 
> segment size is large, such as the default of 1 GB, the topic may hold tens 
> of gigabytes of data that cannot be compacted (the active segment cannot be 
> cleaned) and has to be read from the earliest offset, which takes a long 
> time. In our online environment this topic stored 13 GB of data and it took 
> nearly half an hour to consume it all, so the worker could not start and 
> execute tasks for a long time.
> Of course, the number of consumer threads could also be increased, but I 
> think it is easier to reduce the segment size, for example to the default of 
> __consumer_offsets: 100 MB.
>  
> The first picture in the attachment is the log size stored in the internal 
> topic, the second one is the time when 'offset.storage.topic' starts to be 
> read, and the third one is the time when it finishes being read. It took 
> about 23 minutes in total.

[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-14 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15086:
---
Attachment: WechatIMG366.jpeg

> The unreasonable segment size setting of the internal topics in MM2 may cause 
> the worker startup time to be too long
> 
>
> Key: KAFKA-15086
> URL: https://issues.apache.org/jira/browse/KAFKA-15086
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg, WechatIMG366.jpeg
>
>
> If the config 'segment.bytes' of the MM2-related internal topics 
> (offset.storage.topic, config.storage.topic, status.storage.topic) follows 
> the broker default or is set even larger, then when the MM cluster runs many 
> complicated tasks, and especially when the log volume of 
> 'offset.storage.topic' is very large, the restart speed of the MM workers 
> suffers.
> After investigation, the reason is that at startup a consumer has to read all 
> the data of 'offset.storage.topic'. Although this topic is compacted, if the 
> segment size is large, such as the default of 1 GB, the topic may hold tens 
> of gigabytes of data that cannot be compacted (the active segment cannot be 
> cleaned) and has to be read from the earliest offset, which takes a long 
> time. In our online environment this topic stored 13 GB of data and it took 
> nearly half an hour to consume it all, so the worker could not start and 
> execute tasks for a long time.
> Of course, the number of consumer threads could also be increased, but I 
> think it is easier to reduce the segment size, for example to the default of 
> __consumer_offsets: 100 MB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-14 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15086:
---
Attachment: WechatIMG365.jpeg

> The unreasonable segment size setting of the internal topics in MM2 may cause 
> the worker startup time to be too long
> 
>
> Key: KAFKA-15086
> URL: https://issues.apache.org/jira/browse/KAFKA-15086
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg
>
>
> If the config 'segment.bytes' of the MM2-related internal topics 
> (offset.storage.topic, config.storage.topic, status.storage.topic) follows 
> the broker default or is set even larger, then when the MM cluster runs many 
> complicated tasks, and especially when the log volume of 
> 'offset.storage.topic' is very large, the restart speed of the MM workers 
> suffers.
> After investigation, the reason is that at startup a consumer has to read all 
> the data of 'offset.storage.topic'. Although this topic is compacted, if the 
> segment size is large, such as the default of 1 GB, the topic may hold tens 
> of gigabytes of data that cannot be compacted (the active segment cannot be 
> cleaned) and has to be read from the earliest offset, which takes a long 
> time. In our online environment this topic stored 13 GB of data and it took 
> nearly half an hour to consume it all, so the worker could not start and 
> execute tasks for a long time.
> Of course, the number of consumer threads could also be increased, but I 
> think it is easier to reduce the segment size, for example to the default of 
> __consumer_offsets: 100 MB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-14 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15086:
---
Attachment: WechatIMG364.jpeg

> The unreasonable segment size setting of the internal topics in MM2 may cause 
> the worker startup time to be too long
> 
>
> Key: KAFKA-15086
> URL: https://issues.apache.org/jira/browse/KAFKA-15086
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.4.1
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Attachments: WechatIMG364.jpeg, WechatIMG365.jpeg
>
>
> If the config 'segment.bytes' of the MM2-related internal topics 
> (offset.storage.topic, config.storage.topic, status.storage.topic) follows 
> the broker default or is set even larger, then when the MM cluster runs many 
> complicated tasks, and especially when the log volume of 
> 'offset.storage.topic' is very large, the restart speed of the MM workers 
> suffers.
> After investigation, the reason is that at startup a consumer has to read all 
> the data of 'offset.storage.topic'. Although this topic is compacted, if the 
> segment size is large, such as the default of 1 GB, the topic may hold tens 
> of gigabytes of data that cannot be compacted (the active segment cannot be 
> cleaned) and has to be read from the earliest offset, which takes a long 
> time. In our online environment this topic stored 13 GB of data and it took 
> nearly half an hour to consume it all, so the worker could not start and 
> execute tasks for a long time.
> Of course, the number of consumer threads could also be increased, but I 
> think it is easier to reduce the segment size, for example to the default of 
> __consumer_offsets: 100 MB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15086) The unreasonable segment size setting of the internal topics in MM2 may cause the worker startup time to be too long

2023-06-14 Thread hudeqi (Jira)
hudeqi created KAFKA-15086:
--

 Summary: The unreasonable segment size setting of the internal 
topics in MM2 may cause the worker startup time to be too long
 Key: KAFKA-15086
 URL: https://issues.apache.org/jira/browse/KAFKA-15086
 Project: Kafka
  Issue Type: Improvement
  Components: mirrormaker
Affects Versions: 3.4.1
Reporter: hudeqi
Assignee: hudeqi


If the config 'segment.bytes' of the MM2-related internal topics 
(offset.storage.topic, config.storage.topic, status.storage.topic) follows the 
broker default or is set even larger, then when the MM cluster runs many 
complicated tasks, and especially when the log volume of 
'offset.storage.topic' is very large, the restart speed of the MM workers 
suffers.

After investigation, the reason is that at startup a consumer has to read all 
the data of 'offset.storage.topic'. Although this topic is compacted, if the 
segment size is large, such as the default of 1 GB, the topic may hold tens of 
gigabytes of data that cannot be compacted (the active segment cannot be 
cleaned) and has to be read from the earliest offset, which takes a long time. 
In our online environment this topic stored 13 GB of data and it took nearly 
half an hour to consume it all, so the worker could not start and execute 
tasks for a long time.
Of course, the number of consumer threads could also be increased, but I think 
it is easier to reduce the segment size, for example to the default of 
__consumer_offsets: 100 MB.
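Until the internal topics are created with a smaller segment size, the setting 
can also be shrunk on an existing deployment with the Admin client. A minimal 
sketch follows; the topic names below are the usual MM2 defaults for a source 
alias named "source" and may differ in a real deployment.

{code:java}
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Sketch: lower segment.bytes of the MM2 internal topics to 100 MB, the same
// default that __consumer_offsets uses (offsets.topic.segment.bytes).
public class ShrinkMm2SegmentBytes {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            for (String topic : Arrays.asList("mm2-offsets.source.internal",
                                              "mm2-configs.source.internal",
                                              "mm2-status.source.internal")) {
                ConfigEntry entry = new ConfigEntry("segment.bytes", String.valueOf(100 * 1024 * 1024));
                updates.put(new ConfigResource(ConfigResource.Type.TOPIC, topic),
                        Arrays.asList(new AlterConfigOp(entry, AlterConfigOp.OpType.SET)));
            }
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
{code}

Note that this only affects newly rolled segments; the current active segment 
still has to roll before it becomes eligible for compaction.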



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead

2023-06-12 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-9613:
--
Affects Version/s: 2.6.2

> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> 
>
> Key: KAFKA-9613
> URL: https://issues.apache.org/jira/browse/KAFKA-9613
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.6.2
>Reporter: Amit Khandelwal
>Assignee: hudeqi
>Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group y_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead

2023-06-12 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-9613:
--
Component/s: core

> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> 
>
> Key: KAFKA-9613
> URL: https://issues.apache.org/jira/browse/KAFKA-9613
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.6.2
>Reporter: Amit Khandelwal
>Assignee: hudeqi
>Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group y_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead

2023-06-12 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-9613:
-

Assignee: hudeqi

> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> 
>
> Key: KAFKA-9613
> URL: https://issues.apache.org/jira/browse/KAFKA-9613
> Project: Kafka
>  Issue Type: Bug
>Reporter: Amit Khandelwal
>Assignee: hudeqi
>Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group y_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-6668) Broker crashes on restart ,got a CorruptRecordException: Record size is smaller than minimum record overhead(14)

2023-06-12 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-6668:
-

Assignee: hudeqi

> Broker crashes on restart ,got a CorruptRecordException: Record size is 
> smaller than minimum record overhead(14)
> 
>
> Key: KAFKA-6668
> URL: https://issues.apache.org/jira/browse/KAFKA-6668
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.11.0.1
> Environment: Linux version :
> 3.10.0-514.26.2.el7.x86_64 (mockbuild@cgslv5.buildsys213) (gcc version 4.8.5 
> 20150623 (Red Hat 4.8.5-11)
> docker version:
> Client:
>  Version:  1.12.6
>  API version:  1.24
>  Go version:   go1.7.5
>  Git commit:   ad3fef854be2172454902950240a9a9778d24345
>  Built:Mon Jan 15 22:01:14 2018
>  OS/Arch:  linux/amd64
>Reporter: little brother ma
>Assignee: hudeqi
>Priority: Major
>
> There is a Kafka cluster with a broker running in a docker container. Because 
> the disk is full, the container crashes, gets restarted, and crashes again...
> Log when the disk is full:
>  
> {code:java}
> [2018-03-14 00:11:40,764] INFO Rolled new log segment for 'oem-debug-log-1' 
> in 1 ms. (kafka.log.Log) [2018-03-14 00:11:40,765] ERROR Uncaught exception 
> in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) 
> java.io.IOException: I/O error at sun.nio.ch.FileDispatcherImpl.force0(Native 
> Method) at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) at 
> sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) at 
> org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:162) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:377) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at 
> kafka.log.LogSegment.flush(LogSegment.scala:376) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1312) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1311) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> kafka.log.Log.flush(Log.scala:1311) at 
> kafka.log.Log$$anonfun$roll$1.apply$mcV$sp(Log.scala:1283) at 
> kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) 
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) [2018-03-14 00:11:56,514] ERROR 
> [KafkaApi-7] Error when handling request 
> {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=oem-debug- 
> log,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576},{partition=1,fetch_offset=131382630,max_bytes=1048576}]}]}
>  (kafka.server.KafkaAp is) 
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14).
> {code}
>  
>  
>  
> Log after the disk-full issue was resolved and Kafka restarted:
>  
> {code:java}
> [2018-03-15 23:00:08,998] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/kafka/kafka-logs/__consumer
> _offsets-19/03396188.index) has non-zero size but the last offset 
> is 3396188 which is no larger than the base offset 3396188.}. deleting
>  /kafka/kafka-logs/__consumer_offsets-19/03396188.timeindex, 
> /kafka/kafka-logs/__consumer_offsets-19/03396188.index, and /ka
> fka/kafka-logs/__consumer_offsets-19/03396188.txnindex and 
> rebuilding index... (kafka.log.Log)
> [2018-03-15 23:00:08,999] INFO Loading producer state from snapshot file 
> '/kafka/kafka-logs/__consumer_offsets-19/03396188.snapshot' for
>  partition __consumer_offsets-19 (kafka.log.ProducerStateManager)
> [2018-03-15 23:00:09,242] INFO Recovering unflushed segment 3396188 in log 
> __consumer_offsets-19. (kafka.log.Log)
> [2018-03-15 23:00:09,243] INFO Loading producer state from snapshot file 
> '/kafka/kafka-l

[jira] [Commented] (KAFKA-9613) CorruptRecordException: Found record size 0 smaller than minimum record overhead

2023-06-12 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731863#comment-17731863
 ] 

hudeqi commented on KAFKA-9613:
---

I have also encountered this problem in our online environment. Based on the 
timing and the system logs, it appears to be related to disk problems.
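For context, a hedged sketch of where the "(14)" in the error comes from: it is 
the minimum legacy record overhead (4-byte CRC + 1-byte magic + 1-byte 
attributes + 4-byte key length + 4-byte value length), so a size field smaller 
than that cannot be a valid record and usually indicates a zero-filled or 
truncated segment. This is illustrative only, not the broker's actual 
validation code path.

{code:java}
import org.apache.kafka.common.errors.CorruptRecordException;
import org.apache.kafka.common.record.LegacyRecord;

// Illustrative check only; the real validation lives in the broker's log-reading code.
public class RecordOverheadCheckSketch {
    static void checkRecordSize(int sizeFromLogHeader, String file) {
        // LegacyRecord.RECORD_OVERHEAD_V0 == 14
        if (sizeFromLogHeader < LegacyRecord.RECORD_OVERHEAD_V0) {
            throw new CorruptRecordException("Found record size " + sizeFromLogHeader
                    + " smaller than minimum record overhead ("
                    + LegacyRecord.RECORD_OVERHEAD_V0 + ") in file " + file);
        }
    }
}
{code}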

> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> 
>
> Key: KAFKA-9613
> URL: https://issues.apache.org/jira/browse/KAFKA-9613
> Project: Kafka
>  Issue Type: Bug
>Reporter: Amit Khandelwal
>Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> _011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group y_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-6668) Broker crashes on restart ,got a CorruptRecordException: Record size is smaller than minimum record overhead(14)

2023-06-12 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731862#comment-17731862
 ] 

hudeqi commented on KAFKA-6668:
---

I have also encountered this issue in our online environment. Based on the 
timing and the system logs, it appears to be related to disk problems.

> Broker crashes on restart ,got a CorruptRecordException: Record size is 
> smaller than minimum record overhead(14)
> 
>
> Key: KAFKA-6668
> URL: https://issues.apache.org/jira/browse/KAFKA-6668
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.11.0.1
> Environment: Linux version :
> 3.10.0-514.26.2.el7.x86_64 (mockbuild@cgslv5.buildsys213) (gcc version 4.8.5 
> 20150623 (Red Hat 4.8.5-11)
> docker version:
> Client:
>  Version:  1.12.6
>  API version:  1.24
>  Go version:   go1.7.5
>  Git commit:   ad3fef854be2172454902950240a9a9778d24345
>  Built:Mon Jan 15 22:01:14 2018
>  OS/Arch:  linux/amd64
>Reporter: little brother ma
>Priority: Major
>
> There is a Kafka cluster with a broker running in a docker container. Because 
> the disk is full, the container crashes, gets restarted, and crashes again...
> Log when the disk is full:
>  
> {code:java}
> [2018-03-14 00:11:40,764] INFO Rolled new log segment for 'oem-debug-log-1' 
> in 1 ms. (kafka.log.Log) [2018-03-14 00:11:40,765] ERROR Uncaught exception 
> in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) 
> java.io.IOException: I/O error at sun.nio.ch.FileDispatcherImpl.force0(Native 
> Method) at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) at 
> sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) at 
> org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:162) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:377) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at 
> kafka.log.LogSegment.flush(LogSegment.scala:376) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1312) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1311) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> kafka.log.Log.flush(Log.scala:1311) at 
> kafka.log.Log$$anonfun$roll$1.apply$mcV$sp(Log.scala:1283) at 
> kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) 
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) [2018-03-14 00:11:56,514] ERROR 
> [KafkaApi-7] Error when handling request 
> {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=oem-debug- 
> log,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576},{partition=1,fetch_offset=131382630,max_bytes=1048576}]}]}
>  (kafka.server.KafkaAp is) 
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14).
> {code}
>  
>  
>  
> Log after the disk-full issue was resolved and Kafka restarted:
>  
> {code:java}
> [2018-03-15 23:00:08,998] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/kafka/kafka-logs/__consumer
> _offsets-19/03396188.index) has non-zero size but the last offset 
> is 3396188 which is no larger than the base offset 3396188.}. deleting
>  /kafka/kafka-logs/__consumer_offsets-19/03396188.timeindex, 
> /kafka/kafka-logs/__consumer_offsets-19/03396188.index, and /ka
> fka/kafka-logs/__consumer_offsets-19/03396188.txnindex and 
> rebuilding index... (kafka.log.Log)
> [2018-03-15 23:00:08,999] INFO Loading producer state from snapshot file 
> '/kafka/kafka-logs/__consumer_offsets-19/03396188.snapshot' for
>  partition __consumer_offsets-19 (kafka.log.ProducerStateManager)
> [2018-03-15 23:00:09,242] INFO Recovering unflushed segment 3396188 in log 
> __consumer_offs

[jira] [Commented] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME

2023-06-07 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730100#comment-17730100
 ] 

hudeqi commented on KAFKA-15068:


@[~ChrisEgerton] Hello, if you have time, please take a look at this issue.

> Incorrect replication Latency may be calculated when the timestamp of the 
> record is of type CREATE_TIME
> ---
>
> Key: KAFKA-15068
> URL: https://issues.apache.org/jira/browse/KAFKA-15068
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> When MM2 is used to replicate topics between Kafka clusters, if the timestamp 
> type of the records in the source cluster is CREATE_TIME, the timestamp value 
> may not be the actual creation (that is, append) time; for example, the 
> timestamp may lag far behind the current time (this is determined by the 
> producer and often occurs in online environments). In that case the 
> calculated replication latency will be far too large.
> In my understanding, replication latency should reflect the performance of 
> replication and should not be affected by this abnormal situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME

2023-06-07 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-15068:
---
Description: 
When MM2 is used to replicate topics between Kafka clusters, if the timestamp 
type of the records in the source cluster is CREATE_TIME, the timestamp value 
may not be the actual creation (that is, append) time; for example, the 
timestamp may lag far behind the current time (this is determined by the 
producer and often occurs in online environments). In that case the calculated 
replication latency will be far too large.

In my understanding, replication latency should reflect the performance of 
replication and should not be affected by this abnormal situation.
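A minimal sketch of the problem and one possible mitigation (only trusting 
broker-assigned timestamps); the method names are made up for illustration and 
this is not the actual MM2 metric code.

{code:java}
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.record.TimestampType;

public class ReplicationLatencySketch {

    // Naive latency: wall clock minus record timestamp, regardless of timestamp type.
    // With a stale CREATE_TIME timestamp this can be arbitrarily large.
    static long naiveLatencyMs(ConsumerRecord<byte[], byte[]> record, long nowMs) {
        return nowMs - record.timestamp();
    }

    // Possible mitigation: only record latency for broker-assigned timestamps,
    // since a CREATE_TIME timestamp is producer-controlled and may be far in the past.
    static Long latencyMsOrNull(ConsumerRecord<byte[], byte[]> record, long nowMs) {
        if (record.timestampType() != TimestampType.LOG_APPEND_TIME) {
            return null;
        }
        return Math.max(0L, nowMs - record.timestamp());
    }
}
{code}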

> Incorrect replication Latency may be calculated when the timestamp of the 
> record is of type CREATE_TIME
> ---
>
> Key: KAFKA-15068
> URL: https://issues.apache.org/jira/browse/KAFKA-15068
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> When MM2 is used to replicate topics between Kafka clusters, if the timestamp 
> type of the records in the source cluster is CREATE_TIME, the timestamp value 
> may not be the actual creation (that is, append) time; for example, the 
> timestamp may lag far behind the current time (this is determined by the 
> producer and often occurs in online environments). In that case the 
> calculated replication latency will be far too large.
> In my understanding, replication latency should reflect the performance of 
> replication and should not be affected by this abnormal situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15068) Incorrect replication Latency may be calculated when the timestamp of the record is of type CREATE_TIME

2023-06-07 Thread hudeqi (Jira)
hudeqi created KAFKA-15068:
--

 Summary: Incorrect replication Latency may be calculated when the 
timestamp of the record is of type CREATE_TIME
 Key: KAFKA-15068
 URL: https://issues.apache.org/jira/browse/KAFKA-15068
 Project: Kafka
  Issue Type: Improvement
  Components: mirrormaker
Reporter: hudeqi
Assignee: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread

2023-05-09 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721163#comment-17721163
 ] 

hudeqi commented on KAFKA-14979:


[GitHub Pull Request #13692|https://github.com/apache/kafka/pull/13692] has 
gone stale.

> Incorrect lag was calculated when markPartitionsForTruncation in 
> ReplicaAlterLogDirsThread
> --
>
> Key: KAFKA-14979
> URL: https://issues.apache.org/jira/browse/KAFKA-14979
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> When the partitions of a ReplicaFetcherThread finish truncating, the 
> ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
> for truncation. The lag value in the newState (PartitionFetchState) obtained 
> in this process is still the original value (state.lag). If the 
> truncationOffset is smaller than the original state.fetchOffset, the original 
> lag value is no longer correct and needs to be updated: it should be the 
> original lag plus the difference between the original state.fetchOffset and 
> the truncationOffset.
> If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
> lag for fetcherLagStats when executing "processFetchRequest". So it might be 
> more reasonable to recalculate the lag based on the new state.fetchOffset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread

2023-05-09 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-14979:
---
Description: 
When the partitions of a ReplicaFetcherThread finish truncating, the 
ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
for truncation. The lag value in the newState (PartitionFetchState) obtained 
in this process is still the original value (state.lag). If the 
truncationOffset is smaller than the original state.fetchOffset, the original 
lag value is no longer correct and needs to be updated: it should be the 
original lag plus the difference between the original state.fetchOffset and 
the truncationOffset.
If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
lag for fetcherLagStats when executing "processFetchRequest". So it might be 
more reasonable to recalculate the lag based on the new state.fetchOffset.

  was:
When the partitions of a ReplicaFetcherThread finish truncating, the 
ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
for truncation. The lag value in the newState (PartitionFetchState) obtained 
in this process is still the original value (state.lag). If the 
truncationOffset is smaller than the original state.fetchOffset, the original 
lag value is no longer correct and needs to be updated: it should be the 
original lag plus the difference between the original state.fetchOffset and 
the truncationOffset.
If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
lag for fetcherLagStats when executing processFetchRequest. So it might be 
more reasonable to recalculate the lag based on the new state.fetchOffset.


> Incorrect lag was calculated when markPartitionsForTruncation in 
> ReplicaAlterLogDirsThread
> --
>
> Key: KAFKA-14979
> URL: https://issues.apache.org/jira/browse/KAFKA-14979
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> When the partitions of a ReplicaFetcherThread finish truncating, the 
> ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
> for truncation. The lag value in the newState (PartitionFetchState) obtained 
> in this process is still the original value (state.lag). If the 
> truncationOffset is smaller than the original state.fetchOffset, the original 
> lag value is no longer correct and needs to be updated: it should be the 
> original lag plus the difference between the original state.fetchOffset and 
> the truncationOffset.
> If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
> lag for fetcherLagStats when executing "processFetchRequest". So it might be 
> more reasonable to recalculate the lag based on the new state.fetchOffset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread

2023-05-09 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-14979:
---
Description: 
When the partitions of a ReplicaFetcherThread finish truncating, the 
ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
for truncation. The lag value in the newState (PartitionFetchState) obtained 
in this process is still the original value (state.lag). If the 
truncationOffset is smaller than the original state.fetchOffset, the original 
lag value is no longer correct and needs to be updated: it should be the 
original lag plus the difference between the original state.fetchOffset and 
the truncationOffset.
If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
lag for fetcherLagStats when executing processFetchRequest. So it might be 
more reasonable to recalculate the lag based on the new state.fetchOffset.
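A minimal sketch of the adjustment described above; the parameter names mirror 
PartitionFetchState for readability, but this is not the actual broker code.

{code:java}
// Sketch: when a partition is marked for truncation to truncationOffset, carrying
// the old lag forward is wrong if truncationOffset < fetchOffset, because the
// replica is now further behind by exactly (fetchOffset - truncationOffset).
public class TruncationLagSketch {
    static long adjustedLag(long currentLag, long fetchOffset, long truncationOffset) {
        if (truncationOffset < fetchOffset) {
            return currentLag + (fetchOffset - truncationOffset);
        }
        return currentLag;
    }
}
{code}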

> Incorrect lag was calculated when markPartitionsForTruncation in 
> ReplicaAlterLogDirsThread
> --
>
> Key: KAFKA-14979
> URL: https://issues.apache.org/jira/browse/KAFKA-14979
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> When the partitions of a ReplicaFetcherThread finish truncating, the 
> ReplicaAlterLogDirsThread that these partitions belong to needs to be marked 
> for truncation. The lag value in the newState (PartitionFetchState) obtained 
> in this process is still the original value (state.lag). If the 
> truncationOffset is smaller than the original state.fetchOffset, the original 
> lag value is no longer correct and needs to be updated: it should be the 
> original lag plus the difference between the original state.fetchOffset and 
> the truncationOffset.
> If the stale lag value is used, ReplicaAlterLogDirsThread may set the wrong 
> lag for fetcherLagStats when executing processFetchRequest. So it might be 
> more reasonable to recalculate the lag based on the new state.fetchOffset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread

2023-05-09 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi reassigned KAFKA-14979:
--

Assignee: hudeqi

> Incorrect lag was calculated when markPartitionsForTruncation in 
> ReplicaAlterLogDirsThread
> --
>
> Key: KAFKA-14979
> URL: https://issues.apache.org/jira/browse/KAFKA-14979
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14979) Incorrect lag was calculated when markPartitionsForTruncation in ReplicaAlterLogDirsThread

2023-05-09 Thread hudeqi (Jira)
hudeqi created KAFKA-14979:
--

 Summary: Incorrect lag was calculated when 
markPartitionsForTruncation in ReplicaAlterLogDirsThread
 Key: KAFKA-14979
 URL: https://issues.apache.org/jira/browse/KAFKA-14979
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 3.3.2
Reporter: hudeqi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14907) Add the traffic metric of the partition dimension in BrokerTopicStats

2023-04-22 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-14907:
---
Labels: KIP-922  (was: )

> Add the traffic metric of the partition dimension in BrokerTopicStats
> -
>
> Key: KAFKA-14907
> URL: https://issues.apache.org/jira/browse/KAFKA-14907
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>  Labels: KIP-922
>
> Currently there are two metrics for measuring traffic at the topic level: 
> MessagesInPerSec and BytesInPerSec, but they have two problems:
> 1. It is hard to see topic partition traffic skew from these metrics: they 
> cannot show which partition carries the most traffic or how traffic is 
> distributed across partitions. Partition-level metrics can solve this.
> 2. For a sudden increase in topic traffic (which, for example, can push some 
> followers out of the ISR and can be avoided by appropriately increasing the 
> number of partitions), partition-level metrics help decide whether to expand 
> the partition count.
> On the whole, I think it is very meaningful to add partition-level traffic 
> metrics, especially for diagnosing traffic skew.
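For illustration, a minimal sketch of a per-partition byte-rate metric built on 
the Kafka metrics library; the metric group, name, and tags below are made up 
for the example and are not the names proposed in KIP-922.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

// Sketch: one sensor per topic-partition, tagged with topic and partition, so the
// rate of each partition (and therefore any skew) is visible individually.
public class PartitionTrafficMetricsSketch {

    private final Metrics metrics = new Metrics();
    private final Map<TopicPartition, Sensor> bytesInSensors = new ConcurrentHashMap<>();

    public void recordBytesIn(TopicPartition tp, long bytes) {
        Sensor sensor = bytesInSensors.computeIfAbsent(tp, partition -> {
            Map<String, String> tags = new HashMap<>();
            tags.put("topic", partition.topic());
            tags.put("partition", String.valueOf(partition.partition()));
            Sensor s = metrics.sensor("bytes-in-" + partition);
            s.add(metrics.metricName("bytes-in-rate", "partition-metrics",
                    "Bytes in per second for this partition", tags), new Rate());
            return s;
        });
        sensor.record(bytes);
    }
}
{code}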



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14907) Add the traffic metric of the partition dimension in BrokerTopicStats

2023-04-22 Thread hudeqi (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-14907:
---
Description: 
Currently there are two metrics for measuring traffic at the topic level: 
MessagesInPerSec and BytesInPerSec, but they have two problems:
1. It is hard to see topic partition traffic skew from these metrics: they 
cannot show which partition carries the most traffic or how traffic is 
distributed across partitions. Partition-level metrics can solve this.
2. For a sudden increase in topic traffic (which, for example, can push some 
followers out of the ISR and can be avoided by appropriately increasing the 
number of partitions), partition-level metrics help decide whether to expand 
the partition count.

On the whole, I think it is very meaningful to add partition-level traffic 
metrics, especially for diagnosing traffic skew.

  was:
Currently, there are three metrics for measuring traffic at the topic level: 
MessagesInPerSec, BytesInPerSec, and BytesOutPerSec, but they have two 
problems:
1. It is hard to see topic partition traffic skew from these metrics: they 
cannot show which partition carries the most traffic or how traffic is 
distributed across partitions.
2. For a sudden increase in topic traffic, partition-level metrics can provide 
guidance on whether to expand the partition count and roughly by how many 
partitions.


> Add the traffic metric of the partition dimension in BrokerTopicStats
> -
>
> Key: KAFKA-14907
> URL: https://issues.apache.org/jira/browse/KAFKA-14907
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
>
> Currently there are two metrics for measuring traffic at the topic level: 
> MessagesInPerSec and BytesInPerSec, but they have two problems:
> 1. It is hard to see topic partition traffic skew from these metrics: they 
> cannot show which partition carries the most traffic or how traffic is 
> distributed across partitions. Partition-level metrics can solve this.
> 2. For a sudden increase in topic traffic (which, for example, can push some 
> followers out of the ISR and can be avoided by appropriately increasing the 
> number of partitions), partition-level metrics help decide whether to expand 
> the partition count.
> On the whole, I think it is very meaningful to add partition-level traffic 
> metrics, especially for diagnosing traffic skew.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14906) Extract the coordinator service log from server log

2023-04-14 Thread hudeqi (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712337#comment-17712337
 ] 

hudeqi commented on KAFKA-14906:


This change will be reintroduced in version 4.x.

> Extract the coordinator service log from server log
> ---
>
> Key: KAFKA-14906
> URL: https://issues.apache.org/jira/browse/KAFKA-14906
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 4.0.0
>Reporter: hudeqi
>Assignee: hudeqi
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently, the coordinator service log and the server log are mixed together. 
> When troubleshooting coordinator problems, one has to filter them out of the 
> server log, which is inconvenient. Therefore, the coordinator log should be 
> separated out, like the controller log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

