Thanks for the valuable KIP, Lucas! 
I've read through the document and have some initial comments.


ASH01:
I understand that deduplication and backoff are intentionally delegated to the 
plugin. However, since the plugin runs in the broker process and 
requiresTopologyPush is called on the heartbeat path, a misbehaving or 
incorrectly implemented plugin could still affect coordinator stability.
There seem to be two separate risks:

1. A slow or blocking requiresTopologyPush implementation could increase
heartbeat latency and impact Streams group coordination.

2. A plugin that repeatedly returns true could cause clients to repeatedly send
UpdateStreamsGroupTopologyDescription requests, increasing unnecessary network
and request load between clients and brokers.


Should we consider adding broker-side guard rails independent of the plugin 
implementation? For example, the broker could enforce a per-group or per 
(groupId, topologyDescriptionId) minimum interval / rate limit for including 
TopologyDescriptionId in heartbeat responses, even if the plugin keeps 
returning true. This would not replace plugin-side deduplication, but it would 
bound the blast radius of a bad plugin implementation.
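To make the idea concrete, the guard rail I have in mind could look roughly
like the sketch below (all names here are hypothetical and just illustrative,
not from the KIP): the broker only includes the TopologyDescriptionId in a
heartbeat response if the plugin asked for a push AND a minimum interval has
elapsed for that (groupId, topologyDescriptionId) pair.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical broker-side guard: enforces a minimum interval between
// topology-push prompts for the same (groupId, topologyDescriptionId),
// independent of what the plugin's requiresTopologyPush keeps returning.
public class TopologyPushRateLimiter {
    private final long minIntervalMs;
    private final Map<String, Long> lastPromptMs = new ConcurrentHashMap<>();

    public TopologyPushRateLimiter(Duration minInterval) {
        this.minIntervalMs = minInterval.toMillis();
    }

    // Returns true only if the plugin requested a push AND enough time has
    // elapsed since the last prompt for this (groupId, topologyDescriptionId).
    public boolean shouldPrompt(String groupId, String topologyDescriptionId,
                                boolean pluginRequestedPush, long nowMs) {
        if (!pluginRequestedPush) {
            return false;
        }
        String key = groupId + "#" + topologyDescriptionId;
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent heartbeats for the
        // same pair cannot both pass the interval check.
        lastPromptMs.compute(key, (k, last) -> {
            if (last == null || nowMs - last >= minIntervalMs) {
                allowed[0] = true;
                return nowMs;
            }
            return last;
        });
        return allowed[0];
    }
}
```

With a one-second interval, a plugin that returns true on every heartbeat would
still only cause one push prompt per second per (groupId,
TopologyDescriptionId) pair.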


ASH02:
One scenario that may be worth considering is topology visibility during 
topology updates or rolling upgrades.

Since STALE_TOPOLOGY members are skipped, StreamsGroupDescribe seems to expose 
only the topology for the current TopologyDescriptionId. During an update, 
however, some members may still be running the previous topology epoch. In that 
state, returning only the current topology may be less helpful, or potentially 
misleading, for operators.

Would it make sense to expose both the current and previous/stale topology 
descriptions, tagged by topology epoch or description id?

ASH03:
Related to this, it would be helpful to clarify the behavior for existing 
Streams groups that were created, or had their latest topology epoch bumped, 
while coordinated by a broker without this feature enabled.

If such a group later moves to a plugin-enabled coordinator, there may no 
longer be a group-creation or topology-epoch-bump event to trigger minting a 
TopologyDescriptionId. In that case, should the plugin-enabled coordinator 
backfill a TopologyDescriptionId when it loads the group or handles the next 
successful heartbeat?

Otherwise, existing groups created under plugin-less coordinators may not push 
a topology until the next topology epoch bump.

ASH04:
One edge case that may be worth clarifying is what happens if multiple members 
push different topology descriptions for the same (groupId, 
TopologyDescriptionId).
The plugin contract says that concurrent calls for the same pair carry 
identical data and should be treated as idempotent. In practice, this 
assumption could potentially be violated due to configuration drift, a failed 
rolling deployment, or a client-side bug, even if this is expected to be rare.
Should the KIP define the expected behavior in this case? For example, should 
the broker avoid validating this and leave the policy to the plugin, should the 
plugin treat the first successfully stored topology as authoritative, or should 
a mismatched later push be rejected as INVALID_REQUEST?
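If the "first successfully stored topology is authoritative" option were
chosen, the plugin-side policy could be sketched roughly as below (class and
enum names are hypothetical; a real plugin would persist and compare
descriptions in its own store):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical plugin-side policy: the first successfully stored topology
// description for a (groupId, TopologyDescriptionId) pair wins; identical
// later pushes are idempotent duplicates, and differing payloads are flagged
// so the broker could map them to e.g. INVALID_REQUEST.
public class FirstWriteWinsStore {
    public enum Result { STORED, DUPLICATE, MISMATCH }

    private final Map<String, String> store = new ConcurrentHashMap<>();

    public Result push(String groupId, String topologyDescriptionId,
                       String description) {
        String key = groupId + "#" + topologyDescriptionId;
        // putIfAbsent is atomic, so concurrent first pushes cannot both win.
        String existing = store.putIfAbsent(key, description);
        if (existing == null) {
            return Result.STORED;
        }
        return existing.equals(description) ? Result.DUPLICATE : Result.MISMATCH;
    }
}
```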

Best Regards,
Sanghyeok An
