Re: [PR] [pip] PIP-474: Key_Shared Hot Key Overflow Mechanism [pulsar]

via GitHub Thu, 07 May 2026 00:42:16 -0700


liangyepianzhou commented on PR #25706:
URL: https://github.com/apache/pulsar/pull/25706#issuecomment-4395146323


   > I'd suggest analyzing 3 alternative designs before deciding on the 
solution.
   > 
   > Alternative 1: I'd suggest looking into an alternative design that 
achieves the same outcome of allowing the subscription cursor to advance. 
Instead of making copies of the messages, an alternative design would be to 
create another subscription to track the slow or hot keys. Essentially, the 
design could be very similar as diverting to the overflow managed ledger, but 
there wouldn't be a need to duplicate the data and get into a situation where 
different failure modes cause unnecessary complications.
   > 
   > Alternative 2: Simply optimize the replay queue solution together with 
improving the scalability of individualDeletedMessages so that it scales to 
1,000,000 ack holes and beyond. This would result in the simplest solution, 
which would cover most use cases. There are multiple benefits in keeping the 
solution simple. For example backlog management doesn't change.
   > 
   > Alternative 3: The client-side code could simply route to a separate topic 
on its own when it detects a hot key and acknowledge the original message.
   
   
   Thanks for the feedback. Let's first analyze the design approaches, and then 
we can refine the wording details.
   
   Here are my thoughts on the three alternatives you proposed:
   
   **Alternative 1**
   
   1) The mark-delete gap problem is transferred, not eliminated.
   
   The auxiliary cursor would need to individually ack all non-hot-key messages 
(keeping only hot-key messages unacked). This means `individualDeletedMessages` 
on the auxiliary cursor would grow rapidly. If 3 out of 50 keys are hot and 
traffic is spread evenly across keys, then 94% of messages need to be 
individually acked on the auxiliary cursor. This is exactly the same problem, 
just occurring on a different cursor.
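
   As a rough check of the 94% figure, here is a minimal sketch. It assumes 
every key carries the same message rate (an assumption for illustration only), 
and the function name is hypothetical, not part of Pulsar.

```python
from fractions import Fraction


def aux_cursor_individual_ack_share(total_keys: int, hot_keys: int) -> Fraction:
    """Share of messages the auxiliary cursor must ack individually.

    Assumption: traffic is spread evenly across keys, so the share of
    non-hot messages equals the share of non-hot keys.
    """
    if total_keys <= 0 or not 0 <= hot_keys <= total_keys:
        raise ValueError("need total_keys > 0 and 0 <= hot_keys <= total_keys")
    return Fraction(total_keys - hot_keys, total_keys)


share = aux_cursor_individual_ack_share(total_keys=50, hot_keys=3)
print(f"{float(share):.0%}")  # 94%
```

   If the hot keys carry more than their even share of traffic, the percentage 
drops, but every non-hot message still needs an individual ack on the 
auxiliary cursor.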
   
   2) Storage amplification is far greater than with the Overflow ML.
   
   The auxiliary cursor prevents ledger GC at the **entire ledger granularity** 
— as long as the auxiliary cursor's mark-delete hasn't advanced past a ledger, 
the entire ledger is retained (containing messages for all keys). The Overflow 
ML, by contrast, only stores hot-key messages. If hot keys account for 5% of 
traffic:
   - Auxiliary cursor approach: retains 100% of the data
   - Overflow ML approach: stores only ~5% additional data
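
   The difference can be sketched with a toy model. The ledger sizes, the 
position of the stuck mark-delete, and both function names are made up for 
illustration; this is not how the broker computes retention.

```python
def retained_entries_aux_cursor(ledger_sizes, first_unacked_index):
    """Entries pinned when trimming works at whole-ledger granularity.

    ledger_sizes: entries per ledger, oldest first.
    first_unacked_index: global index of the oldest unacked hot-key entry,
    i.e. where the auxiliary cursor's mark-delete is stuck.
    """
    retained = 0
    start = 0
    for size in ledger_sizes:
        end = start + size
        # A ledger is kept in full if any of its entries lies at or after
        # the stuck position, regardless of which keys those entries hold.
        if end > first_unacked_index:
            retained += size
        start = end
    return retained


def stored_entries_overflow(total_entries, hot_fraction):
    """Extra entries written when only hot-key messages are copied."""
    return round(total_entries * hot_fraction)


ledgers = [1000] * 10  # hypothetical: 10 ledgers of 1000 entries each
print(retained_entries_aux_cursor(ledgers, first_unacked_index=0))  # 10000
print(stored_entries_overflow(sum(ledgers), hot_fraction=0.05))     # 500
```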
   
   3) Dual-ack implementation is more invasive.
   
   Every normal message must be acked on **two cursors** (original + 
auxiliary), requiring changes throughout the dispatch-ack chain.
   
   **Conclusion**: This approach cannot count "no data duplication" as a net 
saving: it avoids copying messages only at the cost of much broader ledger 
retention, and it is more invasive to implement than Overflow. That said, the 
goal (avoiding data duplication) is reasonable; the overall cost simply turns 
out to be higher.
   
   **Alternative 2**
   
   1) Normal Read compression still exists.
   
   `getMaxEntriesReadLimit() = max(limit - replaySize, 1)`. Even with a limit 
of 1M, once the replay queue grows to 500K entries, each Normal Read is capped 
at 500K entries, and the cap keeps shrinking as the replay queue grows. Over 
time the replay queue will still fill up; it just takes tens of minutes instead 
of 2.7 minutes. This treats the symptom, not the cause.
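
   The formula above can be tried out directly. This is a plain transcription 
of the quoted expression into Python for illustration, not the broker code; the 
replay sizes are arbitrary sample values.

```python
def max_entries_read_limit(limit: int, replay_size: int) -> int:
    """Transcription of max(limit - replaySize, 1) from the text above."""
    return max(limit - replay_size, 1)


LIMIT = 1_000_000
for replay_size in (0, 500_000, 999_999, 1_000_000, 2_000_000):
    # The Normal Read budget shrinks as the replay queue grows and
    # bottoms out at 1 once the replay queue reaches the limit.
    print(replay_size, max_entries_read_limit(LIMIT, replay_size))
```

   Raising the limit only moves the point at which the budget reaches 1; it 
does not change the shape of the curve.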
   
   2) The hidden cost of "the simplest solution" is underestimated.
   
   Scaling `individualDeletedMessages` to the millions involves:
   - Serialization/deserialization overhead for the RangeSet — every cursor 
persistence operation must handle millions of ranges (whether written to the 
cursor ledger in BK or the metadata store)
   - Broker restart recovery time: recovery has to scan millions of ack holes 
starting from the mark-delete position
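
   To see why the range count tracks the number of ack holes, here is a toy 
model. It is not Pulsar's RangeSet implementation; the helper names and the 
"every 20th message is stuck" pattern are assumptions for illustration.

```python
def acked_ranges(acked_positions):
    """Collapse acked entry indexes into closed [start, end] ranges."""
    ranges = []
    for pos in sorted(set(acked_positions)):
        if ranges and pos == ranges[-1][1] + 1:
            ranges[-1][1] = pos  # extend the current contiguous range
        else:
            ranges.append([pos, pos])  # a hole precedes pos: start a new range
    return ranges


def simulate(total_messages, hot_every):
    """Every hot_every-th message belongs to a stuck key and stays unacked."""
    acked = [i for i in range(total_messages) if i % hot_every != 0]
    return acked_ranges(acked)


ranges = simulate(total_messages=100_000, hot_every=20)
print(len(ranges))  # 5000: one range per ack hole
```

   When stuck messages are interleaved with consumable ones, each hole splits 
the acked region again, so the number of ranges to persist and recover grows 
linearly with the number of stuck messages.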
   
   3) Backlog management "not changing" is actually a disadvantage.
   
   The backlog would include a large volume of stuck-key messages that are 
known to be unconsumable, misleading operational judgment. The Overflow 
approach isolates them so the backlog reflects only genuinely consumable 
messages — which is actually more accurate semantics.
   
   **Conclusion**: This approach is useful for mild scenarios (e.g., brief 
consumer restarts), but is ineffective for sustained hot keys.
   
   **Alternative 3**
   
   This approach has the most fundamental contradiction — a stuck consumer 
cannot rescue itself.
   
   Client transparency is violated: all Pulsar client libraries (Java, C++, 
Python, Go, Node.js…) would need to implement hot-key detection + routing 
logic. This is a massive cross-language engineering effort and is invasive to 
users.
   
   **Conclusion**: This is the weakest alternative. It pushes a broker-side 
scheduling problem to the client, yet the client is precisely the victim of the 
problem (stuck consumer) and cannot rescue itself.
   
   ---
   
   The core advantage of the Overflow ML approach is **complete isolation**: 
hot-key messages are entirely removed from the main dispatch path — not tagged, 
not deferred, not accommodated by enlarging capacity — so that the replay 
queue, Normal Read, and mark-delete all return to normal. The cost (additional 
BK writes) is proportional to hot-key traffic: only hot-key messages are 
written, rather than entire ledgers being retained.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
