lhotari commented on PR #25706:
URL: https://github.com/apache/pulsar/pull/25706#issuecomment-4406555710

   Responding to the previous comment here.
   
   > Thank you for your feedback. I'll keep my response brief to avoid making 
the thread too long for others to review and share their opinions.
   > 
   > **Alt 2** does not solve any of the core problems — it only delays their 
onset. The argument "if it survives the peak, it's enough" does not hold for 
sustained hot key scenarios. In production, we cannot bet that "sustained 
slow-consumption hot keys will never occur," especially for critical business 
workloads. This would undermine users' confidence in Pulsar.
   
   As a general thought, I think building a system that covers 100% of all 
possible use cases would add complexity, impact reliability, and increase the 
risk of incidents.
   
   For the minority of use cases that aren't covered by the current solution, I 
believe we can find ways to mitigate the possible issues.
   That's why I'd lean toward prioritizing improvements to the existing 
architecture over adding the solution proposed in this PR.
   
   As I mentioned earlier, it's already possible to scale the individual ack 
holes solution to 1M entries, which would help mitigate the issues further.
   Storing individual acks could also be improved: the current algorithm of 
storing everything each time could be changed to an incremental model where 
only the new acks are stored. Roaring Bitmap is very efficient at performing an 
AND over multiple separate bitmaps, so this should work nicely.
   There's nothing preventing us from implementing a solution where there 
wouldn't be a practical limit on the number of ack holes (say, when the limit 
is 10M or 100M). There's already an accepted PIP for this: [PIP-381: Handle 
large PositionInfo 
state](https://github.com/apache/pulsar/blob/master/pip/pip-381.md). The 
PIP-381 implementation wasn't merged because of a previous alternative 
implementation, [PR 9292](https://github.com/apache/pulsar/pull/9292), which is 
now the preferred implementation. One notable detail is that PR 9292 is 
disabled by default in Pulsar 4.0 
(managedLedgerPersistIndividualAckAsLongArray=false). In Pulsar 4.0 needs to be 
separately enabled with managedLedgerPersistIndividualAckAsLongArray=true.
   The existing PR 9292 implementation and the PIP-381 design could be further 
improved to support a "limitless" implementation that incrementally stores the 
individual acks into BookKeeper.
   
   On the client application side, for use cases with hot keys, I think there 
could be ways to identify which keys tend to be hot. In those cases, the 
application could route the hot keys to separate topics already at produce 
time. The same could also be done in a Pulsar Function (or another application) 
that splits hot keys to another topic, although that wouldn't be as efficient 
as splitting at produce time.
   
   As I mentioned earlier, on the client side there are also multiple ways to 
scale consumers so they can handle hot/slow keys while still processing other 
keys. The [virtual threads MessageListener 
example](https://github.com/lhotari/dss25-mastering-key-ordered-demo/blob/740dd8d1b350b8201266fb62d0456f44806299ba/pulsar-listener/src/main/java/com/github/lhotari/dss25/listener/PulsarListenerApp.java#L145-L188)
 is one nice possibility.
   
   > **Alt 1** and **Overflow ML** both solve the core problems. Alt 1's cost 
is storage amplification (the auxiliary cursor's mark-delete cannot advance → 
entire ledgers are retained) + longer broker restart recovery time (the 
auxiliary cursor must replay from far behind). Overflow ML's cost is a 
secondary BK write for hot key data.
   
   This is already optimized so that already acknowledged messages are skipped 
even when a cursor replays from far behind.
   
   > **Alt 3** is the weakest — the victim (stuck consumer) cannot self-rescue.
   > 
   > I still lean toward **Overflow ML** because it addresses all the core 
problems. Its cost — writing hot key data to a secondary BK ledger — can be 
managed through disk capacity planning and expansion. Hot keys may persist for 
a long time, but their data volume is typically a small fraction of total 
traffic. Meanwhile, its backlog and consumption progress metrics remain clean 
and straightforward for operators.
   > 
   > Alt 1's storage amplification can also be addressed via disk expansion, 
but the longer broker restart recovery time is a harder trade-off. Alt 1 could 
also expose additional metrics for backlog and consumption progress, but that 
seems more complex and less user-friendly. Overall, Overflow ML provides 
cleaner operational visibility.
   > 
   > Looking forward to hearing more voices and feedback.
   
   I agree that it feels cleaner as a plan at first thought. The main reason 
for my pushback is that adding complexity in a distributed system usually 
impacts reliability by introducing new failure modes. That's why I'm pushing 
back, so that we also look into other solutions that could improve the existing 
architecture and still meet the main goal of this PIP.
   
   First, we should agree on the problem statement and address the problems 
specifically:
   - hot or slow keys
     - will cause head-of-line blocking for other keys assigned to the 
consumer, due to either serial processing in the consumer or the consumer 
eventually running out of permits
       - this will result in "ack holes", which increase the individual ack size
       - the number of "ack holes" gets amplified by head-of-line blocking, 
since the keys blocked behind a hot or slow key will also remain unacked
     - when the persistent limit for individual acks is reached, message 
delivery will eventually stop for all consumers
   
   By taking a look at each problem and listing the possible alternatives, we 
could make improvements that result in a completely different solution than the 
"Overflow ML" one. There could be better ways to address the root causes of the 
problem with minimal added complexity. Besides solving the hot/slow key 
problems for Key_Shared, there could be broader benefits if we improve 
individual ack handling and address head-of-line blocking issues that possibly 
impact other subscription types (Shared, Exclusive, Failover) besides 
Key_Shared. Head-of-line blocking is also a real issue for Shared 
subscriptions. In extreme cases, users currently work around it for Shared 
subscriptions by setting the receiver queue size to 0 or 1. A very small 
receiver queue size kills performance and doesn't work efficiently with 
workloads that mix fast and slow consumers, or with workloads containing 
messages whose processing durations vary widely.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to