nodece commented on issue #25146: URL: https://github.com/apache/pulsar/issues/25146#issuecomment-3758623450
@lhotari Thanks for pointing out the related issues! The fact that #21082 and similar issues have been reported multiple times, along with the previous attempt in PR #20540, confirms that this is a systemic problem rather than an isolated edge case.

## Root cause

The fundamental issue is the reliance on one-way broker-to-consumer notifications. During topic unloading (as demonstrated in the test scenario), there is a critical timing window in which:

- Broker-side state transitions rapidly: fencing → unloading → close consumer → cleanup cache
- The close notification can be lost: consumer connections may be closed before they receive it, leaving the consumers in a "zombie" state

## Proposed solution

Instead of fixing the complex broker notification timing issues, a more robust approach would be:

### Add a heartbeat mechanism at the consumer level

This would allow consumers to:

- Proactively detect subscription issues
- Auto-recover from "zombie" states without waiting for broker notifications
- Handle topic transfers and unloading gracefully

### Trade-offs

Disadvantage: this introduces additional overhead (network traffic, and CPU cycles on both the broker and consumer sides).

Mitigation:

- Make the heartbeat interval configurable (e.g., 30s-60s by default)
- Only enable it for critical consumers, or make it opt-in
- Piggyback the heartbeat on existing traffic when possible

This aligns with distributed systems best practice, where active health checking is more reliable than passive notification, though it comes at a performance cost.
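To make the heartbeat idea above more concrete, here is a minimal sketch of what the consumer-side loop could look like. It is an illustration only, not existing Pulsar client code: `SubscriptionHeartbeat`, the `probe` callback (assumed to ask the broker whether it still knows this consumer), the `recover` callback (assumed to re-subscribe), and `maxMisses` are all hypothetical names introduced here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical consumer-side heartbeat; none of these names are Pulsar APIs. */
final class SubscriptionHeartbeat implements AutoCloseable {
    private final BooleanSupplier probe; // assumed: true if the broker still knows this consumer
    private final Runnable recover;      // assumed: re-subscribes the consumer
    private final int maxMisses;         // consecutive failed probes tolerated before recovery
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "subscription-heartbeat");
                t.setDaemon(true); // do not keep the JVM alive just for heartbeats
                return t;
            });
    private int misses;

    SubscriptionHeartbeat(BooleanSupplier probe, Runnable recover, int maxMisses) {
        if (maxMisses < 1) {
            throw new IllegalArgumentException("maxMisses must be >= 1");
        }
        this.probe = probe;
        this.recover = recover;
        this.maxMisses = maxMisses;
    }

    /** Starts periodic checks; the interval is the configurable knob from the proposal. */
    void start(long interval, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so that a failed recovery does not cancel later heartbeats.
            }
        }, interval, interval, unit);
    }

    /** Runs one heartbeat; returns true if recovery was triggered. */
    synchronized boolean checkOnce() {
        boolean alive;
        try {
            alive = probe.getAsBoolean();
        } catch (RuntimeException e) {
            alive = false; // a failing probe counts as a missed heartbeat
        }
        if (alive) {
            misses = 0;
            return false;
        }
        misses++;
        if (misses < maxMisses) {
            return false;
        }
        misses = 0;
        recover.run(); // the consumer looks like a "zombie": re-subscribe
        return true;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape, an opt-in consumer would call something like `heartbeat.start(30, TimeUnit.SECONDS)` after subscribing and `heartbeat.close()` when the consumer is closed. Requiring several consecutive misses before recovering is one way to avoid re-subscribing on a single slow response; the right threshold and the actual probe command would need to be decided as part of the protocol change.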
