nodece commented on PR #1500: URL: https://github.com/apache/pulsar-client-go/pull/1500#issuecomment-4504296859
Great fix on the deadlock path from #1494 — the lock-free `partitionConsumers()` snapshot + delayed `startDispatcher()` is a solid direction for the original blocking issue. I think there is still a new concurrency window introduced by removing the old mutex guarding `consumers`. In the new code: - `internalTopicSubscribeToPartitions()` builds `newConsumers`, then publishes via `c.consumers.Store(...)`, then starts dispatchers for new partitions. - `closeWithCause()` closes only one snapshot (`consumers := c.partitionConsumers()`) and no longer serializes with partition expansion. So this interleaving seems possible: 1. Goroutine A enters `internalTopicSubscribeToPartitions()` and starts creating new partition consumers. 2. Goroutine B calls `closeWithCause()`, takes an old snapshot, closes those consumers, and continues close flow. 3. Goroutine A continues and still executes `c.consumers.Store(...)` + `startDispatcher()` for newly created consumers. Result: newly added partition consumers may be published/started after close has already closed only the old snapshot. Before this PR, the parent lock serialized expansion and close-related operations over `c.consumers`; removing that lock fixes deadlock, but also removes that lifecycle serialization. Could we gate expansion with context cancellation or a closing flag (checked before create and before Store/startDispatcher), so in-flight expansion stops/cleans up once close starts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
