qianye1001 commented on PR #10564: URL: https://github.com/apache/rocketmq/pull/10564#issuecomment-4852191511
Thanks for the patch — but I'd argue this scenario is a **misuse of the producer group abstraction, not a bug**, and doesn't need a broker-side fix. Per the official docs, a producer group is explicitly defined as "*a collection of the same type of Producer, which sends the same type of messages with **consistent logic***" ([Concept.md](https://github.com/apache/rocketmq/blob/develop/docs/en/Concept.md)). The transaction check-back mechanism is designed around exactly this invariant: "*If a transaction message is sent and the original producer crashes after sending, the broker will contact **other producers in the same producer group** to commit or rollback the transactional message*" ([Design_Transaction.md](https://github.com/apache/rocketmq/blob/develop/docs/en/Design_Transaction.md)). Any producer in the group must be able to answer a check for any transaction sent by that group — that fungibility is what makes the check-back fault-tolerant. If different topics have independent transaction state / `TransactionListener` logic, the correct fix is on the user side: **use separate producer groups per business/topic**. As @contrueCT already pointed out in #9791, another option is a shared state store (Redis / DB) so `checkLocalTransaction` returns consistent results across all producers in the group. Concerns with the current patch: 1. It papers over the misuse rather than surfacing it; users hitting this will silently keep depending on "sticky" routing that isn't part of the contract. 2. It weakens the group-level fault tolerance. If the originating client is still alive but its `TransactionListener` is stateless / has lost state (e.g. after a restart), we now prefer it over healthier peers that could have answered via a shared state backend. Fallback to round-robin only kicks in when the preferred channel is missing, not when it's semantically the wrong one to ask. 3. It adds a permanent message property (`__TXN_PRODUCER_CID__`) on every half message for a case that has a clean user-side solution — small per-message cost, but it's forever. 4. Rolling-upgrade + rebalancing edge cases (old producer sends half, new broker checks back, or vice versa) add reasoning surface that isn't paying for itself if the root cause is a config mistake. Suggestion: close this and instead update the docs (Design_Transaction / best-practices) to explicitly call out "one producer group per transactional business — don't multiplex unrelated topics into one group". Would be happy to see a docs PR for that. cc @wang-jiahua — no offense intended, the diagnosis in the issue is accurate, I just think the fix belongs on the user side. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
