GitHub user lhotari edited a comment on the discussion: Key_Shared consumer 
scale behavior with a partitioned topic

I'm currently working on a new design for Key_Shared AUTO_SPLIT mode which 
would cover the requirement of individual hash tracking. The goal is that a 
message will only get blocked when it is required for preserving ordering.  
#23231 improves the situation as well as the PIP-282 changes, but the goal 
isn't met. The remaining gap are the current "recently joined consumers" rules 
(which were significantly improved in PIP-282).

For STICKY mode, I have created https://github.com/apache/pulsar/pull/23275. 
Currently the "recently joined customer" rules are applied for STICKY mode 
which is causing unnecessary blocking.

For AUTO_SPLIT mode, there's an additional requirement that I'd like to cover, 
which is to have a way for the clients to detect when a specific consumer will 
no longer receive more messages for a particular message key after the 
assignments change in AUTO_SPLIT mode. There's a related issue 
https://github.com/apache/pulsar/issues/6555 about it, but it doesn't currently 
capture the actual requirement. 

The use case is about being able to keep some client-side state for a 
particular message key and discard the state when the message key no longer 
will appear in the stream for a particular consumer. When the message key 
appears the first time, the client might build some entity and add it to a 
cache. When the message key is no longer assigned to the consumer, the client 
would discard it from the cache. No other consumer should be receiving a newer 
message with the same key until the client has acknowledged the last message 
with the message key.

To handle the ordering constraints and cover the requirement of notifying the 
clients about message key assignment, I'm working on a design that would cover 
this together with covering the goal of addressing the currently unoptimal 
"recently joined consumers" rules which could block message delivery when it's 
not necessary.

The exact details are still unknown, but I'm slowly making progress towards 
some concepts that would help work on the initial design.

One of the concepts that I'm thinking of is a "hash assignment generation," 
which is the assignment of consumers to the hashes at a particular point in 
time. When new consumers join and leave, a new generation would be formed. When 
the dispatcher sends messages to consumers, it could attach the hash of the 
message key and the hash assignment generation number to each message. In 
addition to this, each consumer would receive its own hash assignment 
information and the generation number of that assignment in a special marker 
message that is injected by the dispatcher into the messages that are delivered 
to the consumer.

A client library could use this information, for example, to flush client-side 
caches when it no longer continues to receive future messages for a particular 
key. The tracking of the key would be at the level of the hash for that key. 
When the consumer receives the hash information together with the message, 
there's no longer a need to recalculate the hash on the client side. This would 
allow also changing the hash algorithm on the broker side without a dependency 
in clients to know which hash algorithm was used.

I'll follow up with discussions on the dev mailing list which hopefully lead to 
a Pulsar Improvement Proposal (PIP) that could be addressed in the Pulsar 4.0 
timeline.

GitHub link: 
https://github.com/apache/pulsar/discussions/22912#discussioncomment-10599466

----
This is an automatically sent email for commits@pulsar.apache.org.
To unsubscribe, please send an email to: commits-unsubscr...@pulsar.apache.org

Reply via email to