GitHub user gmethvin created a discussion: Allow topic compaction to discard 
messages with duplicate key

**Is your feature request related to a problem? Please describe.**

A producer is often interrupted while producing a series of messages to a 
topic, for example by an application restart or crash. In many cases it is 
useful for the topic to end up with only one message per key, even if the 
producer sends a key more than once. For example, if the messages represent emails to be 
sent, we may want only one email message to be sent to each email address on a 
list. Using sequence IDs may not be feasible in many such cases, because the 
underlying list is based on some dynamic features and is constantly changing, 
or because the underlying data store does not guarantee ordering. When the 
producer restarts, it should be able to start over from the beginning and 
have Pulsar ignore any message whose key the topic has already seen.

**Describe the solution you'd like**

Pulsar already supports topic compaction, in which it keeps only the *latest* 
message for each key. I propose that it also be possible, with some 
configuration, to keep the *earliest* message for a given key within the 
retention period. In other words, if Pulsar receives a new message whose key 
already exists in the topic, Pulsar will discard the new message.
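
To make the difference between the two policies concrete, here is a minimal sketch in plain Python. It is only a model of the semantics, not Pulsar code: a topic is represented as a list of `(key, payload)` tuples, the function names `compact_keep_latest` and `compact_keep_earliest` are hypothetical, and the retention period is not modeled at all.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (key, payload); a stand-in for a keyed message


def compact_keep_latest(messages: List[Message]) -> List[Message]:
    """Model of today's behaviour: only the last message per key survives."""
    # Later offsets overwrite earlier ones, so each key maps to its last offset.
    last_offset = {key: offset for offset, (key, _) in enumerate(messages)}
    return [m for offset, m in enumerate(messages) if last_offset[m[0]] == offset]


def compact_keep_earliest(messages: List[Message]) -> List[Message]:
    """Model of the proposed behaviour: only the first message per key survives."""
    seen = set()
    kept = []
    for key, payload in messages:
        if key in seen:
            continue  # a later message with a known key is discarded
        seen.add(key)
        kept.append((key, payload))
    return kept


if __name__ == "__main__":
    # The producer sent a@ and b@, restarted, and sent a@ again.
    log = [
        ("a@example.com", "first attempt"),
        ("b@example.com", "first attempt"),
        ("a@example.com", "after restart"),
    ]
    print(compact_keep_latest(log))
    # [('b@example.com', 'first attempt'), ('a@example.com', 'after restart')]
    print(compact_keep_earliest(log))
    # [('a@example.com', 'first attempt'), ('b@example.com', 'first attempt')]
```

With the keep-earliest policy, the message sent after the restart is the one that gets dropped, which is the behaviour the email example needs.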

**Describe alternatives you've considered**

It is also possible to achieve something similar by storing keys that have 
already been produced in some other data store, but that requires making sure 
the secondary data store is in sync with the messages in Pulsar.
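
A rough sketch of that alternative, again in plain Python, shows where the synchronization problem comes from. Everything here is an assumption for illustration: `send` is a stand-in for a real producer call, `seen_keys` is an in-memory set standing in for the secondary data store, and `produce_once_per_key` is a hypothetical helper.

```python
from typing import Callable, Iterable, Set, Tuple

Message = Tuple[str, str]  # (key, payload)


def produce_once_per_key(
    messages: Iterable[Message],
    send: Callable[[str, str], None],
    seen_keys: Set[str],
) -> int:
    """Send each key at most once, using seen_keys as the external record.

    Returns the number of messages actually sent in this run.
    """
    sent = 0
    for key, payload in messages:
        if key in seen_keys:
            continue  # already produced in an earlier run
        send(key, payload)
        # If the process dies right here, the message is in the topic but the
        # key is not recorded, so a restart would send it again. This is the
        # "keeping the two stores in sync" problem.
        seen_keys.add(key)
        sent += 1
    return sent


if __name__ == "__main__":
    topic = []  # stand-in for the Pulsar topic
    store = set()  # stand-in for the secondary data store
    recipients = [("a@example.com", "hello"), ("b@example.com", "hello"),
                  ("c@example.com", "hello")]

    # First run is interrupted after two messages.
    produce_once_per_key(recipients[:2], lambda k, p: topic.append((k, p)), store)
    # After the restart the producer starts over with the full list.
    produce_once_per_key(recipients, lambda k, p: topic.append((k, p)), store)
    print(len(topic))  # 3
```

This works only as long as the write to the topic and the write to the key store never diverge, which is the coordination burden the proposed compaction mode would remove.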


GitHub link: https://github.com/apache/pulsar/discussions/18842

----
This is an automatically sent email for dev@pulsar.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@pulsar.apache.org
