And to share my experience of doing something similar: certain messages on our system must not be duplicated, but as they are bounced back to us from third parties, duplication is inevitable. So I deduplicate them using Spark Structured Streaming's flatMapGroupsWithState, keyed on a business key derived from the message.
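For what it's worth, here's a minimal sketch of that pattern in Scala. The topic names, the Message case class, the use of the record key as the business key, and the 24-hour state expiry are all illustrative assumptions, not our actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Message(businessKey: String, payload: String)

object DedupJob {

  // Forward a message only the first time its business key is seen; the
  // Boolean state just records that the key has already passed through.
  def dedup(key: String, messages: Iterator[Message],
            state: GroupState[Boolean]): Iterator[Message] =
    if (state.hasTimedOut) {
      state.remove()                        // key expired: forget it
      Iterator.empty
    } else if (state.exists) {
      Iterator.empty                        // bounced-back duplicate: drop it
    } else {
      state.update(true)
      state.setTimeoutDuration("24 hours")  // bound the state store's growth
      messages.take(1)                      // also collapses dupes within a batch
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dedup").getOrCreate()
    import spark.implicits._

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()
      .selectExpr("CAST(key AS STRING) AS businessKey",
                  "CAST(value AS STRING) AS payload")
      .as[Message]
      .groupByKey(_.businessKey)
      .flatMapGroupsWithState(OutputMode.Append,
                              GroupStateTimeout.ProcessingTimeTimeout)(dedup)
      .selectExpr("businessKey AS key", "payload AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "deduped-topic")
      .option("checkpointLocation", "/tmp/dedup-checkpoint")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}

The timeout is what keeps the state store from growing without bound - it effectively defines the dedup window.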
Kind regards,

Liam Clarke

On Thu, Apr 4, 2019 at 4:09 AM Hans Jespersen <h...@confluent.io> wrote:

> OK, what you are describing is different from the accidental duplicate
> message pruning that the idempotent publish feature does.
>
> You are describing a situation where multiple independent messages just
> happen to have the same contents (both key and value).
>
> Removing those messages is an application-specific function, as you can
> imagine applications which would not want independent but identical
> messages to be removed (for example temperature sensor readings,
> heartbeat messages, or other telemetry data that has repeated but
> independent values).
>
> Your best bet is to write a simple intermediate processor that
> implements your message pruning algorithm of choice and republishes (or
> not) to another topic that your consumers read from. It's a stateful app
> because it needs to remember one or more past messages, but that can be
> done using the Kafka Streams Processor API and the embedded RocksDB
> state store that comes with Kafka Streams (or as a UDF in KSQL).
>
> You can alternatively write your consuming apps to implement similar
> message pruning functionality themselves and avoid one extra component
> in the end-to-end architecture.
>
> -hans
>
> > On Apr 2, 2019, at 7:28 PM, jim.me...@concept-solutions.com
> > <jim.me...@concept-solutions.com> wrote:
> >
> >> On 2019/04/02 22:43:31, jim.me...@concept-solutions.com
> >> <jim.me...@concept-solutions.com> wrote:
> >>
> >>> On 2019/04/02 22:25:16, jim.me...@concept-solutions.com
> >>> <jim.me...@concept-solutions.com> wrote:
> >>>
> >>>> On 2019/04/02 21:59:21, Hans Jespersen <h...@confluent.io> wrote:
> >>>> Yes. Idempotent publish uses a unique message ID to discard
> >>>> potential duplicate messages caused by failure conditions when
> >>>> publishing.
> >>>>
> >>>> -hans
> >>>>
> >>>>> On Apr 1, 2019, at 9:49 PM, jim.me...@concept-solutions.com
> >>>>> <jim.me...@concept-solutions.com> wrote:
> >>>>>
> >>>>> Does Kafka have something that behaves like a unique key so a
> >>>>> producer can’t write the same value to a topic twice?
> >>>
> >>> Hi Hans,
> >>>
> >>> Is there some documentation or an example with source code where I
> >>> can learn more about this feature and how it is implemented?
> >>>
> >>> Thanks,
> >>> Jim
> >>
> >> By the way, I tried this...
> >>
> >> echo "key1:value1" | ~/kafka/bin/kafka-console-producer.sh \
> >>   --broker-list localhost:9092 --topic TestTopic \
> >>   --property "parse.key=true" --property "key.separator=:" \
> >>   --property "enable.idempotence=true" > /dev/null
> >>
> >> And... that didn't seem to do the trick - after running that command
> >> multiple times I did receive key1 value1 for as many times as I had
> >> run the prior command.
> >>
> >> Maybe it is the way I am setting the flags...
> >> Recently I saw that someone did this...
> >>
> >> bin/kafka-console-producer.sh --broker-list localhost:9092 \
> >>   --topic test --producer-property enable.idempotence=true \
> >>   --request-required-acks -1
> >
> > Also... the reason for my question is that we are going to have two
> > JMS topics with nearly redundant data in them, and their UNION will
> > be written to Kafka for further processing.
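PS, for anyone finding this thread later: here's a minimal sketch of the intermediate pruning processor Hans describes, in Scala against the (2019-era) Kafka Streams transform() API with a persistent RocksDB-backed store. The topic names, String serdes, and "first occurrence wins" policy are assumptions, not a prescribed implementation:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, KeyValue, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Transformer, TransformerSupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

object DedupProcessor extends App {
  val storeName = "seen-keys"
  val builder = new StreamsBuilder

  // Persistent (RocksDB-backed) store of business keys already forwarded
  builder.addStateStore(
    Stores.keyValueStoreBuilder(
      Stores.persistentKeyValueStore(storeName),
      Serdes.String(), Serdes.String()))

  builder.stream[String, String]("input-topic")
    .transform(new TransformerSupplier[String, String, KeyValue[String, String]] {
      override def get(): Transformer[String, String, KeyValue[String, String]] =
        new Transformer[String, String, KeyValue[String, String]] {
          private var store: KeyValueStore[String, String] = _

          override def init(context: ProcessorContext): Unit =
            store = context.getStateStore(storeName)
              .asInstanceOf[KeyValueStore[String, String]]

          // Republish a record only the first time its key is seen;
          // returning null drops the record from the stream.
          override def transform(key: String, value: String): KeyValue[String, String] =
            if (store.get(key) != null) null // already seen: prune
            else { store.put(key, value); new KeyValue(key, value) }

          override def close(): Unit = ()
        }
    }, storeName)
    .to("deduped-topic")

  val props = new Properties
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-app")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  new KafkaStreams(builder.build(), props).start()
}

Consumers then read from deduped-topic instead of input-topic. In production you would also want to expire old entries (for example with a windowed store) so the store doesn't grow forever.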
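And a note on the console-producer experiments quoted above: --property flags are handed to the console producer's message reader (which is why parse.key and key.separator belong there), not to the producer itself, so enable.idempotence set that way has no effect; it has to go through --producer-property, as in the second command. Programmatically, enabling it looks roughly like this (topic, key, and value taken from the experiment above):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object IdempotentProducerExample extends App {
  val props = new Properties
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  // Suppresses duplicates caused by the producer's own retries on failure
  props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord("TestTopic", "key1", "value1"))
  producer.close()
}

Even with idempotence enabled, though, only duplicates from internal retries of a single send() are discarded - calling send() twice with the same key and value still writes two records, which is why the pruning has to happen downstream as Hans describes.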