Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-03 Thread Liam Clarke
And to share my experience of doing similar - certain messages on our
system must not be duplicated, but as they are bounced back to us from
third parties, duplication is inevitable. So I deduplicate them using Spark
Structured Streaming's flatMapGroupsWithState, keyed on a business key
derived from the message.

Kind regards,

Liam Clarke
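The core of the approach Liam describes - stateful dedup keyed on a business key derived from the message - can be sketched in plain Python, with an in-memory set standing in for the per-key state flatMapGroupsWithState manages (the message shape and key-derivation fields are illustrative assumptions, not from the original post):

```python
import hashlib

def business_key(message: dict) -> str:
    """Derive a stable business key from message fields.
    The choice of fields here is an illustrative assumption."""
    raw = f"{message['order_id']}|{message['event_type']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def deduplicate(messages):
    """Yield each message at most once per business key, mimicking
    what flatMapGroupsWithState does with per-key state (here just
    an in-memory set; the real thing is fault-tolerant)."""
    seen = set()
    for msg in messages:
        key = business_key(msg)
        if key not in seen:
            seen.add(key)
            yield msg

msgs = [
    {"order_id": 1, "event_type": "created"},
    {"order_id": 1, "event_type": "created"},   # bounced-back duplicate
    {"order_id": 1, "event_type": "shipped"},
]
unique = list(deduplicate(msgs))
print(len(unique))  # 2
```

In real Spark the state would also need a timeout/eviction policy so the seen-keys state does not grow without bound.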



Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-03 Thread Hans Jespersen
Ok what you are describing is different from accidental duplicate message
pruning, which is what the idempotent publish feature does.

You are describing a situation where multiple independent messages just happen
to have the same contents (both key and value).

Removing those messages is an application-specific function, as you can imagine
applications which would not want independent but identical messages to be
removed (for example temperature sensor readings, heartbeat messages, or other
telemetry data that has repeated but independent values).

Your best bet is to write a simple intermediate processor that implements your
message pruning algorithm of choice and republishes (or not) to another topic
that your consumers read from. It's a stateful app because it needs to remember
1 or more past messages, but that can be done using the Kafka Streams Processor
API and the embedded RocksDB state store that comes with Kafka Streams (or as a
UDF in KSQL).

You can alternatively write your consuming apps to implement similar message
pruning functionality themselves and avoid one extra component in the
end-to-end architecture.

-hans
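A minimal sketch of the intermediate pruning processor Hans describes, in plain Python with lists standing in for the input and output topics (the real thing would use the Kafka Streams Processor API with a persistent RocksDB state store; this is just the pruning logic):

```python
def prune_duplicates(input_topic, state=None):
    """Republish only records whose (key, value) pair has not been
    seen before. `state` stands in for a persistent state store
    that remembers past messages across restarts."""
    if state is None:
        state = set()
    output_topic = []
    for key, value in input_topic:
        if (key, value) not in state:
            state.add((key, value))
            output_topic.append((key, value))
    return output_topic

records = [("k1", "v1"), ("k1", "v1"), ("k1", "v2"), ("k2", "v1")]
deduped = prune_duplicates(records)
print(deduped)  # [('k1', 'v1'), ('k1', 'v2'), ('k2', 'v1')]
```

Consumers then read the deduplicated output topic instead of the raw one.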



Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-03 Thread Dimitry Lvovsky
I've done this using Kafka Streams: specifically, I created a processor,
and used a key-value state store (a feature of Streams) to save/check for keys,
only forwarding messages whose keys were not already in the store.
Since the state store is in memory, and backed by the local filesystem on the
node the processor is running on, you avoid the network lag you'd have using an
external store like Cassandra. I think you'll have to use a similar approach to
dedupe -- you don't necessarily need to use Streams; you can handle it
directly in your consumer, but then you'll have to solve a lot of problems
Streams already handles, such as what happens if your node is shut down
or crashes, etc.
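The in-memory-but-disk-backed store Dimitry describes can be approximated in plain Python with a set plus an append-only log file for recovery after a restart (file layout and class name are illustrative assumptions; Kafka Streams gets the same durability from RocksDB plus a changelog topic):

```python
import os
import tempfile

class SeenKeyStore:
    """In-memory set of seen keys, backed by an append-only log
    so state survives a restart - a crude stand-in for a Kafka
    Streams state store with its changelog topic."""
    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):          # recover state after a restart
            with open(path) as f:
                self.seen = {line.rstrip("\n") for line in f}

    def check_and_add(self, key: str) -> bool:
        """Return True if the key is new, recording it durably."""
        if key in self.seen:
            return False
        self.seen.add(key)
        with open(self.path, "a") as f:   # persist before forwarding
            f.write(key + "\n")
        return True

path = os.path.join(tempfile.mkdtemp(), "seen_keys.log")
store = SeenKeyStore(path)
forwarded = [k for k in ["a", "b", "a", "c"] if store.check_and_add(k)]
# simulate a crash/restart: a fresh instance recovers state from disk
store2 = SeenKeyStore(path)
```

After the simulated restart, `store2` still rejects "a", "b", and "c" as duplicates.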



Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-03 Thread Vincent Maurin
Hi,

The idempotence flag guarantees that the message is produced exactly once
on the topic, i.e. that running your command a single time will produce
a single message.
It is not a uniqueness enforcement on the message key; there is no such thing
in Kafka.

In Kafka, a topic contains the "history" of values for a given key. That
means that a consumer needs to consume the whole topic and keep only the
last value for a given key.
So the uniqueness concept is meant to be handled on the consumer side.
Additionally, Kafka can perform log compaction to keep only the
last value and preserve disk space (but consumers may still receive
duplicates).


Best
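Vincent's point - the topic is a history, and the consumer keeps only the last value per key, which is also the state log compaction converges toward - can be sketched as follows (records modeled as (key, value) tuples, an illustrative simplification):

```python
def latest_by_key(records):
    """Replay a topic and keep only the last value per key -
    the view a consumer materializes, and the state that log
    compaction eventually converges toward."""
    latest = {}
    for key, value in records:   # records arrive in offset order
        latest[key] = value      # later values overwrite earlier ones
    return latest

topic = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k1", "v2")]
view = latest_by_key(topic)
print(view)  # {'k1': 'v2', 'k2': 'v1'}
```

Note the duplicate ("k1", "v2") record is harmless here: applying the same upsert twice leaves the view unchanged, which is why last-value-wins consumers tolerate duplicates.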

>


Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-02 Thread jim . meyer




Also... the reason for my question is that we are going to have two JMS topics 
with nearly redundant data in them, and have the UNION of the two written to 
Kafka for further processing.



Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-02 Thread jim . meyer




By the way I tried this...
echo "key1:value1" | ~/kafka/bin/kafka-console-producer.sh --broker-list 
localhost:9092 --topic TestTopic --property "parse.key=true" --property 
"key.separator=:" --property "enable.idempotence=true" > /dev/null

And... that didn't seem to do the trick - after running that command multiple 
times I did receive key1 value1 for as many times as I had run the prior 
command.

Maybe it is the way I am setting the flags...
Recently I saw that someone did this...
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 
--producer-property enable.idempotence=true --request-required-acks -1


Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-02 Thread jim . meyer




Hi Hans,

Is there some documentation or an example with source code where I can 
learn more about this feature and how it is implemented?

Thanks,
Jim


Re: Something like a unique key to prevent same record from being inserted twice?

2019-04-02 Thread Hans Jespersen
yes. Idempotent publish uses a unique messageID to discard potential duplicate 
messages caused by failure conditions when publishing.

-hans  
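Conceptually, the broker-side dedup behind idempotent publish works by giving each producer an ID and attaching a per-partition sequence number to each batch; the broker discards any batch whose sequence number it has already accepted. A toy model in Python (not actual broker code; names and return values are illustrative):

```python
class Partition:
    """Toy model of broker-side idempotent-producer dedup: accept a
    batch only if its sequence number is exactly one past the last
    sequence accepted for that producer ID."""
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> last accepted sequence

    def append(self, producer_id, seq, record):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"        # a retry of an already-written batch
        if seq != last + 1:
            return "out_of_sequence"  # a batch was lost in between
        self.last_seq[producer_id] = seq
        self.log.append(record)
        return "ok"

p = Partition()
r1 = p.append("pid-1", 0, "msg-a")   # "ok"
r2 = p.append("pid-1", 0, "msg-a")   # "duplicate" - a network retry
r3 = p.append("pid-1", 1, "msg-b")   # "ok"
```

This is why idempotence only prunes duplicates caused by producer retries: two logically independent sends get distinct sequence numbers, so both are written.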



Something like a unique key to prevent same record from being inserted twice?

2019-04-02 Thread jim . meyer
Does Kafka have something that behaves like a unique key so a producer can’t 
write the same value to a topic twice?