[
https://issues.apache.org/jira/browse/APEXMALHAR-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569248#comment-15569248
]
Siyuan Hua commented on APEXMALHAR-2283:
----------------------------------------
There are a couple of solutions to "exactly-once". To me they are all different,
and each of them makes different assumptions.
First of all, if we can assume every message has a unique id, we can definitely
use that for dedup. But that is not always the case, and then we need the appid
and operatorid to do dedup.
This information can be stored either in the message key or in an extra topic.
I wouldn't say either of them is better than the other; it really depends on
the user's requirements.
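
To make the key-based variant concrete, here is a minimal sketch in Java. It is
only an illustration: the class and method names are hypothetical, the
per-operator sequence number is an assumption (the comment above only names
appid and operatorid), and the Kafka partition is modeled as an in-memory list
of keys rather than read through a real Kafka client.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these names come from Malhar or the Kafka client. */
final class KeyedDedupSketch {

    /** Builds the dedup key that would be stored as the message key or in an extra topic. */
    static String dedupKey(String appId, int operatorId, long sequence) {
        // The sequence number is an assumption added to tell messages of one operator apart.
        return appId + "/" + operatorId + "/" + sequence;
    }

    /** Returns the replayed sequence numbers whose keys are not yet in the partition. */
    static List<Long> missing(List<String> keysInPartition, String appId,
                              int operatorId, List<Long> replayed) {
        Set<String> seen = new HashSet<>(keysInPartition);
        List<Long> toWrite = new ArrayList<>();
        for (long sequence : replayed) {
            if (!seen.contains(dedupKey(appId, operatorId, sequence))) {
                toWrite.add(sequence);
            }
        }
        return toWrite;
    }
}
```

Because the operatorid is part of the key, a message written by another
operator with the same sequence number does not hide this operator's message.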
And no matter what we do, as long as several operators write to the same Kafka
partition, I'm afraid there is no way to do perfect dedup: either we don't know
a safe place to start dedup from, or it is too late to do dedup (there is too
much noise after the safe place if the other operator instances are fully
loaded).
I don't remember the details of the hashcode approach, but I think a hashcode
will produce some false positives?
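
A small Java sketch of why comparing only hash codes can give a false positive.
The assumption here is that the hash is Java's String.hashCode; the hash used
in the proposal under discussion may differ, and the class and method names are
hypothetical.

```java
import java.util.Set;

/** Hypothetical sketch of dedup that remembers only hash codes, not payloads. */
final class HashDedupSketch {

    /** Records the payload's hash and returns true if that hash was already recorded. */
    static boolean looksLikeDuplicate(Set<Integer> seenHashes, String payload) {
        // Set.add returns false when the element was already present.
        return !seenHashes.add(payload.hashCode());
    }
}
```

The strings "Aa" and "BB" both have the String.hashCode value 2112, so after
"Aa" has been recorded, "BB" is reported as a duplicate and would be dropped
even though it is a different message.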
And there is another solution: create the topic and its partitions
automatically, based on the number of Kafka operator instances. This case is
much easier, because we always know which messages need to be deduped: each
operator writes to only one Kafka partition. This solution is, in my
understanding, the most reliable, and some users might want it. But the metadata
of that Kafka topic is created more or less automatically, and it is very hard
to support dynamic partitioning in this case.
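
A minimal Java sketch of recovery in the one-operator-per-partition case. It
rests on assumptions the comment does not state: the operator checkpoints the
partition offset together with its state, and replay after a failure is
deterministic (same tuples, same order). The partition is modeled as an
in-memory list, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the partition is a plain list, not a Kafka client call. */
final class SingleWriterRecoverySketch {

    /**
     * Returns the replayed tuples that still need to be written. Everything in the
     * partition after the checkpointed offset was written by this operator alone,
     * so it must be a prefix of the replayed tuples.
     */
    static <T> List<T> remaining(List<T> partitionLog, int checkpointedOffset,
                                 List<T> replayed) {
        int alreadyWritten = partitionLog.size() - checkpointedOffset;
        if (alreadyWritten < 0 || alreadyWritten > replayed.size()) {
            throw new IllegalStateException("partition tail does not match the replayed tuples");
        }
        for (int i = 0; i < alreadyWritten; i++) {
            if (!partitionLog.get(checkpointedOffset + i).equals(replayed.get(i))) {
                throw new IllegalStateException("replay differs at position " + i);
            }
        }
        return new ArrayList<>(replayed.subList(alreadyWritten, replayed.size()));
    }
}
```

With a single writer the tail of the partition contains no tuples from other
operators, which is what makes skipping a simple prefix safe here.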
That's the reason why I say there is no general solution for an exactly-once
Kafka output operator. We may need to provide different solutions in "examples"
for people to choose from.
Anyway, Sandesh, can you wrap up the current solution, post it, and discuss it
on the mailing list?
> Refactor kafka output operator
> ------------------------------
>
> Key: APEXMALHAR-2283
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2283
> Project: Apache Apex Malhar
> Issue Type: Improvement
> Reporter: Siyuan Hua
> Assignee: Siyuan Hua
>
> The abstract kafka output operator needs to be refactored
> 1. Some mandatory properties need to be set at the operator level instead of
> the Kafka property level.
> 2. More documentation and examples
> 3. Find a standard way to achieve exactly once in both 0.8 and 0.9
> More will be added when working on the ticket
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)