[ https://issues.apache.org/jira/browse/STORM-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258781#comment-16258781 ]

Anton Alfred edited comment on STORM-2824 at 11/20/17 3:32 AM:
---------------------------------------------------------------

Setting topology.acker.executors to 0 - yes, this is exactly the config I was 
looking for. Thanks for pointing it out.
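For reference, a minimal config fragment showing that setting (in storm.yaml, or per topology; the Java API equivalent is Config.setNumAckers(0)):

```yaml
# storm.yaml or per-topology config override.
# With zero acker executors, spouts treat every emitted tuple as
# acked immediately, so failed tuples are never replayed.
topology.acker.executors: 0
```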

Juntaek,
With regard to exactly-once processing: yes, I agree it's a difficult problem, 
and solving it depends on what is acceptable to the end user.

Please refer to the comment on [Failures for 
STORM-2823|https://issues.apache.org/jira/browse/STORM-2823]

Even on an API server, if the server goes down while processing a record, 
there is no guarantee that the data will not be lost. If we do have time to 
log the error, we can at least tell that the record could not be persisted.

In your case, with a topology T1 that has Spout1, Bolt1, and Bolt2: when the 
worker goes down we might lose the record, depending on what state the worker 
was in. If it was in Spout1 before acking, the record would be replayed; if it 
was in Spout1 after acking, it would be lost and the error logged inside 
Spout1. But these cases are very rare.
In most cases the issue is with the data itself, and if we do the error 
handling in every bolt correctly we can simulate exactly-once and be done with 
it, which is the reason for this flag.
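A minimal sketch of that difference (plain Python, not the Storm API; the bolt and policy names are made up for illustration): the default policy replays a failed tuple from the spout, while an "always ack" policy logs the error and acks anyway, so the tuple is never replayed:

```python
# Sketch: one failing record under two ack policies.
# "replay" mimics at-least-once: a failed tuple goes back to the spout.
# "always_ack" logs the error and acks, so the tuple is never replayed.

def run_topology(records, bolt, policy, max_replays=3):
    done, errors = [], []
    queue = list(records)
    replays = {r: 0 for r in records}
    while queue:
        rec = queue.pop(0)
        try:
            bolt(rec)
            done.append(rec)                    # bolt succeeded -> ack
        except Exception as e:
            if policy == "always_ack":
                errors.append((rec, str(e)))    # log the error, ack anyway
                done.append(rec)
            elif replays[rec] < max_replays:
                replays[rec] += 1
                queue.append(rec)               # fail -> spout replays it
            else:
                errors.append((rec, "gave up"))
    return done, errors

def flaky_bolt(rec):
    if rec == "bad":
        raise ValueError("could not persist " + rec)

done, errors = run_topology(["ok", "bad"], flaky_bolt, "always_ack")
# "bad" is acked despite the error, with the error logged; no replay.
```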

As for lost records: we can track all the bolts manually and, once finished, 
send a notification to the calling system. This way we also know which records 
were processed and which were not, so the calling system knows which data to 
resend; in the worst case, where the topology could not track the record, the 
problem is passed back to the source. Agreed, this is a lot of work, but it 
has to be done if exactly-once is required.
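That manual tracking could be sketched like this (plain Python, hypothetical names): each record carries an ID, every bolt reports completion, and the report tells the calling system which records finished and which it must resend:

```python
# Sketch: track each record through all bolts and report back to the
# calling system which records completed and which must be resent.

class RecordTracker:
    def __init__(self, bolts):
        self.bolts = set(bolts)
        self.progress = {}       # record_id -> set of bolts that finished it

    def start(self, record_id):
        self.progress[record_id] = set()

    def bolt_done(self, record_id, bolt):
        self.progress[record_id].add(bolt)

    def report(self):
        # Completed records are acknowledged to the caller; pending ones
        # must be sent again by the source.
        completed = {r for r, b in self.progress.items() if b == self.bolts}
        pending = set(self.progress) - completed
        return completed, pending

tracker = RecordTracker(["Bolt1", "Bolt2"])
for rec in ("r1", "r2"):
    tracker.start(rec)
tracker.bolt_done("r1", "Bolt1")
tracker.bolt_done("r1", "Bolt2")
tracker.bolt_done("r2", "Bolt1")    # Bolt2 never finished r2
completed, pending = tracker.report()
# completed == {"r1"}, pending == {"r2"}: the caller resends r2.
```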






> Ability to configure topologies for exactly once processing
> -----------------------------------------------------------
>
>                 Key: STORM-2824
>                 URL: https://issues.apache.org/jira/browse/STORM-2824
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 1.0.1
>         Environment: CentOS 7, Docker
>            Reporter: Anton Alfred
>            Priority: Minor
>
> The default implementation of a spout (Kafka) is to wait for an 
> acknowledgement; if the acknowledgement is not provided, the tuple is 
> replayed, leading to an at-least-once processing model.
> Can an option be provided to always acknowledge, even in the event of an 
> error in any spout or bolt, and let the user decide which mode the topology 
> should be configured in?
> There are cases with multiple bolts (B) inserting into persistent stores 
> (PS), like B1-PS1, B2-PS2, B3-PS3; the fact that the B2-PS2 bolt fails 
> doesn't mean the tuple needs to be replayed, which leads to complexity in 
> the bolts' logic. It would be easier if this were configurable and the user 
> of the topology decided which style to choose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
