[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744648#comment-15744648 ]
ASF GitHub Bot commented on FLINK-3257:
---------------------------------------

Github user senorcarbone commented on the issue:

https://github.com/apache/flink/pull/1668

These are some good points @StephanEwen, thanks for checking it. Here is what I suggest for each issue:

- `Concurrent Checkpoints`: This looks like an improvement, but I can certainly do it in this PR if it is crucial. Can you elaborate a bit more, or point me to other examples of operator state with concurrent checkpointing, so I get an idea of how you want to do it?
- `Reconfiguration`: Sounds interesting, but I am not aware of it from the dev list. If it is simple enough I could add support for it here. Otherwise I would suggest we address it as an improvement in a separate JIRA and PR. Is there a design document somewhere on how we plan to achieve reconfiguration and repartitioning for operator state specifically?
- `At-most-once blocking queue`: As my previous comments make clear, I do not approve of this part, but we already got rid of it in [FLIP-15](https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination) ([this](https://github.com/FouadMA/flink/commit/9adaac435bcaf3552afe564c739d4e8fd79c433b) commit). How about we address this together with the deadlocks in FLIP-15?
- `Deadlocks`: I like the elastic spilling channel idea for resolving deadlocks. I need time to dig a bit more into this and make sure we actually eliminate deadlocks rather than just make them less likely. Is it ok with you if we address that in [FLIP-15](https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination)? I need more time for this part; moreover, we need to combine the absence of expiring queues with a proper termination algorithm (otherwise we solve the deadlocks but the jobs never terminate).

What do you think?
> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
> Key: FLINK-3257
> URL: https://issues.apache.org/jira/browse/FLINK-3257
> Project: Flink
> Issue Type: Improvement
> Reporter: Paris Carbone
> Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution
> graph. An alternative scheme can potentially include records in transit
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work along
> the following lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager, block
> output and start upstream backup of all records forwarded from the
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource
> should finalize the snapshot, unblock its output, emit all records
> in transit in FIFO order and continue the usual execution.
> --
> Upon restart, the IterationSource should emit all records from the injected
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations are potentially possible, but
> this can serve as the initial implementation.
> [1] http://arxiv.org/abs/1506.08603

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
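The three steps and the restart rule in the issue description can be sketched as a small, single-threaded Java model of the iteration head. This is a hedged illustration, not Flink code: `CyclicAbsSketch`, `IterationHead`, `Barrier`, `Snapshot` and all their methods are hypothetical names, and the model assumes the head has already forwarded the barrier into the loop body (which is how it reaches the IterationSink in step 2). Network channels, operator state, barrier alignment on other inputs and concurrent checkpoints are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical model of ABS on a back-edge; not Flink's iteration task API. */
public class CyclicAbsSketch {

    /** Checkpoint barrier returning over the back-edge (hypothetical). */
    public record Barrier(long checkpointId) {}

    /** Finished snapshot: the records that were in transit on the back-edge. */
    public record Snapshot(long checkpointId, List<String> inTransit) {}

    public static final class IterationHead {
        private final Consumer<String> downstream;               // head output
        private final Deque<String> backup = new ArrayDeque<>(); // upstream backup log
        private final List<Snapshot> completed = new ArrayList<>();
        private long pendingCheckpoint = -1;                      // -1: no snapshot running

        public IterationHead(Consumer<String> downstream) {
            this.downstream = downstream;
        }

        /** Step 1: barrier triggered by the TaskManager; block output, start backup. */
        public void triggerBarrier(long checkpointId) {
            // Assumption: only one checkpoint at a time (no concurrent checkpoints).
            pendingCheckpoint = checkpointId;
            backup.clear();
        }

        /** A record forwarded by the IterationSink over the back-edge. */
        public void onFeedbackRecord(String record) {
            if (pendingCheckpoint >= 0) {
                backup.addLast(record);   // output is blocked: log the record
            } else {
                downstream.accept(record);
            }
        }

        /** Step 3: barrier came back from the IterationSink; finalize and unblock. */
        public void onFeedbackBarrier(Barrier barrier) {
            if (barrier.checkpointId() != pendingCheckpoint) {
                return;                   // ignore barriers of unknown checkpoints
            }
            completed.add(new Snapshot(barrier.checkpointId(), List.copyOf(backup)));
            pendingCheckpoint = -1;
            while (!backup.isEmpty()) {
                downstream.accept(backup.pollFirst()); // emit in FIFO order
            }
        }

        /** Restart: emit the snapshot's in-transit records before anything else. */
        public void restoreFrom(Snapshot snapshot) {
            for (String record : snapshot.inTransit()) {
                downstream.accept(record);
            }
        }

        public List<Snapshot> completedSnapshots() {
            return completed;
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        IterationHead head = new IterationHead(out::add);
        head.onFeedbackRecord("a");             // no snapshot yet: emitted directly
        head.triggerBarrier(1);                 // step 1
        head.onFeedbackRecord("b");             // logged while output is blocked
        head.onFeedbackRecord("c");
        head.onFeedbackBarrier(new Barrier(1)); // step 3 (step 2 happened in the loop)
        head.onFeedbackRecord("d");
        System.out.println(out);
        System.out.println(head.completedSnapshots());
    }
}
```

The same backup log serves two purposes in this sketch: it becomes the snapshot of in-transit records when the barrier returns, and it is flushed downstream in FIFO order so no record is lost or reordered. Replaying that log on restart is what lets records on the back-edge be covered by the snapshot, which the acyclic barrier scheme alone cannot do.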