Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/1668
  
    Thanks for the reminder, I went over the code today. The code looks mostly 
good, but here are some thoughts:
    
      - The head task supports only one concurrent checkpoint. In general, the 
tasks need to support multiple checkpoints being in progress at the same time. 
This frequently happens when people trigger a savepoint while a checkpoint is 
still running. I think that is important to support.
    
      - The tail task offers the elements to the blocking queue. That means 
records are simply dropped if the capacity-bounded queue (one element) is not 
polled by the head task in time.
    
      - With the capacity bound in the feedback queue, it is pretty easy to 
build a full deadlock. Just use a loop function that explodes data into the 
feedback channel.
    
      - Recent code also introduced the ability to change parallelism. What are 
the semantics here when the parallelism of the loop is changed?
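    
    To illustrate the dropped-records point above, here is a minimal standalone sketch. It is not the code from the pull request: the class and method names are hypothetical, and it only assumes a `java.util.concurrent.ArrayBlockingQueue` with capacity one standing in for the feedback queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class, not part of Flink.
public class FeedbackQueueDemo {

    // Non-blocking offer(): returns false when the queue is full,
    // so a record is silently lost unless the caller handles it.
    static int countDroppedWithOffer(int records) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        int dropped = 0;
        for (int i = 0; i < records; i++) {
            if (!queue.offer(i)) {
                dropped++;
            }
        }
        return dropped;
    }

    // Blocking put(): the producer waits until the consumer has taken
    // the previous element, so every record is delivered.
    static List<Integer> deliverWithPut(int records) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        List<Integer> received = new ArrayList<>();
        // Stand-in for the head task polling the feedback queue.
        Thread head = new Thread(() -> {
            try {
                for (int i = 0; i < records; i++) {
                    received.add(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        head.start();
        for (int i = 0; i < records; i++) {
            queue.put(i);
        }
        head.join(); // join() makes the consumer's writes visible here
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("dropped with offer: " + countDroppedWithOffer(5));
        System.out.println("received with put: " + deliverWithPut(5));
    }
}
```

    With nobody polling, `offer` keeps the first record and rejects the other four. Switching to `put` avoids the loss, but the producer then blocks on a full queue, which is exactly how the bounded feedback queue can turn into the deadlock described above.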
    
    Since loops did not support any fault tolerance guarantees before, I guess 
this does improve recovery behavior. But as long as the loops can either 
deadlock or drop data, the hard guarantees are in the end still a bit weak. So 
that leaves me a bit ambivalent about what to do with this pull request.


