[ https://issues.apache.org/jira/browse/FLINK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aljoscha Krettek updated FLINK-7949: ------------------------------------ Priority: Critical (was: Major) > AsyncWaitOperator is not restarting when queue is full > ------------------------------------------------------ > > Key: FLINK-7949 > URL: https://issues.apache.org/jira/browse/FLINK-7949 > Project: Flink > Issue Type: Bug > Components: Streaming > Affects Versions: 1.3.2 > Reporter: Bartłomiej Tartanus > Priority: Critical > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > Issue was describe here: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-was-declined-tasks-not-ready-td16066.html > Issue - AsyncWaitOperator can't restart properly after failure (thread is > waiting forever) > Scenario to reproduce this issue: > 1. The queue is full (let's assume that its capacity is N elements) > 2. There is some pending element waiting, so the > pendingStreamElementQueueEntry field in AsyncWaitOperator is not null and > while-loop in addAsyncBufferEntry method is trying to add this element to > the queue (but element is not added because queue is full) > 3. Now the snapshot is taken - the whole queue of N elements is being > written into the ListState in snapshotState method and also (what is more > important) this pendingStreamElementQueueEntry is written to this list too. > 4. The process is being restarted, so it tries to recover all the elements > and put them again into the queue, but the list of recovered elements hold > N+1 element and our queue capacity is only N. Process is not started yet, so > it can not process any element and this one element is waiting endlessly. > But it's never added and the process will never process anything. Deadlock. > 5. Trigger is fired and indeed discarded because the process is not running > yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029)