[ https://issues.apache.org/jira/browse/FLINK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Romain Revol updated FLINK-8717:
--------------------------------
    Description: 
We are encountering what looks like a Flink deadlock in one of our streaming 
jobs that uses "iterate".

I've reduced the job to the minimal example in this gist:
[https://gist.github.com/rrevol/06ddfecd5f5ac7cbc67785b5d3a84dd4]

Note that:
 * varying the parallelism changes how quickly the deadlock occurs, but it 
always occurs eventually;
 * varying MAX_LOOP_NB also affects the deadlock: the higher it is, the sooner 
we hit it. With MAX_LOOP_NB == 1 there is no deadlock, which suggests that it 
happens once the number of iterations reaches some threshold.

From the [^threadDump.txt], it looks like starvation in buffer allocation; 
perhaps backpressure does not work correctly with iterate, but I may be 
mistaken since I don't know Flink internals well.
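The buffer-starvation hypothesis above can be illustrated with a toy model of a loop whose buffers are bounded. This is a hypothetical Python sketch, not Flink code and not the job from the gist: the function `simulate_loop`, the single buffer `capacity`, and the rule that the external source fills the loop body's input before the loop head can re-inject feedback records are all assumptions made for illustration; `max_loops` plays the role of MAX_LOOP_NB. In this model, MAX_LOOP_NB == 1 never uses the feedback buffer and always finishes, while larger values can end in a circular wait: the body cannot write to the full feedback buffer, and the head cannot write to the full forward buffer.

```python
from collections import deque


def simulate_loop(num_records, max_loops, capacity, max_steps=100_000):
    """Toy model (hypothetical, not Flink code) of an iteration with bounded buffers.

    Returns (finished, emitted, steps). finished is False when no stage can
    make progress any more, i.e. the model is deadlocked.
    Assumes capacity >= 1.
    """
    source = deque(range(num_records))  # records not yet admitted to the loop
    forward = deque()   # bounded input buffer of the loop body
    feedback = deque()  # bounded buffer from the body back to the loop head
    held = None         # (id, passes) processed by the body, not yet written out
    emitted = 0

    for step in range(1, max_steps + 1):
        progress = False

        # Assumption: the external source claims free space in the body's
        # input buffer before the loop head gets a chance to.
        if source and len(forward) < capacity:
            forward.append((source.popleft(), 0))
            progress = True

        # The loop head re-injects a feedback record only if space is left.
        if feedback and len(forward) < capacity:
            forward.append(feedback.popleft())
            progress = True

        # The body takes a new record only after writing out the previous one.
        if held is None and forward:
            record_id, passes = forward.popleft()
            held = (record_id, passes + 1)
            progress = True

        if held is not None:
            if held[1] >= max_loops:
                emitted += 1            # record leaves the loop
                held = None
                progress = True
            elif len(feedback) < capacity:
                feedback.append(held)   # record goes around the loop again
                held = None
                progress = True

        if emitted == num_records:
            return True, emitted, step
        if not progress:
            # Only reachable when the body holds a record it cannot write to the
            # full feedback buffer while the full forward buffer blocks the head
            # (and the source, if it still has records): a circular wait.
            return False, emitted, step

    return False, emitted, max_steps


print(simulate_loop(100, max_loops=1, capacity=2))   # (True, 100, 100)
print(simulate_loop(100, max_loops=10, capacity=2))  # (False, 0, 6)
```

The model says nothing about why parallelism changes how quickly the real job deadlocks; it only shows how bounded buffers arranged in a cycle can end up waiting on each other.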

> Flink seems to deadlock due to buffer starvation when iterating
> ---------------------------------------------------------------
>
>                 Key: FLINK-8717
>                 URL: https://issues.apache.org/jira/browse/FLINK-8717
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.4.0
>         Environment: Windows 10 Pro 64-bit
> Core i7-6820HQ @ 2.7 GHz
> 16GB RAM
> Flink 1.4
> Scala client
> Scala 2.11.7
>  
>            Reporter: Romain Revol
>            Priority: Major
>         Attachments: threadDump.txt

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
