[jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers

ASF GitHub Bot (JIRA) Mon, 30 Apr 2018 10:02:24 -0700

    [ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458767#comment-16458767 ]


ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

Github user sihuazhou commented on the issue:

    https://github.com/apache/flink/pull/5931
  
    Hi @GJL, is it possible that the cause is the same as in the previous PR 
for this ticket? That is, the container was set up successfully and connected 
to the ResourceManager, but the TM was killed before it connected to the 
JobManager. In this case, even though there are enough TMs, the JobManager 
won't fire any new request, and the ResourceManager doesn't know that the 
container it assigned to the JobManager has been killed either, so both the 
JobManager and the ResourceManager do nothing but wait for the timeout... What 
do you think?
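
To make the suspected gap concrete, here is a minimal toy model of the bookkeeping involved. It is only a sketch and not Flink code: `ToyResourceManager`, its method names, and the re-request rule are all hypothetical, chosen to illustrate how a container that dies between being assigned and its TM reaching the JobManager could still trigger a replacement request.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a resource manager; NOT Flink's YarnResourceManager."""

    # Containers handed out whose TM has not yet connected to the JobManager.
    awaiting_job_manager: set = field(default_factory=set)
    # Containers whose TM has connected to the JobManager.
    connected: set = field(default_factory=set)
    # Number of replacement container requests issued so far.
    pending_requests: int = 0

    def on_container_allocated(self, container_id: str) -> None:
        self.awaiting_job_manager.add(container_id)

    def on_task_manager_connected(self, container_id: str) -> None:
        self.awaiting_job_manager.discard(container_id)
        self.connected.add(container_id)

    def on_container_completed(self, container_id: str) -> None:
        if container_id in self.awaiting_job_manager:
            # Assumed rule: the TM died before reaching the JobManager, so
            # nobody else will ask for a replacement; request one here.
            self.awaiting_job_manager.discard(container_id)
            self.pending_requests += 1
        else:
            # Any other case is out of scope for this sketch; just forget it.
            self.connected.discard(container_id)


rm = ToyResourceManager()
rm.on_container_allocated("container_1")
rm.on_container_completed("container_1")  # killed before reaching the JobManager
print(rm.pending_requests)
```

In this toy model the early kill of `container_1` results in one replacement request instead of both sides waiting for the timeout. Whether the real fix should live in the ResourceManager or elsewhere is exactly the open question above.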


> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if 
> {{TaskManagers}} are killed rapidly in succession. After 5 minutes the job is 
> restarted due to {{NoResourceAvailableException}}, and the job runs normally 
> afterwards. I suspect that {{TaskManager}} failures are not registered if the 
> failure occurs before the {{TaskManager}} registers with the master. Logs are 
> attached; I added additional log statements to 
> {{YarnResourceManager.onContainersCompleted}} and 
> {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed 
> and keep requesting new containers. The job should run as soon as resources 
> are available. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
