[ 
https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287259#comment-14287259
 ] 

ASF GitHub Bot commented on FLINK-1352:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/328#issuecomment-71002640
  
    You are right @hsaputra, because I'm not sure which approach is the best. 
In the corresponding JIRA issue I have tried to summarize what I think are 
the pros and cons of two separate choices: indefinitely many registration 
tries vs. a limited number of tries, and a constant pause between tries vs. 
an increasing pause.
    
    Indefinitely many registration tries:
    Pros: If the JobManager becomes available at some point, then the 
TaskManager will definitely connect to it
    Cons: If the JobManager dies for some reason, then the TaskManager will 
linger around for all eternity or until it is stopped manually
    
    Limited number of tries:
    Pros: Will terminate itself after some time
    Cons: The time interval might be too short for the JobManager to get started
    
    Constant pause:
    Pros: Relatively quick response time
    Cons: Causes network traffic until the JobManager has been started
    
    Increasing pause:
    Pros: Reduction of network traffic if the JobManager takes a little bit 
longer to start
    Cons: Might delay the registration process if one interval was just missed
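
The four options above are two independent settings of one retry schedule. A minimal sketch in Python, not the code from the pull request: the function name `registration_delays`, its parameters, and all default values are assumptions chosen for illustration only.

```python
import itertools


def registration_delays(initial=0.5, factor=2.0, max_delay=30.0, max_tries=None):
    """Yield the pause (in seconds) to wait before each registration retry.

    factor == 1.0 gives a constant pause; factor > 1.0 gives an increasing
    pause that is capped at max_delay. max_tries=None retries indefinitely.
    All names and defaults are hypothetical, not Flink configuration values.
    """
    delay = initial
    attempts = itertools.count() if max_tries is None else range(max_tries)
    for _ in attempts:
        yield delay
        delay = min(delay * factor, max_delay)


# Constant pause, limited number of tries.
constant = list(registration_delays(initial=1.0, factor=1.0, max_tries=4))

# Increasing pause, indefinitely many tries (only the first six are taken here).
increasing = list(itertools.islice(registration_delays(), 6))
```

Capping the pause at `max_delay` bounds the con listed for the increasing pause: a TaskManager that just missed the JobManager waits at most that long before the next try.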


> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> The JobManager's InstanceManager may refuse the registration attempt from a 
> TaskManager, because it has this TaskManager already connected, or, in the 
> future, because the TaskManager has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refused 
> registration. The TaskManager reacts incorrectly to such messages, assuming 
> successful registration.
> Possible solution: JobManager sends back a dedicated "RegistrationRefused" 
> message, if the instance manager returns null as the registration result. If 
> the TaskManager receives that before being registered, it knows that the 
> registration response was lost (which should not happen on TCP and it would 
> indicate a corrupt connection).
> Follow-up question: Does it make sense to have the TaskManager try 
> indefinitely to connect to the JobManager, with an increasing interval (from 
> seconds to minutes)?
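
The proposed fix in the issue description can be sketched as follows. This is an illustration in Python, not Flink's actual code: only the message name "RegistrationRefused" comes from the issue; `RegistrationAcknowledged`, `job_manager_reply`, and `TaskManagerState` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegistrationRefused:
    """Dedicated refusal message, as proposed in the issue."""
    reason: str


@dataclass
class RegistrationAcknowledged:
    """Hypothetical success message carrying the assigned instance ID."""
    instance_id: str


def job_manager_reply(instance_id: Optional[str]):
    """Map the instance manager's result to an explicit reply message."""
    if instance_id is None:
        return RegistrationRefused("instance manager returned no instance ID")
    return RegistrationAcknowledged(instance_id)


class TaskManagerState:
    """Tracks whether the TaskManager considers itself registered."""

    def __init__(self):
        self.instance_id = None

    @property
    def registered(self):
        return self.instance_id is not None

    def handle(self, reply):
        """Return True only if the reply completed a registration."""
        if isinstance(reply, RegistrationAcknowledged):
            self.instance_id = reply.instance_id
            return True
        # A refusal never counts as success; the state stays unchanged.
        return False
```

With an explicit refusal message, a null instance ID can no longer be mistaken for a successful registration on the TaskManager side.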



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
