[jira] [Commented] (FLINK-1352) Buggy registration from TaskManager to JobManager

ASF GitHub Bot (JIRA) Thu, 22 Jan 2015 02:59:12 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287270#comment-14287270
 ]


ASF GitHub Bot commented on FLINK-1352:
---------------------------------------

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/328#issuecomment-71003641
  
    Thanks for the summary, Till :)
    
    On 22 Jan 2015, at 11:48, Till Rohrmann <notificati...@github.com> wrote:
    
    > Indefinitely many registration tries:
    > Pros: If the JobManager becomes available at some point in time, then the
    > TaskManager will definitely connect to it
    > Cons: If the JobManager dies for some reason, then the TaskManager will
    > linger around for all eternity or until it is stopped manually
    
    I am against this as the lingering around is imo problematic.
    
    > Limited number of tries:
    > Pros: Will terminate itself after some time
    > Cons: The time interval might be too short for the JobManager to get
    > started
    > 
    > Constant pause:
    > Pros: Relatively quick response time
    > Cons: Causing network traffic until the JobManager has been started
    > 
    > Increasing pause:
    > Pros: Reduction of network traffic if the JobManager takes a little bit
    > longer to start
    > Cons: Might delay the registration process if one interval was just missed
    
    Maybe keep the current strategy (n times constant pause c) and then start
    backing off?
    
    Has this been reported as a problem in a setup? This is not very
    complicated to implement, but it is hard to find a heuristic that matches
    all use cases, so we might just implement all strategies, keep the current
    one as the default, and make it configurable.
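
A minimal sketch of the "n times constant pause c, then back off" idea, only to make the proposal concrete. The class name RegistrationBackoff, its parameters, the doubling factor, and the pause cap are assumptions for illustration; none of them come from the Flink code base or from this pull request.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not Flink API): pause schedule for registration attempts. */
final class RegistrationBackoff {
    private final int constantTries;    // n: attempts that use the constant pause
    private final long constantPauseMs; // c: constant pause in milliseconds
    private final long maxPauseMs;      // assumed upper bound for the growing pause
    private final int maxTries;         // total attempts before giving up

    RegistrationBackoff(int constantTries, long constantPauseMs,
                        long maxPauseMs, int maxTries) {
        if (constantTries < 0 || constantPauseMs <= 0
                || maxPauseMs < constantPauseMs || maxTries < 1) {
            throw new IllegalArgumentException("invalid backoff configuration");
        }
        this.constantTries = constantTries;
        this.constantPauseMs = constantPauseMs;
        this.maxPauseMs = maxPauseMs;
        this.maxTries = maxTries;
    }

    /**
     * Pause to wait after the given failed attempt (1-based),
     * or empty if the caller should give up and terminate.
     */
    OptionalLong pauseAfterAttempt(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        if (attempt >= maxTries) {
            return OptionalLong.empty(); // limited number of tries: stop here
        }
        if (attempt <= constantTries) {
            return OptionalLong.of(constantPauseMs); // constant phase
        }
        // Backoff phase: double the pause once per attempt past the constant phase.
        long pause = constantPauseMs;
        int doublings = attempt - constantTries;
        for (int i = 0; i < doublings; i++) {
            if (pause > maxPauseMs - pause) {
                return OptionalLong.of(maxPauseMs); // doubling would exceed the cap
            }
            pause *= 2;
        }
        return OptionalLong.of(pause);
    }
}
```

With new RegistrationBackoff(3, 500L, 4000L, 8), the pauses after attempts 1 to 7 are 500, 500, 500, 1000, 2000, 4000, 4000 ms, and after the 8th failed attempt the result is empty, so the caller would stop. Setting maxTries very high approximates the "indefinitely many tries" variant, and setting constantTries to maxTries gives the current constant-pause behaviour, which is one way all strategies could sit behind a single configuration.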


> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> The JobManager's InstanceManager may refuse the registration attempt from a
> TaskManager, because it has this TaskManager already connected, or, in the
> future, because the TaskManager has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refused
> registration. The TaskManager reacts incorrectly to such messages, assuming
> successful registration.
> Possible solution: The JobManager sends back a dedicated "RegistrationRefused"
> message if the instance manager returns null as the registration result. If
> the TaskManager receives that before being registered, it knows that the
> registration response was lost (which should not happen on TCP and would
> indicate a corrupt connection).
> Follow-up question: Does it make sense to have the TaskManager try
> indefinitely to connect to the JobManager, with an increasing interval (from
> seconds to minutes)?
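
The "possible solution" in the description could be sketched roughly as follows. This is an illustration only: the type names (RegistrationSketch, Acknowledge, Refused, State) and the plain-method style are assumptions, not the actual Flink message classes or actor code. The point it shows is that a null result from the instance manager is turned into an explicit refusal message, so the TaskManager never mistakes it for a successful registration.

```java
/** Hypothetical sketch (not Flink API) of an explicit "registration refused" reply. */
final class RegistrationSketch {

    /** Marker for replies sent from the JobManager to the TaskManager. */
    interface Response {}

    /** Registration succeeded; carries the assigned instance ID. */
    static final class Acknowledge implements Response {
        final String instanceId;
        Acknowledge(String instanceId) { this.instanceId = instanceId; }
    }

    /** Registration was refused, e.g. because the TaskManager is already connected. */
    static final class Refused implements Response {
        final String reason;
        Refused(String reason) { this.reason = reason; }
    }

    enum State { UNREGISTERED, REGISTERED, REFUSED }

    /** JobManager side: a null instance ID means the instance manager refused. */
    static Response toResponse(String instanceIdOrNull, String reason) {
        return instanceIdOrNull == null
                ? new Refused(reason)
                : new Acknowledge(instanceIdOrNull);
    }

    /** TaskManager side: only an Acknowledge ever leads to REGISTERED. */
    static State onResponse(State current, Response response) {
        if (response instanceof Acknowledge) {
            return State.REGISTERED;
        }
        // Assumption: a refusal that arrives after a successful registration
        // (e.g. for a duplicate attempt) leaves the TaskManager registered.
        return current == State.REGISTERED ? State.REGISTERED : State.REFUSED;
    }
}
```

How the TaskManager should react once it is in the REFUSED state (retry, back off, or shut down) is left open here, since that is exactly the follow-up question raised in the issue.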



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
