Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-07-06 Thread Vishal Santoshi
Yep, perfect, that we do. Can you confirm, though, that jobs will restart in
the case of a failover? That is what we see, and that is fine.

On Fri, Jul 6, 2018, 8:24 AM Chesnay Schepler  wrote:

> If I remember correctly, the masters file is only used by the
> [start|stop]-cluster.sh scripts to determine how many JobManagers should be
> started / stopped and which port they should use.
>
> It's not necessarily *required*, but without it you have to manually
> start/stop all JobManagers.
>
> On 06.07.2018 14:08, Vishal Santoshi wrote:
>
> Hello Chesnay, I have used an HA setup without the masters file and have
> seen failover happen based on alerts from a leader election routine. Is
> it actually required that there be a masters file when there is a central
> arbiter (ZK) that knows the live JMs and a callback to force TMs to switch
> to a new leader in case of failure?
>
> On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler  wrote:
>
>> Please look into high-availability to make your cluster resistant against
>> shutdowns.
>>
>> On 05.06.2018 12:31, makeyang wrote:
>>
>> Can anybody share any thoughts or insights about this issue?
>>
>>
>>
>> --
>> Sent from: 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>
>>
>>
>


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-07-06 Thread Chesnay Schepler
If I remember correctly, the masters file is only used by the
[start|stop]-cluster.sh scripts to determine how many JobManagers should
be started / stopped and which port they should use.

It's not necessarily /required/, but without it you have to manually
start/stop all JobManagers.
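
For illustration, a conf/masters file for two JobManagers might look like the sketch below, with one host:web-UI-port entry per line. The host names are placeholders and are not taken from this thread; check the standalone HA documentation for your Flink version before relying on the exact format.

```text
jobmanager-host-1:8081
jobmanager-host-2:8081
```

With a file like this, start-cluster.sh would be expected to start one JobManager per listed host; without it, each JobManager has to be started by hand, as noted above.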


On 06.07.2018 14:08, Vishal Santoshi wrote:
Hello Chesnay, I have used an HA setup without the masters file and
have seen failover happen based on alerts from a leader election
routine. Is it actually required that there be a masters file when
there is a central arbiter (ZK) that knows the live JMs and a callback
to force TMs to switch to a new leader in case of failure?


On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler wrote:


Please look into high-availability to make your cluster resistant against
shutdowns.

On 05.06.2018 12:31, makeyang wrote:

Can anybody share any thoughts or insights about this issue?





Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-07-06 Thread Vishal Santoshi
I must admit, though, that the jobs do restart, but they restart
successfully with the new JM.

On Fri, Jul 6, 2018, 8:08 AM Vishal Santoshi wrote:

> Hello Chesnay, I have used an HA setup without the masters file and have
> seen failover happen based on alerts from a leader election routine. Is
> it actually required that there be a masters file when there is a central
> arbiter (ZK) that knows the live JMs and a callback to force TMs to switch
> to a new leader in case of failure?
>
> On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler  wrote:
>
>> Please look into high-availability to make your cluster resistant against
>> shutdowns.
>>
>> On 05.06.2018 12:31, makeyang wrote:
>>
>> Can anybody share any thoughts or insights about this issue?


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-07-06 Thread Vishal Santoshi
Hello Chesnay, I have used an HA setup without the masters file and have
seen failover happen based on alerts from a leader election routine. Is
it actually required that there be a masters file when there is a central
arbiter (ZK) that knows the live JMs and a callback to force TMs to switch
to a new leader in case of failure?

On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler  wrote:

> Please look into high-availability to make your cluster resistant against
> shutdowns.
>
> On 05.06.2018 12:31, makeyang wrote:
>
> Can anybody share any thoughts or insights about this issue?


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-05 Thread Chesnay Schepler
Please look into high-availability to make your cluster resistant against
shutdowns.
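
As a rough sketch of what this pointer involves, a ZooKeeper-based HA setup in flink-conf.yaml might look like the fragment below. The option names are the ones documented for Flink's ZooKeeper HA mode around the 1.5 release, so verify them against the documentation for your version. The quorum hosts, storage directory, and cluster id are placeholders and are not taken from this thread.

```yaml
# flink-conf.yaml (sketch; every value below is a placeholder)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /my-cluster
```

With HA enabled, TaskManagers look up the current leader through ZooKeeper rather than a fixed JobManager address, which is what allows them to reconnect after a failover.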


On 05.06.2018 12:31, makeyang wrote:

Can anybody share any thoughts or insights about this issue?





Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-05 Thread makeyang
Can anybody share any thoughts or insights about this issue?





Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-04 Thread makeyang
So is there a way or config to make the TaskManager keep trying to connect
to the JobManager?





Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-04 Thread makeyang
When I debug the JobManager, the TaskManager logs the following error:
2018-06-04 17:16:33,295 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
2018-06-04 17:16:33,297 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 35df0455efc2fb6fa3f2467f7f5d2ba1.
java.util.concurrent.TimeoutException: The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1553)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener$$Lambda$26/1975100911.run(Unknown Source)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:295)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$12/1732386307.apply(Unknown Source)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
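
The log above shows the TaskExecutor closing its ResourceManager connection after a heartbeat timeout, which is consistent with the JobManager process being paused in a debugger and no longer answering heartbeats. The sketch below illustrates that kind of timeout check in isolation. It is a simplified, hypothetical model: the class and method names are invented for this example, and it is not Flink's actual heartbeat implementation.

```java
// Simplified illustration only; this is NOT Flink's heartbeat code.
// HeartbeatMonitor, recordHeartbeat and isTimedOut are hypothetical names.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatMonitor(long timeoutMillis, long startMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = startMillis;
    }

    /** Called whenever a heartbeat from the remote side arrives. */
    void recordHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** True once more than timeoutMillis has passed since the last heartbeat. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

If the remote side is frozen at a breakpoint, recordHeartbeat is never called and isTimedOut eventually returns true, at which point the connection is closed. Flink documents heartbeat.interval and heartbeat.timeout configuration options; whether raising the timeout is appropriate for a debugging session is an assumption to verify against the documentation for your version.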


