Re: Unable to start Flink HA cluster with Zookeeper
Thanks for the info. I have managed to launch an HA cluster by adding jobmanager.rpc.address for all job managers. However, it did not work with start-cluster.sh; I had to add them one by one.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: Unable to start Flink HA cluster with Zookeeper
Hi,

It will use the HA settings as long as you specify high-availability: zookeeper. The jobmanager.rpc.address is used by the jobmanager as a binding address. You can verify it by starting two jobmanagers and then killing the leader.

Best,
Dawid

On Tue, 21 Aug 2018 at 17:46, mozer wrote:
> Yeah, you are right. I have already tried to set jobmanager.rpc.address and
> it works in that case, but if I use this setting I will not be able to use
> HA, am I right?
> How can the job manager register with ZooKeeper using the right address
> instead of localhost?
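To illustrate the point about the binding address, here is a minimal sketch of how flink-conf.yaml might differ between two JobManager machines. The HA keys are copied from the original post; the host name MachineB is hypothetical and does not appear in the thread, and Machine1 is assumed to resolve to a non-loopback address.

```yaml
# Identical on every machine (keys taken from the original post):
high-availability: zookeeper
high-availability.zookeeper.quorum: Machine1:2181
high-availability.storageDir: file:///shareflink/recovery

# Differs per JobManager machine: each JobManager binds to its own address.
# On Machine1:
jobmanager.rpc.address: Machine1
# On the second JobManager (hypothetical host MachineB), use instead:
# jobmanager.rpc.address: MachineB
```

With this layout the HA mode is still selected by the high-availability key, while jobmanager.rpc.address only decides which address each JobManager binds to.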
Re: Unable to start Flink HA cluster with Zookeeper
Yeah, you are right. I have already tried to set jobmanager.rpc.address and it works in that case, but if I use this setting I will not be able to use HA, am I right?

How can the job manager register with ZooKeeper using the right address instead of localhost?
Re: Unable to start Flink HA cluster with Zookeeper
Hi,

In your case the jobmanager binds itself to localhost, and that is what it writes to ZooKeeper. Try starting the jobmanager manually with jobmanager.rpc.address set to the IP of the machine on which you are running the jobmanager. In other words, make sure the jobmanager binds itself to the right IP.

Regards,
Dawid

On Tue, 21 Aug 2018 at 15:32, mozer wrote:
> FQD or full ip; tried all of them, still no changes ...
> For ssh connection, I can connect to each machine without passwords.
>
> Do you think that the problem can come from :
>
> *high-availability.storageDir: file:///shareflink/recovery* ?
>
> I don't use a HDFS storage but NAS file system which is common for two
> machines.
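A sketch of Dawid's suggestion, applied to the configuration from the original post: the commented-out jobmanager.rpc.address line is replaced with an address that the other machine can reach. Using Machine1 as the value is an assumption based on the masters file in the original post; the IP address of that machine would serve the same purpose. The remaining keys are unchanged.

```yaml
# flink-conf.yaml on the JobManager machine (sketch, not a verified setup)
# Assumption: Machine1 resolves to a non-loopback address on both machines.
jobmanager.rpc.address: Machine1
jobmanager.rpc.port: 6123
high-availability: zookeeper
high-availability.zookeeper.quorum: Machine1:2181
high-availability.zookeeper.path.root: /flink-1.5.1
high-availability.cluster-id: /default_b
high-availability.storageDir: file:///shareflink/recovery
```

If this takes effect, the JobManager log should no longer show the actor system starting at 127.0.0.1.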
Re: Unable to start Flink HA cluster with Zookeeper
FQDN or full IP; I tried all of them, still no change. For the ssh connection, I can connect to each machine without a password.

Do you think that the problem could come from:

*high-availability.storageDir: file:///shareflink/recovery* ?

I don't use HDFS storage but a NAS file system which is shared by the two machines.

I also added:

state.backend: filesystem
state.checkpoints.fs.dir: file:///shareflink/recovery/checkpoint
blob.storage.directory: file:///shareflink/recovery/blob

ZooKeeper log:

2018-08-21 14:59:32,652 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - tickTime set to 2000
2018-08-21 14:59:32,653 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - minSessionTimeout set to -1
2018-08-21 14:59:32,653 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - maxSessionTimeout set to -1
2018-08-21 14:59:32,661 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - binding to port 0.0.0.0/0.0.0.0:2181
2018-08-21 14:59:39,940 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /Machine1:60186
2018-08-21 14:59:40,015 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /Machine2:54466
2018-08-21 14:59:40,017 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /Machine1:60186
2018-08-21 14:59:40,017 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /Machine2:54466

Job Manager log:

2018-08-21 14:59:39,327 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at 127.0.0.1:50101
2018-08-21 14:59:39,723 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-08-21 14:59:39,766 INFO akka.remote.Remoting - Starting remoting
2018-08-21 14:59:39,859 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:50101]
2018-08-21 14:59:39,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@127.0.0.1:50101
2018-08-21 14:59:39,872 INFO org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at file:///shareflink/recovery///blob
2018-08-21 14:59:39,876 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections
2018-08-21 14:59:39,876 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Using '/usr/flink-1.5.1/' as Zookeeper namespace.
2018-08-21 14:59:39,919 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
Re: Unable to start Flink HA cluster with Zookeeper
First of all, try with the FQDN or the full IP. Also, in order to run an HA cluster you need to make sure that you have passwordless ssh access between your master and your slaves.

On Tue, Aug 21, 2018 at 4:15 PM mozer wrote:
> I am trying to install a Flink HA cluster (Zookeeper mode) but the task
> manager cannot find the job manager.
Unable to start Flink HA cluster with Zookeeper
I am trying to install a Flink HA cluster (ZooKeeper mode), but the task manager cannot find the job manager.

Here is the architecture:

- Machine 1 : Job Manager + Zookeeper
- Machine 2 : Task Manager

masters:

Machine1

slaves:

Machine2

flink-conf.yaml:

#jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
blob.server.port: 50100-50200
taskmanager.data.port: 6121
high-availability: zookeeper
high-availability.zookeeper.quorum: Machine1:2181
high-availability.zookeeper.path.root: /flink-1.5.1
high-availability.cluster-id: /default_b
high-availability.storageDir: file:///shareflink/recovery

Here is the Task Manager log; it tries to connect to localhost instead of Machine1:

2018-08-17 10:46:44,875 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-08-17 10:46:44,876 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 1 milliseconds before falling back to heuristics
2018-08-17 10:46:44,966 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /127.0.0.1:37133.
2018-08-17 10:46:45,324 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address /127.0.0.1:37133
2018-08-17 10:46:45,325 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'Machine2/IP-Machine2': Connection refused
2018-08-17 10:46:45,325 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,325 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/IP_Machine2': Connection refused
2018-08-17 10:46:45,325 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,326 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/IP_Machine2': Connection refused
2018-08-17 10:46:45,326 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,726 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address /127.0.0.1:37133
2018-08-17 10:46:45,727 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'Machine2/IP-Machine2

2018-08-17 10:47:22,022 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:36515] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:36515]] Caused by: [Connection refused: /127.0.0.1:36515]

2018-08-17 10:47:22,022 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@127.0.0.1:36515/user/resourcemanager, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@127.0.0.1:36515/user/resourcemanager..
2018-08-17 10:47:32,037 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:36515

PS: **/etc/hosts** contains entries for **localhost, Machine1 and Machine2**.

Can you please tell me how the Task Manager can connect to the Job Manager?

Regards