Setting the parallelism in a cluster of machines properly

2018-04-26 Thread m@xi
Hello Flinkers, I have deployed Flink in a cluster of 17 nodes, each having 8 CPUs. Thus, in total there are 136 CPUs available. I have set the parameter taskmanager.numberOfTaskSlots = 8 on all machines, since they have 8 CPUs. And when I am going to run ./flink run -c classpath jarFile -p 136 a

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread TechnoMage
Go to the web UI and verify all 136 TaskManagers are visible in the machine you are submitting the job from. I have encountered issues where not all TaskManagers start, or you may not have all 17 configured properly to be one cluster vs 17 clusters of 8. Michael > On Apr 26, 2018, at 10:48 AM

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread m@xi
No man. I have 17 TaskManagers and each has a number of 8 slots. Do you think it is better to have 8 TaskManager (1 slot each) ? Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
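
The slot arithmetic in the messages above can be written down as a quick sanity check. This is only a sketch: total_slots and fits are hypothetical helper names, not Flink commands, and it assumes the job needs one slot per unit of parallelism, with all TaskManagers registered at the same JobManager.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink): total slots in the cluster,
# computed as number of TaskManagers * slots per TaskManager.
total_slots() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: succeeds when the requested parallelism ($1) fits
# into a cluster of $2 TaskManagers with $3 slots each.
fits() {
    [ "$1" -le "$(total_slots "$2" "$3")" ]
}

# The setup described in the thread: 17 TaskManagers with 8 slots each.
total_slots 17 8
if fits 136 17 8; then
    echo "parallelism 136 fits"
fi
```

Under the same assumption, if only a single 8-slot TaskManager were registered with the JobManager, the check would fail for any parallelism above 8.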

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread TechnoMage
You need to verify your configs are correct. Check that the local machine sees all the task managers; that is the most likely reason it will reject a higher parallelism. I use a Java program to submit to a 3 node 18 slot cluster without issue on a job with 18 parallelism. I have not used the

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread TechnoMage
Check that you have slaves and masters set correctly on all machines, and in particular the one submitting jobs. Make sure that the machine submitting the job is talking to the correct job manager (jobmanager.rpc.address). It really sounds like you are somehow submitting jobs to

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread Makis Pap
OK Michael! I will look into it and will come back at you! Thanks for the help. I agree that it is quite suspicious the par = 8 Jps? Meaning? Oh I should mention that the JobManager node is also a TaskManager. Best, Max > On 27 Apr 2018, at 01:39, TechnoMage wrote: > > Check that you have s

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread kedar mhaswade
On Thu, Apr 26, 2018 at 10:47 AM, Makis Pap wrote: > OK Michael! > > I will look into it and will come back at you! Thanks for the help. I > agree that it is quite suspicious the par = 8 > > Jps? Meaning? > jps is a tool that comes with JDK (see $JAVA_HOME/bin). This is modeled after the POSIX co

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread TechnoMage
For a cluster that size, having the job manager also be a task manager is not recommended. Michael > On Apr 26, 2018, at 11:47 AM, Makis Pap wrote: > > OK Michael! > > I will look into it and will come back at you! Thanks for the help. I agree > that it is quite suspicious the par = 8 > > J

Re: Setting the parallelism in a cluster of machines properly

2018-04-27 Thread m@xi
Hi Michael! Seems that you were correct. It is weird that I could not set parallelism = 136. I cannot configure the cluster properly so far. I do everything as it is described here [1]. It seems that the JobManager is not reachable. Best, Max [1] -- https://ci.apache.org/projects/flink/flink-

Re: Setting the parallelism in a cluster of machines properly

2018-04-28 Thread m@xi
The TaskManager cannot reach the JobManager. I get this error. Any ideas? Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager. Best, Max
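
One way to follow up on the earlier suggestion about jobmanager.rpc.address is to print what each machine actually has configured and compare the values. The sketch below is not an official Flink tool: conf_value is a hypothetical helper, FLINK_CONF is a hypothetical environment variable, and the default path conf/flink-conf.yaml assumes the script is run from the Flink installation directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink): print the value of a
# "key: value" entry in a Flink config file.
#   $1 = key, $2 = path to flink-conf.yaml
# The dots in the key are treated as regex wildcards, which is loose
# but good enough for this sketch.
conf_value() {
    sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

# FLINK_CONF is an assumed variable name; adjust the path to your install.
conf="${FLINK_CONF:-conf/flink-conf.yaml}"
if [ -f "$conf" ]; then
    echo "JobManager address: $(conf_value jobmanager.rpc.address "$conf")"
    echo "JobManager port:    $(conf_value jobmanager.rpc.port "$conf")"
fi
```

Running this on the master, on every worker, and on the machine that submits the job should show the same address and port everywhere. If a machine shows a different address, that would be one possible explanation for the lost connection.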

Re: Setting the parallelism in a cluster of machines properly

2018-04-29 Thread m@xi
Guys seriously I have done the process as described in the documentation of the standalone cluster 20 times. After I start the cluster with ./start-cluster.sh, I normally see with jps the JobManager process running in the master and the TaskManager processes running in slaves. Although every time I

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread Fabian Hueske
Hi, did you try to increase the Akka timeout [1]? Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#distributed-coordination-via-akka 2018-04-29 19:44 GMT+02:00 m@xi : > Guys seriously I have done the process as described in the documentation of > the

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread m@xi
Hello Fabian! Thanks for the answer. No I did not. Is this a requirement? Best, Max

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread Fabian Hueske
It's not a requirement but the exception reads "org.apache.flink.runtime. client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.". So increasing the timeout might help. Best, Fabian 2018-05-02 12:20 GMT+02:00 m@xi : > Hello Fabian! > > Thanks for the answer. No I did
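
For readers who want to try the suggestion, here is a minimal sketch of changing a timeout entry in flink-conf.yaml. Several things are assumptions rather than anything stated in the thread: set_conf is a hypothetical helper, akka.ask.timeout is used as the example key (the linked configuration page lists the Akka options and should be checked for the one that matters here), and "60 s" is an arbitrary illustrative value, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink): set "key: value" in a config
# file, replacing an existing entry or appending a new one.
#   $1 = key, $2 = value, $3 = file
set_conf() {
    if grep -q "^$1:" "$3"; then
        _tmp=$(mktemp)
        sed "s|^$1:.*|$1: $2|" "$3" > "$_tmp"
        cat "$_tmp" > "$3"    # write back in place, keeping the file's permissions
        rm -f "$_tmp"
    else
        echo "$1: $2" >> "$3"
    fi
}

# Demonstration on a throwaway file, so no real configuration is touched.
demo=$(mktemp)
echo "akka.ask.timeout: 10 s" > "$demo"
set_conf akka.ask.timeout "60 s" "$demo"    # illustrative value only
cat "$demo"
rm -f "$demo"
```

A changed flink-conf.yaml generally has to be present on all nodes, and the cluster restarted, before the new value is picked up.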

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread m@xi
Hey Fabian! Sorry for my unfamiliarity with Flink configuration. I have followed every step, but setting up even a simple cluster of 2 nodes has proved to be a pain in the as@@#. So, to which value do you think I should set the akka timeout? Also, in my head the process is the following : S