Thanks for the responses! I’m currently using 1.2.0 – going to bump it up once I have things stabilized. I haven’t defined any slot sharing groups, but I do think I’ve probably got my job configured sub-optimally. I’ve refactored my code so that I can submit subsets of the flow at a time, and that seems to work. The break point between the JobManager being able to acknowledge the job and not seems to hover somewhere between 10 and 20 flows.
I guess what doesn’t make much sense to me is this: if the user code is uploaded once to the JobManager and downloaded once by each TaskManager, what exactly is the JobManager doing that’s keeping it busy? It’s the same code across the TaskManagers. I’ll get you the logs shortly.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Wednesday, November 08, 2017 10:17 AM
To: Chan, Regina [Tech]
Cc: Chesnay Schepler; user@flink.apache.org
Subject: Re: Job Manager Configuration

Quick question Regina: which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

Hi Regina,

the user code is uploaded once to the JobManager and then downloaded by each TaskManager once, when it first receives the command to execute the first task of your job. As Chesnay said, there is no fundamental limitation on the size of a Flink job. However, it might be the case that you have configured your job sub-optimally.

You said that you have 300 parallel flows. Depending on whether you’ve defined separate slot sharing groups for them, the parallel subtasks of all 300 flows may end up sharing the same slots (this is the behavior if you haven’t changed the slot sharing group). Depending on what you calculate, this can be inefficient, because the individual tasks don’t get much computation time. Moreover, all tasks allocate objects on the heap, which can lead to more GC pressure. Therefore, it might make sense to group some of the jobs together and run them in batches, starting each batch after the previous one completes. But this is hard to say without knowing the details of your job and getting a glimpse at the JobManager logs.

Concerning the exception you’re seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response.
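For readers following along, the slot sharing behavior Till describes can be changed per operator with `slotSharingGroup(...)` in the DataStream API. A minimal sketch, assuming a streaming job and hypothetical source/operator/sink names (by default every operator lives in the "default" group, so subtasks of all 300 flows can be packed into the same slots):

```java
// Sketch only: sourceA, EnrichFn, and sinkA are placeholders, not from the thread.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(sourceA)
   // Everything downstream of this call inherits the group until it is changed again,
   // so subtasks of this flow no longer compete for slots with the other flows.
   .map(new EnrichFn())
   .slotSharingGroup("flows-batch-1")
   .addSink(sinkA);
```

Whether isolating groups of flows actually helps depends on available slots: operators in different slot sharing groups cannot share a slot, so this trades packing density for per-flow computation time.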
Only the creation of the ExecutionGraph and, in an HA setup, making the JobGraph highly available are executed before the JobManager acknowledges the job submission. Only if this acknowledgement is not received in time on the client side is the JobClientActorSubmissionTimeoutException thrown. Therefore, I assume that the JobManager is somehow too busy, or otherwise kept from sending the acknowledgement.

Cheers,
Till

On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:

Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots each. I’m perfectly fine with it queuing up and running when it has the resources to.

From: Chesnay Schepler [mailto:ches...@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan; it just depends on the available resources. The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you’re looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task, Flink copies the user code of all flows to the executing TaskManager, which the network may simply not be able to handle in time. I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:

Asking an additional question: what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.
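Chesnay’s 1800-task figure follows directly from the job shape Regina describes. A quick back-of-the-envelope check (the parallelism value is an assumption; the thread doesn’t state it):

```java
// 300 flows, each with 2 inputs + 3 operators + 1 sink = 6 operators.
// Every operator is deployed as `parallelism` subtasks.
public class TaskCountEstimate {
    public static void main(String[] args) {
        int flows = 300;
        int operatorsPerFlow = 6;  // 2 inputs + 3 operators + 1 sink
        int parallelism = 1;       // assumed; scale linearly for higher values
        int tasksToDeploy = flows * operatorsPerFlow * parallelism;
        System.out.println(tasksToDeploy);  // 1800 at parallelism 1
    }
}
```

Since user code is shipped per task deployment, the deployment cost grows with this product, which is why batching the flows reduces pressure on the JobManager and the network.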
Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below, even though I’ve already set a 30-minute timeout and a 6GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 • (212) 902-5697
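For reference, the two settings mentioned here both live in `flink-conf.yaml`. A sketch with illustrative values matching what the thread describes (Flink 1.2-era keys; these are not recommendations):

```yaml
# flink-conf.yaml -- illustrative values only
jobmanager.heap.mb: 6144        # 6 GB JobManager heap
akka.client.timeout: 1800 s     # 30-minute client-side submission timeout
```

As the thread concludes, raising the timeout only masks the underlying cost of deploying a very large plan; reducing the number of tasks per submission is the more effective lever.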