Thanks for the responses! I’m currently using 1.2.0 – going to bump it up once I have things stabilized. I haven’t defined any slot sharing groups, but I do think I’ve probably got my job configured sub-optimally. I’ve refactored my code so that I can submit subsets of the flow at a time, and that seems to work. The break point between the JobManager being able to acknowledge the job and not seems to hover somewhere between 10 and 20 flows.
I guess what doesn’t make much sense to me is this: if the user code is uploaded once to the JobManager and downloaded once by each TaskManager, what exactly is the JobManager doing that’s keeping it busy? It’s the same code across the TaskManagers. I’ll get you the logs shortly.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Wednesday, November 08, 2017 10:17 AM
To: Chan, Regina [Tech]
Cc: Chesnay Schepler; user@flink.apache.org
Subject: Re: Job Manager Configuration

Quick question Regina: which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

Hi Regina,

the user code is uploaded once to the JobManager and then downloaded by each TaskManager once, when it first receives the command to execute the first task of your job. As Chesnay said, there is no fundamental limitation on the size of a Flink job. However, it might be the case that you have configured your job sub-optimally.

You said that you have 300 parallel flows. Depending on whether you’ve defined separate slot sharing groups for them, the parallel subtasks of all 300 flows may end up sharing the same slots (this is the behavior if you haven’t changed the slot sharing group). Depending on what you calculate, this can be inefficient, because the individual tasks don’t get much computation time. Moreover, all tasks allocate objects on the heap, which can lead to more GC pressure. Therefore, it might make sense to group some of the jobs together and run them in batches, starting each batch after the previous one completes. But this is hard to say without knowing the details of your job and getting a glimpse at the JobManager logs.

Concerning the exception you’re seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response.
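For readers following along, the slot sharing behavior Till describes can be changed per operator with `slotSharingGroup(...)` in the DataStream API. A minimal sketch, assuming a streaming job and hypothetical source/operator/sink names (by default every operator lives in the "default" group, so subtasks of all 300 flows can be packed into the same slots):

```java
// Sketch only: sourceA, EnrichFn, and sinkA are placeholders, not from the thread.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(sourceA)
   // Everything downstream of this call inherits the group until it is changed again,
   // so subtasks of this flow no longer compete for slots with the other flows.
   .map(new EnrichFn())
   .slotSharingGroup("flows-batch-1")
   .addSink(sinkA);
```

Whether isolating groups of flows actually helps depends on available slots: operators in different slot sharing groups cannot share a slot, so this trades packing density for per-flow computation time.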
Only the creation of the ExecutionGraph and, in an HA setup, making the JobGraph highly available are executed before the JobManager acknowledges the job submission. Only if this acknowledgement is not received in time on the client side is the JobClientActorSubmissionTimeoutException thrown. Therefore, I assume that the JobManager is somehow too busy, or otherwise kept from sending the acknowledgement.

Cheers,
Till

On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:

Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots each. I’m perfectly fine with it queuing up and running when it has the resources to.

From: Chesnay Schepler [mailto:ches...@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan; it just depends on the available resources. The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you’re looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task, Flink copies the user code of all flows to the executing TaskManager, which the network may simply not be able to handle in time. I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:

Asking an additional question: what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.
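Chesnay’s 1800-task figure follows directly from the job shape Regina describes. A quick back-of-the-envelope check (the parallelism value is an assumption; the thread doesn’t state it):

```java
// 300 flows, each with 2 inputs + 3 operators + 1 sink = 6 operators.
// Every operator is deployed as `parallelism` subtasks.
public class TaskCountEstimate {
    public static void main(String[] args) {
        int flows = 300;
        int operatorsPerFlow = 6;  // 2 inputs + 3 operators + 1 sink
        int parallelism = 1;       // assumed; scale linearly for higher values
        int tasksToDeploy = flows * operatorsPerFlow * parallelism;
        System.out.println(tasksToDeploy);  // 1800 at parallelism 1
    }
}
```

Since user code is shipped per task deployment, the deployment cost grows with this product, which is why batching the flows reduces pressure on the JobManager and the network.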
Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below, even though I’ve already set a 30-minute timeout and a 6GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 • (212) 902-5697
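For reference, the two settings mentioned here both live in `flink-conf.yaml`. A sketch with illustrative values matching what the thread describes (Flink 1.2-era keys; these are not recommendations):

```yaml
# flink-conf.yaml -- illustrative values only
jobmanager.heap.mb: 6144        # 6 GB JobManager heap
akka.client.timeout: 1800 s     # 30-minute client-side submission timeout
```

As the thread concludes, raising the timeout only masks the underlying cost of deploying a very large plan; reducing the number of tasks per submission is the more effective lever.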