Hi Regina,

The user code is uploaded once to the `JobManager` and then downloaded by
each `TaskManager` once, when it first receives the command to execute a
task of your job. So it is copied per TaskManager, not per operator.
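
To put rough numbers on this, here is a small back-of-the-envelope sketch in plain Java. It is not Flink API: the class and method names are made up for illustration, the parallelism of 20 is purely an assumption (it was not stated in the thread), and 6 operators per flow is the upper bound Chesnay used (2 inputs, 3 operators, 1 sink, ignoring any operator chaining).

```java
// Illustrative arithmetic only; DeploymentEstimate is a hypothetical helper,
// not part of Flink.
final class DeploymentEstimate {

    /** Upper bound on tasks to deploy: each operator of each flow runs `parallelism` subtasks. */
    static long totalTasks(int flows, int operatorsPerFlow, int parallelism) {
        return (long) flows * operatorsPerFlow * parallelism;
    }

    /** User-code downloads: at most one per TaskManager, and never more than there are tasks. */
    static long userCodeDownloads(int taskManagers, long totalTasks) {
        return Math.min((long) taskManagers, totalTasks);
    }

    public static void main(String[] args) {
        int parallelism = 20; // ASSUMPTION: not given in the thread
        long tasks = totalTasks(300, 6, parallelism);
        System.out.println("tasks to deploy (upper bound): " + tasks);
        System.out.println("user-code downloads: " + userCodeDownloads(10, tasks));
    }
}
```

The point is only that the number of user-code transfers grows with the number of TaskManagers (10 in your setup), while the number of task deployments grows with flows times operators times parallelism.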

As Chesnay said, there is no fundamental limit on the size of a Flink job.
However, it might be the case that you have configured your job
sub-optimally. You said that you have 300 parallel flows. Unless you have
defined separate slot sharing groups for them, it might be the case that
parallel subtasks of all 300 flows share the same slot. Depending on what
you calculate, this can be inefficient because the individual tasks don't
get much computation time. Moreover, all tasks allocate some objects on the
heap, which can lead to more GC. Therefore, it might make sense to group
some of the flows together and run these groups in batches, each one
starting after the previous batch has completed. But this is hard to say
without knowing the details of your job and getting a glimpse at the
JobManager logs.
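
The batching idea could be sketched roughly as follows. This is plain Java with no Flink dependency: `FlowBatcher` and `partition` are hypothetical names, and the batch size of 50 in the usage note is an arbitrary assumption, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: splits a list of flow definitions
// into consecutive batches so that each batch can be submitted as its own job.
final class FlowBatcher {

    static <T> List<List<T>> partition(List<T> flows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < flows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, flows.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(flows.subList(start, end)));
        }
        return batches;
    }
}
```

With 300 flows and a batch size of 50 this yields 6 batches. The driver program would then build one job per batch and submit them one after another, waiting for each to finish before submitting the next. Whether this still fits your "one logical commit" requirement is something only you can judge.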

Concerning the exception you're seeing, it would also be helpful to see the
logs of the client and the JobManager. Actually, the scheduling of the job
is independent of the submission response. Only the creation of the
ExecutionGraph and, in an HA setup, making the JobGraph highly available are
executed before the JobManager acknowledges the job submission. The
SubmissionTimeoutException is thrown only if this acknowledgement is not
received in time on the client side. Therefore, I assume that the JobManager
is somehow too busy or is kept from sending the acknowledgement.

Cheers,
Till



On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:

> Does it copy per TaskManager or per operator? I only gave it 10
> TaskManagers with 2 slots. I’m perfectly fine with it queuing up and
> running when it has the resources to.
>
> *From:* Chesnay Schepler [mailto:ches...@apache.org]
> *Sent:* Wednesday, November 01, 2017 7:09 AM
> *To:* user@flink.apache.org
> *Subject:* Re: Job Manager Configuration
>
>
>
> AFAIK there is no theoretical limit on the size of the plan, it just
> depends on the available resources.
>
>
> The job submission times out since it takes too long to deploy all the
> operators that the job defines. With 300 flows, each with 6 operators,
> you're looking at potentially (1800 * parallelism) tasks that have to be
> deployed. For each task Flink copies the user-code of *all* flows to the
> executing TaskManager, which the network may just not be able to handle in
> time.
>
> I suggest splitting your job into smaller batches or even running each of
> them independently.
>
> On 31.10.2017 16:25, Chan, Regina wrote:
>
> Asking an additional question, what is the largest plan that the
> JobManager can handle? Is there a limit? My flows don’t need to run in
> parallel and can run independently. I wanted them to run in one single job
> because it’s part of one logical commit on my side.
>
>
>
> Thanks,
>
> Regina
>
>
>
> *From:* Chan, Regina [Tech]
> *Sent:* Monday, October 30, 2017 3:22 PM
> *To:* 'user@flink.apache.org'
> *Subject:* Job Manager Configuration
>
>
>
> Flink Users,
>
>
>
> I have about 300 parallel flows in one job, each with 2 inputs, 3
> operators, and 1 sink, which makes for a large job. I keep getting the
> timeout exception below, even though I’ve already set the timeout to 30
> minutes and given the JobManager a 6GB heap. Is there a heuristic to
> better configure the JobManager?
>
>
>
> Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
> Job submission to the JobManager timed out. You may increase
> 'akka.client.timeout' in case the JobManager needs more time to configure
> and confirm the job submission.
>
>
>
> *Regina Chan*
>
> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>
> *30 Hudson Street, 37th floor | Jersey City, NY 07302*
> (212) 902-5697
>