Short answer: it could be that your job is simply too big to be serialised, 
distributed and deserialised in the given time, in which case you would have 
to increase the timeouts even more.

Long answer: 

Do you have the same problem when you try to submit a smaller job? Does your 
cluster work for simpler jobs? Try cutting down or simplifying your job until 
it works. You may be able to pin down a single operator that is causing the 
problem (for example, one that holds a huge static data structure). If so, 
you might be able to optimise that operator in some way. It is also possible 
that some operator is doing something unusual that causes the problem.
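
One way to look for such an operator is to measure how many bytes each user 
function takes up once serialised. The sketch below uses plain Java 
serialisation as an approximation, on the assumption that your functions are 
shipped as serialised Java objects. HeavyMapper and LightMapper are 
hypothetical stand-ins for your own operators, and the numbers are only a 
rough indication of what is actually sent to the cluster.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializedSizeCheck {

    // Returns the number of bytes produced by Java-serialising the object.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Hypothetical operator that carries a large lookup table as a field.
    static class HeavyMapper implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int[] lookup = new int[1_000_000];
    }

    // Hypothetical operator with no captured state.
    static class LightMapper implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    public static void main(String[] args) {
        System.out.println("HeavyMapper bytes: " + serializedSize(new HeavyMapper()));
        System.out.println("LightMapper bytes: " + serializedSize(new LightMapper()));
    }
}
```

Running the same check over every function in the job and sorting by size 
should show quickly whether one operator dominates. Note that Java 
serialisation skips static fields, so only data reachable from instance 
fields (or captured by a closure) is counted here.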

You could also approach this problem from the other direction (as previously 
suggested by Fabian). Try to profile the cluster to find out what it is doing 
and where the problem is. Is it the Job Manager? One Task Manager? All of the 
Task Managers? Is there high CPU usage somewhere? Maybe one thread is 
overloaded? High network usage? After identifying the potentially problematic 
JVMs, you could attach a code profiler or print stack traces to pin down the 
problem further.
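
As a rough illustration of the "print stack traces" step, the sketch below 
dumps every thread of the JVM it runs in, together with its state and 
accumulated CPU time, using the standard java.lang.management API. It is only 
a sketch: for a running Job Manager or Task Manager you would more likely run 
the JDK's jstack tool against the process ID than add code. The class and 
method names are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDumpSketch {

    // Builds a textual dump of all live threads in the current JVM.
    static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuAvailable = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            // CPU time is -1 when unavailable or when the thread has already died.
            long cpuNanos = cpuAvailable ? mx.getThreadCpuTime(info.getThreadId()) : -1L;
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState())
              .append(" cpuNanos=").append(cpuNanos)
              .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
    }
}
```

Taking two or three dumps a few seconds apart and comparing them usually 
shows which thread is stuck or busy: a thread whose CPU time keeps growing 
while its stack stays in the same place is a good candidate.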

Piotrek

> On 30 Apr 2018, at 21:53, Chan, Regina <regina.c...@gs.com> wrote:
> 
> Any updates on this one? I'm seeing similar issues with 1.3.3 and the batch 
> api. 
> 
> The main difference is that I have even more operators (~850), mostly maps 
> and filters with one cogroup. I don't really want to set akka.client.timeout 
> to anything more than 10 minutes, seeing that it still fails with that 
> amount. The akka.framesize is already 500 MB... 
> 
> akka.framesize: 524288000b
> akka.ask.timeout: 10min
> akka.client.timeout: 10min
> akka.lookup.timeout: 10min
> 
> 
> Thanks,
> Regina
> 
> 
> 
> -----Original Message-----
> From: Niels [mailto:nielsdenis...@gmail.com] 
> Sent: Tuesday, February 27, 2018 11:40 AM
> To: user@flink.apache.org
> Subject: Re: Fat jar fails deployment (streaming job too large)
> 
> Hi Till,
> 
> I've just tried to set on the *client*:
> akka.client.timeout: 300s 
> 
> On the *cluster*:
> akka.ask.timeout: 30s
> akka.lookup.timeout: 30s
> akka.client.timeout: 300s
> akka.framesize: 104857600b #(10x the original of 10MB)
> akka.log.lifecycle.events: true
> 
> Still gives me the same issue, the fat jar isn't deployed. See the attached
> files for the logs of the jobmanager and the deployer. Let me know if I can
> provide you with any additional info. Thanks for your help!
> 
> Cheers,
> Niels
> 
> Flink_deploy_log.txt
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_file_t1147_Flink-5Fdeploy-5Flog.txt&d=DwICAg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=vus_2CMQfE0wKmJ4Q_gOWWsBmKlgzMeEwtqShIeKvak&m=p4nMsVlOWZXkIxtRMVt11ovf0gctuHFZJfzvDgpvyKk&s=HxWMISxclHHjDET_E_zY-P95lt5mvMxU7YfGx9vyFcg&e=>
> flink_jobmanager_log.txt
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_file_t1147_flink-5Fjobmanager-5Flog.txt&d=DwICAg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=vus_2CMQfE0wKmJ4Q_gOWWsBmKlgzMeEwtqShIeKvak&m=p4nMsVlOWZXkIxtRMVt11ovf0gctuHFZJfzvDgpvyKk&s=8PvIcLRPFokJ5XOPsczSatUddfM-xd6eG_FxaDlHEBk&e=>
> 
> --
> Sent from:
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_&d=DwICAg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=vus_2CMQfE0wKmJ4Q_gOWWsBmKlgzMeEwtqShIeKvak&m=p4nMsVlOWZXkIxtRMVt11ovf0gctuHFZJfzvDgpvyKk&s=yX4z6UV1AFsAQtJsVquzujhFio0CgYr-tAIoroUXP8E&e=>
