Hi Timur,

I had a look at the plan you shared. I could not find any flow that branches and merges again, a pattern which is prone to cause deadlocks.
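By that I mean a flow shaped roughly like the sketch below, where one data set feeds two operators and their results are later joined back together (the data and functions here are made up for illustration, they are not from your job):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BranchAndMergeSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Hypothetical input of (key, count) pairs.
    DataSet<Tuple2<String, Long>> input = env.fromElements(
        new Tuple2<>("a", 1L), new Tuple2<>("b", 2L));

    // The flow branches: the same data set is consumed by two operators ...
    DataSet<Tuple2<String, Long>> incremented = input.map(
        new MapFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
          @Override
          public Tuple2<String, Long> map(Tuple2<String, Long> value) {
            return new Tuple2<>(value.f0, value.f1 + 1);
          }
        });
    DataSet<Tuple2<String, Long>> nonEmpty = input.filter(
        new FilterFunction<Tuple2<String, Long>>() {
          @Override
          public boolean filter(Tuple2<String, Long> value) {
            return value.f1 > 0;
          }
        });

    // ... and merges again: the two branches are joined back on the key.
    incremented.join(nonEmpty).where(0).equalTo(0).print();
  }
}

Your plan did not contain such a shape, so I don't think this is the cause here.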
However, I noticed that the plan performs a lot of partitioning steps. You might want to have a look at forwarded field annotations, which can help to reduce the number of partitioning and sorting steps [1]. This might help with complex jobs such as yours.
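For example, a mapper that copies its key field to the output unchanged could declare that roughly as follows (just a sketch against the Java DataSet API with a made-up function and field names, not code from your job), which lets the optimizer reuse an existing partitioning or sort order on that field instead of shuffling again:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields;
import org.apache.flink.api.java.tuple.Tuple2;

// Field f0 (the key) is emitted unchanged, so a partitioning or sort order
// that exists on f0 before this map remains valid afterwards.
@ForwardedFields("f0")
public class IncrementCount
    implements MapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

  @Override
  public Tuple2<String, Long> map(Tuple2<String, Long> value) {
    return new Tuple2<>(value.f0, value.f1 + 1);
  }
}

The docs in [1] describe the full field expression syntax. One caveat: only declare fields that really are forwarded unmodified, otherwise the optimizer may build an incorrect plan.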
Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations

2016-04-27 10:57 GMT+02:00 Vasiliki Kalavri <vasilikikala...@gmail.com>:

> Hi Timur,
>
> I've previously seen large batch jobs hang because of join deadlocks. We
> should have fixed those problems, but we might have missed some corner
> case. Did you check whether there was any CPU activity when the job hangs?
> Can you try running htop on the taskmanager machines and see if they're
> idle?
>
> Cheers,
> -Vasia.
>
> On 27 April 2016 at 02:48, Timur Fayruzov <timur.fairu...@gmail.com> wrote:
>
>> Robert, Ufuk, logs, execution plan and a screenshot of the console are in
>> the archive:
>> https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0
>>
>> Note that when I looked in the backpressure view I saw back pressure
>> 'high' on the following paths:
>>
>> Input->code_line:123,124->map->join
>> Input->code_line:134,135->map->join
>> Input->code_line:121->map->join
>>
>> Unfortunately, I was not able to take thread dumps or heap dumps
>> (neither kill -3, jstack nor jmap worked, some Amazon AMI problem I assume).
>>
>> Hope that helps.
>>
>> Please, let me know if I can assist you in any way. Otherwise, I probably
>> would not be actively looking at this problem.
>>
>> Thanks,
>> Timur
>>
>> On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>>> Can you please further provide the execution plan via
>>>
>>> env.getExecutionPlan()
>>>
>>> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
>>> <timur.fairu...@gmail.com> wrote:
>>> > Hello Robert,
>>> >
>>> > I observed progress for 2 hours (meaning numbers change on the dashboard),
>>> > and then I waited for 2 hours more. I'm sure it had to spill at some
>>> > point, but I figured 2h is enough time.
>>> >
>>> > Thanks,
>>> > Timur
>>> >
>>> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rmetz...@apache.org> wrote:
>>> >>
>>> >> Hi Timur,
>>> >>
>>> >> thank you for sharing the source code of your job. That is helpful!
>>> >> It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is
>>> >> much more IO-heavy with the larger input data because all the joins
>>> >> start spilling?
>>> >> Our monitoring, in particular for batch jobs, is really not very advanced.
>>> >> If we had some monitoring showing the spill status, we would maybe see
>>> >> that the job is still running.
>>> >>
>>> >> How long did you wait until you declared the job hanging?
>>> >>
>>> >> Regards,
>>> >> Robert
>>> >>
>>> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >>>
>>> >>> No.
>>> >>>
>>> >>> If you run on YARN, the YARN logs are the relevant ones for the
>>> >>> JobManager and TaskManager. The client log submitting the job should
>>> >>> be found in /log.
>>> >>>
>>> >>> – Ufuk
>>> >>>
>>> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>>> >>> <timur.fairu...@gmail.com> wrote:
>>> >>> > I will do it by tomorrow. Logs don't show anything unusual. Are
>>> >>> > there any logs besides what's in flink/log and the YARN container logs?
>>> >>> >
>>> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <u...@apache.org> wrote:
>>> >>> >
>>> >>> > Hey Timur,
>>> >>> >
>>> >>> > is it possible to connect to the VMs and get stack traces of the
>>> >>> > Flink processes as well?
>>> >>> >
>>> >>> > We can first have a look at the logs, but the stack traces will be
>>> >>> > helpful if we can't figure out what the issue is.
>>> >>> >
>>> >>> > – Ufuk
>>> >>> >
>>> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <trohrm...@apache.org>
>>> >>> > wrote:
>>> >>> >> Could you share the logs with us, Timur? That would be very helpful.
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> Till
>>> >>> >>
>>> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <timur.fairu...@gmail.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Hello,
>>> >>> >>>
>>> >>> >>> Now I'm at the stage where my job seems to completely hang. Source
>>> >>> >>> code is attached (it won't compile but I think gives a very good
>>> >>> >>> idea of what happens). Unfortunately I can't provide the datasets.
>>> >>> >>> Most of them are about 100-500MM records; I try to process them on
>>> >>> >>> an EMR cluster with 40 tasks, 6GB memory for each.
>>> >>> >>>
>>> >>> >>> It was working for smaller input sizes. Any idea on what I can do
>>> >>> >>> differently is appreciated.
>>> >>> >>>
>>> >>> >>> Thanks,
>>> >>> >>> Timur