Robert, Ufuk, the logs, execution plan, and a screenshot of the console
are in the archive:
https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0

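For reference, I dumped the plan with env.getExecutionPlan() from the
DataSet API. A minimal Scala sketch of where that call sits (the tiny
two-input pipeline below is a stand-in, not my actual job):

    import org.apache.flink.api.scala._

    object PlanDump {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Stand-in inputs, just to give the plan some structure.
        val a = env.fromElements((1, "a"), (2, "b"))
        val b = env.fromElements((1, "x"), (2, "y"))

        // A single join, standing in for the real joins/co-groups.
        val joined = a.join(b).where(0).equalTo(0)
        joined.writeAsText("/tmp/plan-demo-output") // hypothetical sink

        // Prints the plan as JSON without executing the job.
        println(env.getExecutionPlan())
      }
    }
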
Note that when I looked at the backpressure view, I saw back pressure
marked 'high' on the following paths:

Input->code_line:123,124->map->join
Input->code_line:134,135->map->join
Input->code_line:121->map->join

Unfortunately, I was not able to take thread dumps or heap dumps (neither
kill -3, jstack, nor jmap worked; some Amazon AMI problem, I assume).

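If it helps for next time: when the external tools can't attach, a thread
dump can also be produced from inside the JVM via JMX. A minimal sketch
(ThreadDumper is a made-up helper, and it only works if you can run code
in the affected process):

    import java.lang.management.ManagementFactory

    // Fallback: dump all thread stacks of the current JVM when
    // jstack / kill -3 cannot attach from outside.
    object ThreadDumper {
      def dump(): String = {
        val mx = ManagementFactory.getThreadMXBean
        mx.dumpAllThreads(true, true).map { info =>
          val frames =
            info.getStackTrace.map(f => "    at " + f).mkString("\n")
          "\"" + info.getThreadName + "\" state=" + info.getThreadState +
            "\n" + frames
        }.mkString("\n\n")
      }

      def main(args: Array[String]): Unit = println(dump())
    }
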
Hope that helps.

Please let me know if I can assist you in any way; otherwise, I probably
won't be actively looking at this problem.

Thanks,
Timur


On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <u...@apache.org> wrote:

> Can you please also provide the execution plan via
>
> env.getExecutionPlan()
>
> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
> <timur.fairu...@gmail.com> wrote:
> > Hello Robert,
> >
> > I observed progress for 2 hours (meaning the numbers changed on the
> > dashboard), and then I waited for 2 more hours. I'm sure it had to
> > spill at some point, but I figured 2h was enough time.
> >
> > Thanks,
> > Timur
> >
> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rmetz...@apache.org> wrote:
> >>
> >> Hi Timur,
> >>
> >> thank you for sharing the source code of your job. That is helpful!
> >> It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is
> >> much more IO-heavy with the larger input data because all the joins
> >> start spilling?
> >> Our monitoring, in particular for batch jobs, is really not very
> >> advanced. If we had some monitoring showing the spill status, we would
> >> maybe see that the job is still running.
> >>
> >> How long did you wait until you declared the job hanging?
> >>
> >> Regards,
> >> Robert
> >>
> >>
> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <u...@apache.org> wrote:
> >>>
> >>> No.
> >>>
> >>> If you run on YARN, the YARN logs are the relevant ones for the
> >>> JobManager and TaskManager. The log of the client submitting the job
> >>> should be found in /log.
> >>>
> >>> – Ufuk
> >>>
> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
> >>> <timur.fairu...@gmail.com> wrote:
> >>> > I will do it tomorrow. The logs don't show anything unusual. Are
> >>> > there any logs besides what's in flink/log and the YARN container
> >>> > logs?
> >>> >
> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <u...@apache.org> wrote:
> >>> >
> >>> > Hey Timur,
> >>> >
> >>> > is it possible to connect to the VMs and get stack traces of the
> >>> > Flink processes as well?
> >>> >
> >>> > We can first have a look at the logs, but the stack traces will be
> >>> > helpful if we can't figure out what the issue is.
> >>> >
> >>> > – Ufuk
> >>> >
> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann
> >>> > <trohrm...@apache.org> wrote:
> >>> >> Could you share the logs with us, Timur? That would be very helpful.
> >>> >>
> >>> >> Cheers,
> >>> >> Till
> >>> >>
> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov"
> >>> >> <timur.fairu...@gmail.com> wrote:
> >>> >>>
> >>> >>> Hello,
> >>> >>>
> >>> >>> Now I'm at the stage where my job seems to completely hang. The
> >>> >>> source code is attached (it won't compile, but I think it gives a
> >>> >>> very good idea of what happens). Unfortunately, I can't provide
> >>> >>> the datasets. Most of them are about 100-500MM records, which I
> >>> >>> try to process on an EMR cluster with 40 tasks, 6GB of memory for
> >>> >>> each.
> >>> >>>
> >>> >>> It was working for smaller input sizes. Any ideas on what I could
> >>> >>> do differently are appreciated.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Timur
> >>
> >>
> >
>
