Hey guys,
Thanks for the insights.

Bobby, I see that it derives those values from the run time of the whole task.
But if the whole query took only 6.6 minutes, how can it come up with 7.27 hours?
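To convince myself of your point about per-row timing overhead, I played
with a tiny toy snippet in the scala shell. Nothing Spark-specific, just a
sketch of why timing every row can cost more than the work itself:

  // Toy sketch: timing each row vs timing the whole loop once.
  // The per-row work (one addition) is so cheap that the two
  // System.nanoTime() calls per element dominate, so the summed
  // per-row measurements typically come out much larger than the
  // single whole-loop measurement.
  val data = (1 to 1000000).toArray

  // Per-row timing: two nanoTime() calls per element.
  var perRowMeasured = 0L
  var sum1 = 0L
  for (x <- data) {
    val t0 = System.nanoTime()
    sum1 += x
    perRowMeasured += System.nanoTime() - t0
  }

  // Per-operator style timing: one pair of calls around the whole loop.
  val t1 = System.nanoTime()
  var sum2 = 0L
  for (x <- data) sum2 += x
  val wholeLoop = System.nanoTime() - t1

  println(s"summed per-row measurements: ${perRowMeasured / 1e6} ms, " +
          s"single whole-loop measurement: ${wholeLoop / 1e6} ms")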

Sean, yes, there is data skew: one task takes tens of minutes while the
others take tens of seconds. What gave it away in the chart? The wild
differences between the min/med/max values of WholeStageCodegen?
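For the record, here is roughly how I confirmed the skew on the data side.
It's just a sketch: "df" and the key column "user_id" are placeholders for
our actual dataframe and join key:

  // Rough skew check: how unevenly are rows spread across the shuffle key?
  import org.apache.spark.sql.functions.desc

  df.groupBy("user_id")      // placeholder for the real join/group key
    .count()
    .orderBy(desc("count"))
    .show(20)                // a few huge counts at the top => skewed key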

Cheers,
Michal


Dne čt 9. 7. 2020 23:47 uživatel Sean Owen <sro...@gmail.com> napsal:

> It sounds like you have huge data skew?
>
> On Thu, Jul 9, 2020 at 4:15 PM Bobby Evans <bo...@apache.org> wrote:
> >
> > Sadly there isn't a lot you can do to fix this. All of the operations
> > take iterators of rows as input and produce iterators of rows as output.
> > For efficiency reasons, the timing is not done for each individual row. If
> > we did that, in many cases it would take longer to measure how long
> > something took than it would to just do the operation. So most operators
> > actually end up measuring the lifetime of the operator, which often is the
> > time of the entire task minus how long it took for the first row to get to
> > that operator. This is also true of WholeStageCodegen.
> >
> > On Thu, Jul 9, 2020 at 11:55 AM Michal Sankot <michal.san...@spreaker.com.invalid> wrote:
> >>
> >> Hi,
> >> I'm checking the execution of SQL queries in the Spark UI, trying to
> >> find a bottleneck, and the values displayed in WholeStageCodegen blocks
> >> are confusing.
> >>
> >> In the attached example the whole query took 6.6 minutes, and the upper
> >> left WholeStageCodegen block says that the median value is 7.8 minutes
> >> and the maximum 7.27 hours :O
> >>
> >> What does it mean? Do those numbers have any real meaning? Is there a
> >> way to find out how long individual blocks really took?
> >>
> >> Thanks,
> >> Michal
> >> Spark 2.4.4 on AWS EMR
> >>
>
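
P.S. Regarding my own question about how long individual blocks really
took: the per-task wall-clock times in the Stages tab seem to be the
closest thing. A rough sketch of pulling the same data from the monitoring
REST API while the app is running; the <app-id> and <stage-id> placeholders
need filling in, and the driver UI is assumed to be on port 4040:

  // Fetch per-task details for one stage from the Spark REST API.
  import scala.io.Source

  val url = "http://localhost:4040/api/v1/applications/<app-id>/stages/<stage-id>"
  println(Source.fromURL(url).mkString)  // JSON including task durations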
