Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-02 Thread David Morávek
I've created FLINK-13550 to track the issue.

Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-02 Thread David Morávek
Hi Paul, for now I only plan to add the one based on Java stack traces.

On Fri, Aug 2, 2019 at 9:34 AM Paul Lam  wrote:

> Hi David,
>
> Thanks for the new feature! I think the flame graph would be a useful tool
> to understand the state of job executions, and it looks good too. +1 for
> this.
>
> And a minor question: do we plan to support multiple kinds of flame
> graphs? It would be great if we have both on-cpu and off-cpu flame graphs.
>
> Best,
> Paul Lam


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-02 Thread Paul Lam
Hi David,

Thanks for the new feature! I think the flame graph would be a useful tool to 
understand the state of job executions, and it looks good too. +1 for this.

And a minor question: do we plan to support multiple kinds of flame graphs? It 
would be great if we have both on-cpu and off-cpu flame graphs.

Best,
Paul Lam


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-02 Thread boshu Zheng
Big +1 for this helpful feature :)


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-01 Thread Jark Wu
Hi David,

The demo looks charming! I think it will definitely help a lot with
performance tuning.
A big +1 for this.

I cc'ed Yadong, who's one of the main contributors to the new Web UI.
Maybe he can give some help on the front end.

Regards,
Jark


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-01 Thread David Morávek
Hi Till, thanks for the feedback! These endpoints are only called when the
vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
back-pressure, we only sample the top 3 calls of the stack (depth = 3). For the
flame graph, we want to sample the whole stack trace and we need a different
sampling rate (longer period, more samples). Those are the main reasons to
split these into two "trackers", but I may be missing something.
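
To make the depth and sampling-rate difference concrete, here is a minimal
sketch in plain Java. It is not the Flink code: `StackSamplerSketch`, its
`sample` method and all parameter names are hypothetical, and it only relies
on the standard `Thread.getStackTrace()` call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Flink: samples one thread's stack at a fixed interval. */
final class StackSamplerSketch {

    /**
     * Takes numSamples snapshots of the target thread, delayMillis apart.
     * maxDepth <= 0 keeps the whole trace (flame graph style); a positive
     * value keeps only the top maxDepth frames (back-pressure style).
     */
    static List<StackTraceElement[]> sample(
            Thread target, int numSamples, long delayMillis, int maxDepth) {
        List<StackTraceElement[]> samples = new ArrayList<>(numSamples);
        for (int i = 0; i < numSamples; i++) {
            // Index 0 of the returned array is the top of the stack.
            StackTraceElement[] trace = target.getStackTrace();
            if (maxDepth > 0 && trace.length > maxDepth) {
                trace = Arrays.copyOfRange(trace, 0, maxDepth);
            }
            samples.add(trace);
            if (i < numSamples - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and return what was collected so far.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return samples;
    }
}
```

Under these assumptions, a back-pressure style call would look like
`sample(thread, n, delay, 3)`, while a flame graph style call would pass
`maxDepth = 0` together with a longer delay and a larger sample count.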

I've prepared a little demo, so others can have a better idea of what I
have in mind.

https://youtu.be/GUNDehj9z9o

Please note that this is a proof of concept and I'm not a frontend person, so
it may look a little clumsy :)

D.


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

2019-08-01 Thread Till Rohrmann
Hi David,

thanks for starting this discussion. I like the idea of improving insights
into Flink's execution and I believe that a flame graph could be helpful.

I quickly glanced over your changes and I think they go in a good
direction. One idea could be to share the `StackTraceSample` produced by
the `StackTraceSampleCoordinator` between the different
`StackTraceOperatorTracker` instances so that we don't send multiple requests
for the same operators. That way we would decrease the RPC load a bit.
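
One way to picture the sharing idea is a small cache of in-flight requests
keyed by vertex. The sketch below is only an illustration under assumptions:
`SharedSampleCache`, `getOrRequest` and `invalidate` are hypothetical names,
and the `requester` function stands in for whatever actually triggers the
sampling RPC.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: several trackers share one sample request per vertex. */
final class SharedSampleCache<K, S> {

    private final ConcurrentHashMap<K, CompletableFuture<S>> requests = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<S>> requester;

    /** requester is assumed to start the (expensive) sampling and return quickly. */
    SharedSampleCache(Function<K, CompletableFuture<S>> requester) {
        this.requester = requester;
    }

    /** Returns the existing future for this vertex, or triggers exactly one new request. */
    CompletableFuture<S> getOrRequest(K vertexId) {
        return requests.computeIfAbsent(vertexId, requester);
    }

    /** Drops the cached result so the next caller triggers a fresh sample. */
    void invalidate(K vertexId) {
        requests.remove(vertexId);
    }
}
```

A shared sample would have to be taken at the largest depth any consumer
needs, with shallower consumers truncating it on their side.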

Apart from that, I think the next steps would be to find a committer who
could shepherd this effort and help you with merging it.

Cheers,
Till


[DISCUSS] CPU flame graph for a job vertex in web UI.

2019-07-31 Thread David Morávek
Hello,

While looking into Flink internals, I've noticed that there is already a
mechanism for stack-trace sampling of a particular job vertex.

I think it may be really useful to allow users to easily render a CPU
flame graph in the new UI for a selected vertex (new tab next to back
pressure) of a running job. The back pressure tab already gives a good idea
of which vertex causes trouble, but it's hard to say what's actually going on.

I've tried to implement a basic REST endpoint
<https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
that prepares data for the flame graph rendering, and it seems to provide
good insight.
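
As a rough illustration of what "preparing data" for a flame graph can mean,
the sketch below folds sampled stack traces into a tree of frames with sample
counts. It is not taken from the linked commit: `FlameNode` and `fromSamples`
are hypothetical, and the name/value/children shape is an assumption about
what a JavaScript renderer would consume.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node type: one stack frame plus the number of samples that contain it. */
final class FlameNode {
    final String name;
    int value; // number of samples that passed through this frame
    final Map<String, FlameNode> children = new LinkedHashMap<>();

    FlameNode(String name) {
        this.name = name;
    }

    /** Merges sampled traces into a single tree, outermost frame first. */
    static FlameNode fromSamples(List<StackTraceElement[]> samples) {
        FlameNode root = new FlameNode("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            FlameNode current = root;
            // Stack traces list the innermost frame at index 0, so walk backwards.
            for (int i = trace.length - 1; i >= 0; i--) {
                String frame = trace[i].getClassName() + "." + trace[i].getMethodName();
                current = current.children.computeIfAbsent(frame, FlameNode::new);
                current.value++;
            }
        }
        return root;
    }
}
```

Each node's `value` then corresponds to the width of its box: frames that
show up in more samples get wider.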

It should be straightforward to render data from the endpoint in the new UI
using existing JavaScript libraries.

WDYT? Is this worth pushing forward?

D.