Hi David,

Thanks for starting this discussion. I like the idea of improving insights into Flink's execution and I believe that a flame graph could be helpful.
I quickly glanced over your changes and I think they go in a good direction. One idea could be to share the `StackTraceSample` produced by the `StackTraceSampleCoordinator` between the different `StackTraceOperatorTracker` instances so that we don't send multiple requests for the same operators. That way we would reduce the RPC load a bit.

Apart from that, I think the next steps would be to find a committer who could shepherd this effort and help you with merging it.

Cheers,
Till

On Wed, Jul 31, 2019 at 7:05 PM David Morávek <d...@apache.org> wrote:

> Hello,
>
> While looking into Flink internals, I've noticed that there is already a
> mechanism for stack-trace sampling of a particular job vertex.
>
> I think it may be really useful to allow user to easily render a cpu
> flamegraph <http://www.brendangregg.com/flamegraphs.html> in a new UI for
> a selected vertex (new tab next to back pressure) of a running job. Back
> pressure tab already provides a good idea of which vertex causes trouble,
> but it's hard to say what's actually going on.
>
> I've tried to implement a basic REST endpoint
> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>,
> that prepares data for the flame graph rendering and it seems to be
> providing good insight.
>
> It should be straightforward to render data from the endpoint in new UI
> using existing <https://github.com/spiermar/d3-flame-graph> javascript
> libraries.
>
> WDYT? Is this worth pushing forward?
>
> D.