@Till, I've added the proposed ThreadInfoSamplesRequest and updated the FLIP and the PR accordingly.
Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
Ververica GmbH, registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner

On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <alexan...@ververica.com> wrote:

> Added docs to the PR.
> @David, thanks for the tip, it seems like a good place to put them.
>
> On Wed, Mar 3, 2021 at 12:10 PM David Anderson <dander...@apache.org> wrote:
>
>> This is going to make performance analysis and optimization much more
>> accessible. I can't wait to include this in our training courses.
>>
>> +1
>>
>> Seth suggested putting the docs for this feature under
>> Operations/Monitoring, but there's already a page in the docs under
>> Operations/Debugging for Application Profiling & Debugging, which is more
>> on point. I think it will be confusing if the flame graphs aren't
>> together with the other profilers.
>>
>> David
>>
>> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <sjwies...@gmail.com> wrote:
>>
>> > Cool feature +1
>> >
>> > There is a subsection called monitoring in the operations section of the
>> > docs. It would fit nicely there.
>> >
>> > Seth
>> >
>> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <alexan...@ververica.com> wrote:
>> >
>> > > Hi Piotr,
>> > >
>> > > Thanks for the comments - all valid points.
>> > > We should definitely document how the Flame Graphs are constructed - I
>> > > will work on the docs. Do you have a suggestion as to which page/section
>> > > they should go under?
>> > > I would also like to mention that I plan to work on further
>> > > improvements, such as the ability to "zoom in" on the Flame Graphs for
>> > > the individual Tasks in "unsquashed" form, so some of those concerns
>> > > should be mitigated in the future.
>> > >
>> > > Best,
>> > >
>> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
>> > >
>> > > > Nice feature, +1 from my side.
>> > > >
>> > > > In the PR I think we are missing documentation. I think it's
>> > > > especially important to mention the limitations of this approach for
>> > > > performance analysis. If we make it easy for users to get this kind
>> > > > of data, it's important that they do not proverbially shoot
>> > > > themselves in the foot with false conclusions. We should clearly
>> > > > mention how the data are sampled, and point to limitations such as:
>> > > > - data from all threads/operators are squashed together, so if there
>> > > > is data skew it will be averaged out
>> > > > - stack sampling is/can be biased (JVM threads are more likely to be
>> > > > stopped in some places than others, while skipping or rarely stopping
>> > > > in the true hot spots - so one should treat the results with a grain
>> > > > of salt below a certain level)
>> > > > - the true bottleneck might be obscured by the fact that flame graphs
>> > > > squash results from all of the threads. A flame graph can show 60% of
>> > > > the time spent in one component that is not necessarily the
>> > > > bottleneck, while the actual bottleneck sits in a completely
>> > > > different component that has a single thread burning 100% of one CPU
>> > > > core, barely visible in the flame graph at all.
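[Editor's note: the squashing limitation in the first and third bullets above can be made concrete with a toy sketch. This is a hypothetical illustration, not Flink's implementation; the class and method names are invented. Collapsed stack samples from all threads are merged into a single frequency map, so a frame that dominates one skewed thread is diluted by the other threads.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (not Flink code) of "squashing": collapsed stack samples
// from all threads are merged into one count per stack, so a frame that
// dominates a single skewed thread contributes only a small overall share.
class SquashedSamples {
    static Map<String, Long> squash(List<List<String>> perThreadSamples) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> threadSamples : perThreadSamples) {
            for (String collapsedStack : threadSamples) {
                counts.merge(collapsedStack, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

With four threads of which only one spends all of its samples in a hot frame, that frame accounts for just a quarter of the squashed totals, even though one core is fully saturated.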
>> > > >
>> > > > It's great to have such a nice tool readily and easily available,
>> > > > but we need to make sure that people who are using it are aware of
>> > > > when it can be misleading.
>> > > >
>> > > > Piotrek
>> > > >
>> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <trohrm...@apache.org> napisał(a):
>> > > >
>> > > > > Ah ok. Thanks for the clarification Alex.
>> > > > >
>> > > > > Cheers,
>> > > > > Till
>> > > > >
>> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <alexan...@ververica.com> wrote:
>> > > > >
>> > > > > > It is passed back as part of the response to the asynchronous
>> > > > > > callback within the coordinator and is used to decide whether all
>> > > > > > outstanding requests to the parallel instances of a particular
>> > > > > > operator returned successfully. If so, the request is considered
>> > > > > > successful, the sub-results are combined, and the thread info
>> > > > > > result future for the operator completes.
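[Editor's note: the coordinator-side bookkeeping described above can be sketched roughly as follows. This is a simplified, hypothetical illustration, not Flink's actual ThreadInfoRequestCoordinator; all names are invented. Each request maps to the set of parallel instances that have not yet answered, and the combined future completes once the last one responds.]

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of coordinator bookkeeping: one pending entry per
// request, completed when every parallel instance has responded.
class PendingRequest<T> {
    final Set<String> outstanding = ConcurrentHashMap.newKeySet();
    final Map<String, T> collected = new ConcurrentHashMap<>();
    final CompletableFuture<Map<String, T>> result = new CompletableFuture<>();

    PendingRequest(Set<String> executors) {
        outstanding.addAll(executors);
    }

    // Invoked from the asynchronous callback with one executor's sub-result;
    // when the last outstanding executor answers, the sub-results are
    // combined and the result future completes.
    void handleResponse(String executorId, T subResult) {
        collected.put(executorId, subResult);
        if (outstanding.remove(executorId) && outstanding.isEmpty()) {
            result.complete(collected);
        }
    }
}
```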
>> > > > > >
>> > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
>> > > > > >
>> > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
>> > > > > >
>> > > > > > Best,
>> > > > > >
>> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <trohrm...@apache.org> wrote:
>> > > > > >
>> > > > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
>> > > > > > > need to specify the request id? Is it because the caller can send
>> > > > > > > a second request with the same id? Or can the caller query the
>> > > > > > > result of a previous request by specifying the requestId?
>> > > > > > >
>> > > > > > > If the TaskExecutor does not need to know about the id, then we
>> > > > > > > might be able to drop it.
>> > > > > > >
>> > > > > > > Cheers,
>> > > > > > > Till
>> > > > > > >
>> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <alexan...@ververica.com> wrote:
>> > > > > > >
>> > > > > > > > Hi Till,
>> > > > > > > >
>> > > > > > > > Thanks for your comments.
>> > > > > > > >
>> > > > > > > > * What is the requestId used for in the RPC call?
>> > > > > > > >
>> > > > > > > > It is the handle that is used as the key in the
>> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map.
>> > > > > > > > I believe it was called sampleId in the
>> > > > > > > > StackTraceSampleCoordinator, but I decided to rename it because
>> > > > > > > > there is also a ThreadInfoSampleService, which is actually
>> > > > > > > > responsible for sampling the JVM numSamples times. I found that
>> > > > > > > > the notion of what a sample is was a bit confusing. Now one
>> > > > > > > > thread info request corresponds to gathering numSamples from a
>> > > > > > > > corresponding Task. Hope that makes sense.
>> > > > > > > >
>> > > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples
>> > > > > > > > and maxStackTraceDepth into a ThreadSamplesRequest class? This
>> > > > > > > > would decrease the number of parameters and group those which
>> > > > > > > > are closely related together.
>> > > > > > > >
>> > > > > > > > Good point. I will rework it accordingly.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > >
>> > > > > > > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
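[Editor's note: the parameter grouping discussed above could look roughly like the following. This is a hypothetical sketch only; the actual ThreadInfoSamplesRequest added to the PR may differ in naming and fields, so the class here carries a "Sketch" suffix to make that clear.]

```java
import java.time.Duration;

// Hypothetical sketch of a request class grouping the closely related
// sampling parameters that previously travelled as separate RPC arguments.
final class ThreadInfoSamplesRequestSketch {
    private final int requestId;           // key into the coordinator's pending-responses map
    private final int numSamples;          // how many times to sample the JVM per request
    private final Duration delayBetweenSamples;
    private final int maxStackTraceDepth;  // truncate stacks deeper than this

    ThreadInfoSamplesRequestSketch(
            int requestId,
            int numSamples,
            Duration delayBetweenSamples,
            int maxStackTraceDepth) {
        this.requestId = requestId;
        this.numSamples = numSamples;
        this.delayBetweenSamples = delayBetweenSamples;
        this.maxStackTraceDepth = maxStackTraceDepth;
    }

    int getRequestId() { return requestId; }
    int getNumSamples() { return numSamples; }
    Duration getDelayBetweenSamples() { return delayBetweenSamples; }
    int getMaxStackTraceDepth() { return maxStackTraceDepth; }
}
```

Passing one immutable object through the gateway keeps the RPC signature stable if further sampling parameters are added later.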