Nice feature +1 from my side for it.

In the PR I think we are missing documentation. I think it's especially
important to mention the limitations of this approach for performance
analysis. If we make it easy for the user to get such kind of data, it's
important they do not proverbially shoot themselves in their own foot with
false conclusions. We should clearly mention how those data are sampled,
and point to limitations such as:
- data from all threads/operators are squashed together, so if there is a
data skew it will be averaged out
- stack sampling is/can be biased (JVM threads are more likely to be
stopped in some places than others, while skipping/rarely stopping in the
true hot spots - so one should treat the results with a grain of salt below
a certain level)
- true bottleneck might be obscured by the fact flame graphs are squashing
results from all of the threads. There can be 60% of time spent in one
component according to a flame graph, but it might not necessarily be the
bottleneck, which could be in a completely different component running
which has a single thread burning 100% of a single CPU core, barely visible
in the flame graph at all.

It's great to have such a nice tool readily and easily available, but we
need to make sure people who are using it are aware when it can be
misleading.

Piotrek

wt., 2 mar 2021 o 15:12 Till Rohrmann <trohrm...@apache.org> napisaƂ(a):

> Ah ok. Thanks for the clarification Alex.
>
> Cheers,
> Till
>
> On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <alexan...@ververica.com>
> wrote:
>
> > It is passed back as part of the response to the asynchronous callback
> > within the coordinator and is used to decide if all outstanding requests
> to
> > the parallel instances of a particular operator returned successfully. If
> > so, the request is considered successful, sub-results are combined and
> the
> > thread info result future for an operator completes.
> >
> >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> >
> >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> >
> > Best,
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> >
> > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
> need
> > to
> > > specify the request id? Is it because the caller can send a second
> > request
> > > with the same id? Or can the caller query the result of a previous
> > request
> > > by specifying the requestId?
> > >
> > > If the TaskExecutor does not need to know about the id, then we might
> be
> > > able to drop it.
> > >
> > > Cheers
> > > Till
> > >
> > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > alexan...@ververica.com>
> > > wrote:
> > >
> > > > Hi Till,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > * What is the requestId used for in the RPC call?
> > > >
> > > > It is the handle that is used as the key in the
> > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it
> was
> > > > called sampleId in the StackTraceSampleCoordinator, but I decided to
> > > rename
> > > > it because there is also a ThreadInfoSampleService which is actually
> > > > responsible for sampling the JVM numSamples number of times. I found
> > that
> > > > the notion of what a sample is was a bit confusing. Now one thread
> info
> > > > request corresponds to gathering numSamples from a corresponding
> Task.
> > > Hope
> > > > that makes sense.
> > > >
> > > > * Would it make sense to group numSubSamples, delayBetweenSamples and
> > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would
> > decrease
> > > > the number of parameters and group those which are closely related
> > > > together.
> > > >
> > > > Good point. I will rework it accordingly.
> > > >
> > > > Best,
> > > > --
> > > >
> > > > Alexander Fedulov | Solutions Architect
> > > > Follow us @VervericaData
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from:
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > >
> > >
> >
>

Reply via email to