@Till, I've added the proposed ThreadInfoSamplesRequest and updated the FLIP and the PR accordingly.
Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
Ververica GmbH, registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner

On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <alexan...@ververica.com> wrote:

> Added docs to the PR.
> @David, thanks for the tip, it seems like a good place to put them.
>
> On Wed, Mar 3, 2021 at 12:10 PM David Anderson <dander...@apache.org> wrote:
>
>> This is going to make performance analysis and optimization much more
>> accessible. I can't wait to include this in our training courses.
>>
>> +1
>>
>> Seth suggested putting the docs for this feature under
>> Operations/Monitoring, but there's already a page in the docs under
>> Operations/Debugging for Application Profiling & Debugging, which is more
>> on point. I think it will be confusing if the flame graphs aren't
>> together with the other profilers.
>>
>> David
>>
>> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <sjwies...@gmail.com> wrote:
>>
>> > Cool feature +1
>> >
>> > There is a subsection called monitoring in the operations section of the
>> > docs. It would fit nicely there.
>> >
>> > Seth
>> >
>> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <alexan...@ververica.com> wrote:
>> >
>> > > Hi Piotr,
>> > >
>> > > Thanks for the comments - all valid points.
>> > > We should definitely document how the Flame Graphs are constructed - I
>> > > will work on the docs. Do you have a suggestion as to which page/section
>> > > they should go under?
>> > > I would also like to mention that I plan to work on further
>> > > improvements, such as the ability to "zoom in" on the Flame Graphs for
>> > > the individual Tasks in "unsquashed" form, so some of those concerns
>> > > should be mitigated in the future.
>> > >
>> > > Best,
>> > >
>> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
>> > >
>> > > > Nice feature, +1 from my side.
>> > > >
>> > > > In the PR I think we are missing documentation. I think it's
>> > > > especially important to mention the limitations of this approach for
>> > > > performance analysis. If we make it easy for users to get this kind
>> > > > of data, it's important that they do not proverbially shoot
>> > > > themselves in the foot with false conclusions. We should clearly
>> > > > mention how the data are sampled, and point to limitations such as:
>> > > > - data from all threads/operators are squashed together, so if there
>> > > > is data skew it will be averaged out
>> > > > - stack sampling is/can be biased (JVM threads are more likely to be
>> > > > stopped in some places than others, while skipping or rarely stopping
>> > > > in the true hot spots - so one should treat the results with a grain
>> > > > of salt below a certain level)
>> > > > - the true bottleneck might be obscured by the fact that flame graphs
>> > > > squash results from all of the threads. A flame graph can show 60% of
>> > > > the time spent in one component that is not necessarily the
>> > > > bottleneck, while the actual bottleneck sits in a completely
>> > > > different component that has a single thread burning 100% of one CPU
>> > > > core, barely visible in the flame graph at all.
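[Editor's note: the squashing limitation in the first and third bullets above can be made concrete with a toy sketch. This is a hypothetical illustration, not Flink's implementation; the class and method names are invented. Collapsed stack samples from all threads are merged into a single frequency map, so a frame that dominates one skewed thread is diluted by the other threads.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (not Flink code) of "squashing": collapsed stack samples
// from all threads are merged into one count per stack, so a frame that
// dominates a single skewed thread contributes only a small overall share.
class SquashedSamples {
    static Map<String, Long> squash(List<List<String>> perThreadSamples) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> threadSamples : perThreadSamples) {
            for (String collapsedStack : threadSamples) {
                counts.merge(collapsedStack, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

With four threads of which only one spends all of its samples in a hot frame, that frame accounts for just a quarter of the squashed totals, even though one core is fully saturated.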
>> > > >
>> > > > It's great to have such a nice tool readily and easily available,
>> > > > but we need to make sure that people who are using it are aware of
>> > > > when it can be misleading.
>> > > >
>> > > > Piotrek
>> > > >
>> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <trohrm...@apache.org> napisał(a):
>> > > >
>> > > > > Ah ok. Thanks for the clarification Alex.
>> > > > >
>> > > > > Cheers,
>> > > > > Till
>> > > > >
>> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <alexan...@ververica.com> wrote:
>> > > > >
>> > > > > > It is passed back as part of the response to the asynchronous
>> > > > > > callback within the coordinator and is used to decide whether all
>> > > > > > outstanding requests to the parallel instances of a particular
>> > > > > > operator returned successfully. If so, the request is considered
>> > > > > > successful, the sub-results are combined, and the thread info
>> > > > > > result future for the operator completes.
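[Editor's note: the coordinator-side bookkeeping described above can be sketched roughly as follows. This is a simplified, hypothetical illustration, not Flink's actual ThreadInfoRequestCoordinator; all names are invented. Each request maps to the set of parallel instances that have not yet answered, and the combined future completes once the last one responds.]

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of coordinator bookkeeping: one pending entry per
// request, completed when every parallel instance has responded.
class PendingRequest<T> {
    final Set<String> outstanding = ConcurrentHashMap.newKeySet();
    final Map<String, T> collected = new ConcurrentHashMap<>();
    final CompletableFuture<Map<String, T>> result = new CompletableFuture<>();

    PendingRequest(Set<String> executors) {
        outstanding.addAll(executors);
    }

    // Invoked from the asynchronous callback with one executor's sub-result;
    // when the last outstanding executor answers, the sub-results are
    // combined and the result future completes.
    void handleResponse(String executorId, T subResult) {
        collected.put(executorId, subResult);
        if (outstanding.remove(executorId) && outstanding.isEmpty()) {
            result.complete(collected);
        }
    }
}
```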
>> > > > > >
>> > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
>> > > > > >
>> > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
>> > > > > >
>> > > > > > Best,
>> > > > > >
>> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <trohrm...@apache.org> wrote:
>> > > > > >
>> > > > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
>> > > > > > > need to specify the request id? Is it because the caller can send
>> > > > > > > a second request with the same id? Or can the caller query the
>> > > > > > > result of a previous request by specifying the requestId?
>> > > > > > >
>> > > > > > > If the TaskExecutor does not need to know about the id, then we
>> > > > > > > might be able to drop it.
>> > > > > > >
>> > > > > > > Cheers,
>> > > > > > > Till
>> > > > > > >
>> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <alexan...@ververica.com> wrote:
>> > > > > > >
>> > > > > > > > Hi Till,
>> > > > > > > >
>> > > > > > > > Thanks for your comments.
>> > > > > > > >
>> > > > > > > > * What is the requestId used for in the RPC call?
>> > > > > > > >
>> > > > > > > > It is the handle that is used as the key in the
>> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map.
>> > > > > > > > I believe it was called sampleId in the
>> > > > > > > > StackTraceSampleCoordinator, but I decided to rename it because
>> > > > > > > > there is also a ThreadInfoSampleService, which is actually
>> > > > > > > > responsible for sampling the JVM numSamples times. I found that
>> > > > > > > > the notion of what a sample is was a bit confusing. Now one
>> > > > > > > > thread info request corresponds to gathering numSamples from a
>> > > > > > > > corresponding Task. Hope that makes sense.
>> > > > > > > >
>> > > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples
>> > > > > > > > and maxStackTraceDepth into a ThreadSamplesRequest class? This
>> > > > > > > > would decrease the number of parameters and group those which
>> > > > > > > > are closely related together.
>> > > > > > > >
>> > > > > > > > Good point. I will rework it accordingly.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > >
>> > > > > > > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
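[Editor's note: the parameter grouping discussed above could look roughly like the following. This is a hypothetical sketch only; the actual ThreadInfoSamplesRequest added to the PR may differ in naming and fields, so the class here carries a "Sketch" suffix to make that clear.]

```java
import java.time.Duration;

// Hypothetical sketch of a request class grouping the closely related
// sampling parameters that previously travelled as separate RPC arguments.
final class ThreadInfoSamplesRequestSketch {
    private final int requestId;           // key into the coordinator's pending-responses map
    private final int numSamples;          // how many times to sample the JVM per request
    private final Duration delayBetweenSamples;
    private final int maxStackTraceDepth;  // truncate stacks deeper than this

    ThreadInfoSamplesRequestSketch(
            int requestId,
            int numSamples,
            Duration delayBetweenSamples,
            int maxStackTraceDepth) {
        this.requestId = requestId;
        this.numSamples = numSamples;
        this.delayBetweenSamples = delayBetweenSamples;
        this.maxStackTraceDepth = maxStackTraceDepth;
    }

    int getRequestId() { return requestId; }
    int getNumSamples() { return numSamples; }
    Duration getDelayBetweenSamples() { return delayBetweenSamples; }
    int getMaxStackTraceDepth() { return maxStackTraceDepth; }
}
```

Passing one immutable object through the gateway keeps the RPC signature stable if further sampling parameters are added later.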