Thanks for the response, @Rui, actually async-profiler also supports memory allocation and wall-clock profiling, and I have updated the FILP to include these profiling options in the next developments. I think it deserves to be described as powerful.
@David, in our internal version, we also support the perf_events option, and I think we can include it as an extension in this FLIP. For the request of dynamically changing the kernel parameters, we would not include this in the FLIP as it might cause permissions issues. Best Yun Tang ________________________________ From: David Christle <david.chris...@discordapp.com.INVALID> Sent: Saturday, October 14, 2023 4:11 To: dev@flink.apache.org <dev@flink.apache.org> Subject: Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers In the Wiki, this FLIP is motivated by: - That the current flamegraph functionality can only see operator-level stack traces, while async-profiler provides CPU/allocation/locks information, along with deeper Java & system call stack information. - Low configurability (e.g. cannot set the sampling interval) + the usability is limited to visual inspection of the flamegraph whenever it happens to read out. The current built-in flamegraph is extremely valuable and easy to use. But as a regular user of async-profiler for Flink applications, I agree these deficiencies are worth improving. It's exciting that async-profiler might be built-in, since it will make using it much easier. I'm a little confused, though, about the scope of functionality we'll have with this FLIP, in particular the use of itimer only & perf_events support. One of the questions near the end of the Wiki is whether sampling with perf_events will be supported. The current answer seems to say that only itimer mode will be supported, as this mode does not rely on perf_events being enabled. However, given that an aim of the FLIP is to support configurability (e.g. sampling intervals), is it that much more work to support configurability of the event & the other common options, too? If the '-e' event flag is fixed to itimer only, we can't use wall clock/alloc/cpu profiling modes. The Wiki mentions async-profiler's JNI interface will be used, which has 'event_str' as an input. So, it seems like supporting different event types (or even multiple event types in one profile) is possible. Regarding perf_events, it's true that it's disabled in many environments. But it is possible to enable it for debugging purposes. In our Kubernetes workloads, this means adding SYS_PTRACE and SYS_ADMIN to the securityContext, deploying the job, and then running: sysctl kernel.kptr_restrict=0 sysctl kernel.perf_event_paranoid=1 before starting async-profiler. It would be nice if dynamically changing the kernel parameters was built-in to this FLIP, somehow, as well, to set these parameters correctly before profiling. If the environment restricts changing these, that's fine. We can simply report to the user via the UI that setting them failed, and that the choice of profiling configurations is limited without them. I also think it's OK if `itimer` is the default in the UI, as it works under the broadest conditions. But given the motivation in the Wiki is that async-profiler can see detailed system call stack info, allocation, etc., and that the async-profiler docs describe itimer mode as a "fallback" rather than the way the profiler is best used, it feels like this FLIP should support async-profiler's regular modes of operation & the other most-common configuration options. From my own experience, `cpu` (requiring perf_events) is a bit more accurate than `itimer`, and if I recall, and samples once per thread. `wall` is very useful to debug blocks on I/O or locks. Getting per-thread information is nice to drill down into specific parts of the Flink application, e.g. the flame graph lets me ignore the many other tasks running on TM & drill down into just the Source threads, when debugging a Source issue. Kind regards, David On Fri, Oct 13, 2023 at 1:45 AM Rui Fan <1996fan...@gmail.com> wrote: > One minor comment: > > In general, the generic java profiler includes memory analysis, > cpu, thread, deadlock, etc. The FLIP title is java profiler, but > the FLIP just supports flamegraph at process level. > So the `powerful java profiler` title may not be suitable. > Would you mind updating the FLIP title? > > Best, > Rui > > On Fri, Oct 13, 2023 at 4:34 PM Yu Chen <yuchen.e...@gmail.com> wrote: > > > Hi all. > > If there are no further questions, we will start a vote on FLIP-375 next > > week. > > > > Best regards, > > Yu Chen > > > > > > Yu Chen <yuchen.e...@gmail.com> 于2023年10月9日周一 17:24写道: > > > > > Hi all, > > > > > > Yun Tang and I are opening this thread to discuss our proposal to > > > integrate async-profiler's capabilities for profiling taskmananger > (e.g., > > > generating flame graphs) in the Flink Web [1]. > > > > > > > > > Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by > > > sampling task threads. The results generated in such way missing the > > > relevant stack of java threads and system calls. The async-profiler[2] > > is a > > > low-overhead sampling profiler for Java, but the steps to use it in the > > > production environment are cumbersome and suffer from permissions and > > > security risks. > > > > > > Therefore, we propose adding rest APIs to provide the capability to > > invoke > > > async-profiler on multiple platforms through JNI, which can be easily > > > operated on Web UI. This enhancement will improve the efficiency and > > > experience of Flink users in identifying performance bottlenecks. > > > > > > > > > > > > Please refer to the FLIP document for more details about the proposed > > design > > > and implementation. We welcome any feedback and opinions on this > > proposal. > > > > > > > > > > > > [1] FLIP-375: Built-in cross-platform powerful java profiler on > > > taskmanagers - Apache Flink - Apache Software Foundation > > > < > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers > > > > > > > > > [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP > > > profiler for Java featuring AsyncGetCallTrace + perf_events > > > <https://github.com/async-profiler/async-profiler> > > > > > > > > > > > > Best regards, > > > > > > Yun Tang and Yu Chen > > > > > >