Thanks for the response,

@Rui, actually async-profiler also supports memory allocation and wall-clock
profiling, and I have updated the FLIP to include these profiling options in
future development. I think it deserves to be described as powerful.

@David, in our internal version, we also support the perf_events option, and I
think we can include it as an extension in this FLIP. As for the request to
change the kernel parameters dynamically, we would not include this in the
FLIP, as it might cause permission issues.

Best
Yun Tang

________________________________
From: David Christle <david.chris...@discordapp.com.INVALID>
Sent: Saturday, October 14, 2023 4:11
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler 
on taskmanagers

In the Wiki, this FLIP is motivated by:

- That the current flamegraph functionality can only see operator-level
stack traces, while async-profiler provides CPU/allocation/locks
information, along with deeper Java & system call stack information.
- Low configurability (e.g. cannot set the sampling interval) + the
usability is limited to visual inspection of the flamegraph whenever it
happens to read out.

The current built-in flamegraph is extremely valuable and easy to use. But
as a regular user of async-profiler for Flink applications, I agree these
deficiencies are worth improving. It's exciting that async-profiler might
be built-in, since it will make using it much easier.

I'm a little confused, though, about the scope of functionality we'll have
with this FLIP, in particular the use of itimer only & perf_events support.
One of the questions near the end of the Wiki is whether sampling with
perf_events will be supported. The current answer seems to say that only
itimer mode will be supported, as this mode does not rely on perf_events
being enabled.

However, given that an aim of the FLIP is to support configurability (e.g.
sampling intervals), is it that much more work to support configurability
of the event & the other common options, too? If the '-e' event flag is
fixed to itimer only, we can't use wall clock/alloc/cpu profiling modes.
The Wiki mentions async-profiler's JNI interface will be used, which has
'event_str' as an input. So, it seems like supporting different event types
(or even multiple event types in one profile) is possible.
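
To illustrate the point about a configurable event type, here is a minimal sketch of how a command string with a selectable event could be assembled. The class and method names are hypothetical and are not taken from the FLIP; the argument keywords (`start`, `event=`, `interval=`, `threads`, `file=`) are assumed to follow async-profiler's agent argument syntax, with the interval given in nanoseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink or async-profiler.
final class ProfilerCommand {

    // Builds a comma-separated "start" command with a configurable event.
    // The keywords used here are assumed to match async-profiler's agent arguments.
    static String buildStartCommand(String event, long intervalNanos,
                                    boolean perThread, String outputFile) {
        List<String> parts = new ArrayList<>();
        parts.add("start");
        parts.add("event=" + event);              // e.g. cpu, wall, alloc, itimer
        if (intervalNanos > 0) {
            parts.add("interval=" + intervalNanos);
        }
        if (perThread) {
            parts.add("threads");                 // request per-thread profiles
        }
        if (outputFile != null && !outputFile.isEmpty()) {
            parts.add("file=" + outputFile);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Wall-clock profiling every 10 ms, split per thread.
        System.out.println(
            buildStartCommand("wall", 10_000_000L, true, "/tmp/profile.html"));
    }
}
```

A string like this could presumably be handed to the profiler's Java binding (for example `AsyncProfiler.getInstance().execute(...)`), though how the FLIP wires this up is an assumption on my part.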

Regarding perf_events, it's true that it's disabled in many
environments. But it is possible to enable it for debugging purposes. In
our Kubernetes workloads, this means adding SYS_PTRACE and SYS_ADMIN to the
securityContext, deploying the job, and then running:

sysctl kernel.kptr_restrict=0
sysctl kernel.perf_event_paranoid=1

before starting async-profiler.

It would be nice if dynamically changing the kernel parameters was built-in
to this FLIP, somehow, as well, to set these parameters correctly before
profiling. If the environment restricts changing these, that's fine. We can
simply report to the user via the UI that setting them failed, and that the
choice of profiling configurations is limited without them. I also think
it's OK if `itimer` is the default in the UI, as it works under the
broadest conditions. But given the motivation in the Wiki is that
async-profiler can see detailed system call stack info, allocation, etc.,
and that the async-profiler docs describe itimer mode as a "fallback"
rather than the way the profiler is best used, it feels like this FLIP
should support async-profiler's regular modes of operation & the other
most-common configuration options. From my own experience, `cpu` (requiring
perf_events) is a bit more accurate than `itimer` and, if I recall correctly,
samples once per thread. `wall` is very useful for debugging blocking on I/O or
locks. Getting per-thread information is nice to drill down into specific
parts of the Flink application, e.g. the flame graph lets me ignore the
many other tasks running on TM & drill down into just the Source threads,
when debugging a Source issue.
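
As a rough sketch of the fallback behavior suggested above: read `kernel.perf_event_paranoid` from procfs and downgrade `cpu` to `itimer` when perf_events looks unavailable. The class and method names are hypothetical, and the threshold of 1 is an assumption taken from the sysctl value quoted earlier in this message, not a rule from the async-profiler docs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink or async-profiler.
final class EventSelector {

    static final Path PARANOID_PATH =
        Paths.get("/proc/sys/kernel/perf_event_paranoid");

    // Returns the kernel's perf_event_paranoid level, or null if it cannot be read
    // (for example on non-Linux hosts or in restricted containers).
    static Integer readParanoidLevel(Path path) {
        try {
            return Integer.parseInt(new String(Files.readAllBytes(path)).trim());
        } catch (IOException | NumberFormatException e) {
            return null;
        }
    }

    // Falls back to itimer when cpu is requested but perf_events looks unusable.
    // Assumption: a level of 1 or lower is sufficient for cpu mode.
    static String chooseEvent(String requested, Integer paranoidLevel) {
        boolean perfAvailable = paranoidLevel != null && paranoidLevel <= 1;
        if ("cpu".equals(requested) && !perfAvailable) {
            return "itimer";
        }
        return requested;
    }

    public static void main(String[] args) {
        Integer level = readParanoidLevel(PARANOID_PATH);
        System.out.println("perf_event_paranoid = " + level);
        System.out.println("event to use = " + chooseEvent("cpu", level));
    }
}
```

The UI could then report that `cpu` was downgraded instead of failing outright, which keeps `itimer` as the default that works under the broadest conditions.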

Kind regards,
David

On Fri, Oct 13, 2023 at 1:45 AM Rui Fan <1996fan...@gmail.com> wrote:

> One minor comment:
>
> In general, the generic java profiler includes memory analysis,
> cpu, thread, deadlock, etc. The FLIP title is java profiler, but
> the FLIP just supports flamegraph at process level.
> So the `powerful java profiler` title may not be suitable.
> Would you mind updating the FLIP title?
>
> Best,
> Rui
>
> On Fri, Oct 13, 2023 at 4:34 PM Yu Chen <yuchen.e...@gmail.com> wrote:
>
> > Hi all.
> > If there are no further questions, we will start a vote on FLIP-375 next
> > week.
> >
> > Best regards,
> > Yu Chen
> >
> >
> > On Mon, Oct 9, 2023 at 17:24, Yu Chen <yuchen.e...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Yun Tang and I are opening this thread to discuss our proposal to
> > > integrate async-profiler's capabilities for profiling taskmanagers
> > > (e.g., generating flame graphs) in the Flink Web UI [1].
> > >
> > >
> > > Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by
> > > sampling task threads. The results generated this way are missing the
> > > relevant stacks of Java threads and system calls. The async-profiler [2]
> > > is a low-overhead sampling profiler for Java, but the steps to use it in
> > > a production environment are cumbersome and carry permission and
> > > security risks.
> > >
> > > Therefore, we propose adding REST APIs that invoke async-profiler on
> > > multiple platforms through JNI and can be easily operated from the
> > > Web UI. This enhancement will improve the efficiency and experience of
> > > Flink users in identifying performance bottlenecks.
> > >
> > >
> > >
> > > Please refer to the FLIP document for more details about the proposed
> > > design and implementation. We welcome any feedback and opinions on this
> > > proposal.
> > >
> > >
> > >
> > > [1] FLIP-375: Built-in cross-platform powerful java profiler on
> > > taskmanagers - Apache Flink - Apache Software Foundation
> > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers>
> > >
> > > [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP
> > > profiler for Java featuring AsyncGetCallTrace + perf_events
> > > <https://github.com/async-profiler/async-profiler>
> > >
> > >
> > >
> > > Best regards,
> > >
> > > Yun Tang and Yu Chen
> > >
> >
>
