Re: Schedule for Tuesday's FESCo Meeting (2023-01-03)

2023-01-04 Thread Andrii Nakryiko
> The change as voted simply does not work at a technical level because
> -mno-omit-leaf-frame-pointer is an architecture-specific GCC option that
> is not available on all Fedora architectures.  I don't think
> -fno-omit-frame-pointer is well-exercised on s390x, so I wouldn't want
> to use it there, either.
> 
> Is it safe to assume this change (as voted) is actually intended for
> x86_64 only, in both directions?  Specifically, that opting out will
> *not* disable frame pointers on aarch64 and ppc64le (where it would
> result in an ABI break)?

I'd say we should just keep the spirit of the proposal in mind: to enable frame 
pointers on all platforms where it's possible. Change owners are certainly not 
experts on intricacies of each possible architecture, so it would definitely 
help to get help and feedback from people like you on what specifically needs 
to be enabled to make this work across all arches.

> Regarding leaf functions, on typical workloads, a significant fraction
> of the profiling samples will be in glibc string functions, which do not
> have frame pointers.  So any profiler has to cope with leaf functions
> with frame pointers anyway.  As a result, do we really need the
> -mno-omit-leaf-frame-pointer option?

I think we should be careful about defining "typical workloads" and not assume 
that it's mostly inside glibc. But even if it is, just because glibcs leaf 
functions don't have frame pointers doesn't seem to imply that leaf pointers 
should be generally left out. As I said above, the spirit of the proposal is to 
make stack traces as accurate as possible. That includes leaf functions, right?

"any profiler has to cope with leaf functions" is true, but what is "cope"? 
Some more advanced solutions combine LBR stacks with frame pointers and are 
able to recover full stacks. But the space of stack trace uses (at least with 
BPF) is vast. There will be BPF-based tools of various degree of 
sophistication, and I'm 100% sure all of them won't go to the trouble of using 
LBR to recover the stack. Certainly this won't happen with simple ad-hoc 
bpftrace scripts.

So that's just to say that I think we should try hard to get leaf frame 
pointers as well so that stack traces are better in general. For glibc cases, 
we'll just get stack traces without last level, unless we do something more 
complicated with LBR.

WDYT?

> 
> On aarch64, we should not use -mno-omit-leaf-frame-pointer because there
> is no skipped frame problem with backtraces: the link register (x30) can
> be read directly even in leaf functions.  The ABI only mandates frame
> pointers for non-leaf functions.  On ppc64le, GCC does not even support
> -mno-omit-leaf-frame-pointer.
> 
> This is why I think the change is implicitly just for x86_64.

Definitely not intentionally, might be just a bias of what we had most hands-on 
experience with.
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-07 Thread Andrii Nakryiko
> On Tue, Dec 6, 2022 at 12:56 PM Andrii Nakryiko
>  
> I don't think the two are comparable at all, neither in terms of
> potential performance impact (register pressure across an entire
> program vs at specific API call points in some unique cases) nor in
> terms of the benefits it provides.

It's irrelevant if they are comparable. This is a system-wide change that has 
potential to cause performance regression. By how much -- not clear. 
Hand-waiving such concerns is extremely disappointing in the face of the 
scrutiny given to https://pagure.io/fesco/issue/2817. So, please get 
inspiration from that proposal and do benchmarks.

> 
> Thanks,
> Sid
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-07 Thread Andrii Nakryiko
> Andrii,
> 
> copilot to pilot, you are responding to Jakub Jelinek's points, not 

Copilot? Pilot? I don't understand this euphemism. And yes, I'm well aware who 
I'm replying to, thank you.

> Neal's. Jakub is a compiler/toolchain engineer with considerable 

And not sure why you are implying that Neal is somehow less qualified than 
Jakub?..

> experience, so when he talks about compiler technology involved in 
> tracing execution flow, I am inclined to believe him.
> 

Up to you who to believe and for what reason. I'm inclined to argue based on 
facts and what people are saying, not based on their alleged qualifications or 
reputation.

> I understand that your experience lies in running (and 
> tracing/profiling) production applications, but please consider that 
> your viewpoint may be biased by your experience with existing frame 
> pointer-based instrumentation. This means that you just know from 
> experience when your results are solid and when you have to be careful 
> because of e.g. a particular application's large number of small 
> prolog/epilog-dominated functions or complex and intensive flow of 
> memory allocations. Jakub is saying that DWARF is a superior technology 
> that provide the frame pointer information more reliably, so the real 
> issue is that frame pointer based infrastructure is already here and 
> DWARF handling requires more development. I haven't heard anyone 
> question the solidity of the DWARF approach outlined by Jakub and other 
> people involved with the toolchain, so I think it is reasonable to 
> expect that it will get implemented.

I'm biased towards looking at real world experiences, instead of using 
hypotheticals and some particular low-probability corner cases to make a 
misleading generalized statements like "Even with -fno-omit-frame-pointers, you 
can't rely on frame pointers". In practice, yes we can, and yes we do, across 
millions of machines, tons of various applications. They are reliable enough to 
drive multi-million efficiency wins done by large amount of engineers (and they 
don't even have to be performance experts to get a lot of use from frame 
pointer-based profiling data).

> 
> Are you actually against using the DWARF approach for technical reasons, 
> or is your argument based on the practical consideration of what's 
> available right now?
> 

Technical reasons, and it can't be mitigated with better tooling support. 
DWARF-based stack unwinding doesn't work well in practice and is very 
expensive, to the point that we can't and don't use it in practice for 
fleet-wide profiling. There are many technical reasons why this approach is not 
adequate. I'm not questioning of usefulness of DWARF data, I'm saying for 
profiling production workloads this is not appropriate alternative to frame 
pointers. Many people, both in this thread and in 
https://pagure.io/fesco/issue/2817 and related mailing discussions agree with 
me, coming from different backgrounds and environments.

> 
> Very Respectfully
> 
> p.k.
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-07 Thread Andrii Nakryiko
> On Tue, Dec 06, 2022 at 05:46:11PM -0000, Andrii Nakryiko wrote:
> 
> Depends on how large functions are.  If you have lots of small functions,
> it can be more than 50% of instructions you get the frame pointer wrong.

You might be technically correct, but if some application is spending 50% of 
the time entering and exiting functions, then it's not doing a ton of useful 
work efficiently anyways and should be rethought and improved. And profiling 
data will point this out very clearly.

Let's just not use these convoluted unrealistic corner cases as a generalized 
claims about how frame pointers are useless, ok?

> 
> 
> You haven't looked carefully enough.
> 

Oh, I didn't claim I studied glibc's assembly code very carefully for sure. I 
did a very similar grepping to yours and found just the same **very few** 
functions that do use %rbp.

> find . -name \*.S | xargs grep -l %rbp | xargs grep -L movq.*%rsp.*%rbp
> ./sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core_avx512.S
> ./sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core_avx2.S
> ./sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core_avx512.S
> ./sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core_avx2.S
> ./sysdeps/x86_64/fpu/svml_d_sincos2_core.S
> ./sysdeps/x86_64/fpu/svml_s_sincosf4_core.S
> ./sysdeps/x86_64/addmul_1.S
> ./sysdeps/x86_64/__longjmp.S
> ./sysdeps/x86_64/setjmp.S
> ./sysdeps/x86_64/_mcount.S
> ./sysdeps/unix/sysv/linux/x86_64/longjmp_chk.S
> ./sysdeps/unix/sysv/linux/x86_64/setcontext.S
> ./sysdeps/unix/sysv/linux/x86_64/swapcontext.S
> ./sysdeps/unix/sysv/linux/x86_64/getcontext.S

Out of all of these, we have addmul_1.S as seemingly generic. But grepping 
around glibc sources I could find any place that's actually calling into 
__mpn_addmul_1, so there must be some more magic happening. I'll believe you 
that it's used somewhere in printf family of functions, though I certainly 
didn't see any mention of __mpn_addmul_1 in llvm-objdump and llvm-readelf 
outputs for a sample program that is just doing printf or snprintf.

But all that aside. You've found 14 files that do use %rbp and again are using 
this as a point that frame pointers are useless.

And I'm telling you that **in practice**, in real world applications, they are 
very much useful and are relied upon, even if in some circumstances we fail to 
get frame pointers. I don't understand how one can honestly can use this line 
of argument to say that frame pointers are useless.

And by enabling frame pointers even wider we make them even more useful.

> 
> E.g. mpn_addmul_1 is used by printf, which certainly shows up a lot in
> normal workloads.  The above cases mean the bp register contains total
> garbage if one tries to use it for backtrace purposes.  And obviously
> glibc isn't the only library with inline assembly out there...
> As I said, you can't rely on frame pointers making sense, with unwind info

See my replies. Yes, frame pointers might be broken in some rare situations, 
yes we don't have 100% success rate capturing stack traces precisely because 
there is embedded asm that clobbers %rbp, or binaries and libraries are build 
with -fomit-frame-pointers. We are asking to minimize the latter.

And for the former, it's on a case-by-case basis to try to mitigate this, but 
ultimately no one expect that stack traces will be captured with 100% success 
rate. We want this rate to be as high as possible though. And arguments like 
yours are not helping, but also misleading people that are not familiar with 
the matters at hand, but trust people like you just because you are "a 
compiler/toolchain engineer with considerable experience", while it's clear 
that compiler/toolchain engineers would rather spend time arguing about petty 
low-probability counter-examples than be helpful and see the problem in a wider 
context.

> such as in DWARF you know that in this case rbp is used as a frame pointer
> and in this case it is not, use some other register or this offset from
> stack pointer etc.
> 
> AFAIK systemtap already uses DWARF unwinder in the kernel and as I said, for
> the profiling purposes one can use only a small subset of that for the vast
> majority of cases.

This does not work for any of multitude of BPF-based tools. And even with perf, 
that supports very expensive DWARF-based unwinding, this doesn't work very well 
in practice. There is evidence even from people commenting in this thread. And 
all this was explained at length in https://pagure.io/fesco/issue/2817.

> 
>   Jakub
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://li

Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-06 Thread Andrii Nakryiko
> On Tue, Dec 6, 2022 at 10:06 AM Gary Buhrmaster
>  
> My full comment in that blog post is:
> 
> "We need a proper study of performance and code size to understand the
> magnitude of the impact created by _FORTIFY_SOURCE=3 additional
> runtime code generation. However the performance and code size
> overhead may well be worth it due to the magnitude of improvement in
> security coverage."
> 
> To elaborate, I think the performance and code size study is an
> interesting academic concern, but the magnitude of improvement
> justifies whatever little performance impact this may introduce and it
> should not be a blocker for the improvement.

It's interesting how now it's just an academic concern. Please hold yourself to 
at least the standards set for https://pagure.io/fesco/issue/2817 in terms of 
benchmarking and persuading everyone that performance degradation across entire 
ecosystem is not a concern. 

> 
> 
> Sure, I could add that as a comment in the proposal.

We need benchmarks, not a comment. Thank you.

> 
> Thanks,
> Sid
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-06 Thread Andrii Nakryiko
> On Tue, Dec 06, 2022 at 03:12:19AM +, Gary Buhrmaster wrote:
> 
> Note that is not a fully equivalent scenario. The no-omit-frame-pointer
> proposal was only offering a functional debugging benefit to a fairly
> small number of users who are also developers, while adding a likely
> performance hit to all users. There needs to be a high bar to justify
> the performance hit when the benefit offered is narrow.

First, frame pointers are not just for debugging benefit. It's not even it's 
main benefit from POV of https://pagure.io/fesco/issue/2817. Frame pointers are 
for performance profiling and observability first and foremost, and that's most 
useful under real-world conditions of production workloads. Not some custom 
re-built debug versions of applications.

Second, it might benefit a relatively small (but not tiny, it's at least 
thousands of people doing performance profiling) fraction of users, but those 
users (developers that care about performance) are the ones bringing benefits 
to very wide user base.

And third, with appropriate infrastructure of background perf profiling that 
projects like GNOME and KDE can put in place (if they haven't already), user 
doesn't have to be involved in performance profiling directly to benefit the 
ecosystem at large, providing anonymized source of real-world profiling data 
that could be aggregated and looked at by GNOME/KDE/whatnot developers that do 
care about performance.

But this won't happen unless GNOME/KDE can rely on having frame pointers in 
user systems.

> 
> This proposal is adding a functional security benefit to all users,
> alongside the possible performance hit. This is more easily justifiable,
> especially given Fedora's track record of being willing to security
> improvements even when they have a performance hit.
> 
> With regards,
> Daniel
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

2022-12-06 Thread Andrii Nakryiko
> On Tue, Dec 06, 2022 at 08:13:51AM -0500, Neal Gompa wrote:
> 
> That is nonsense.  Even with -fno-omit-frame-pointers, you can't rely
> on frame pointers, they are not accurate in function prologues and epilogues
> and they are total garbage e.g. in a lot of functions written in assembly.

First of all, https://pagure.io/fesco/issue/2817 is first and foremost about 
enabling low-overhead **profiling** of applications (periodic in background or 
on-demand requested by users explicitly), not debugging use cases. For 
debugging use cases GDB might be perfectly adequate to rely only on DWARF 
information. But -fno-omit-frame-pointers is to enable profiling **production 
workloads** as they happen, because very often reproducibility of results is 
impossible without understanding the issue in the first place, which is what 
frame pointers are needed for. Even more often you don't even realize that 
application is doing something suboptimal unless you profile it continuously, 
as it handles *real workload*.

Now, about prologues/epilogues. What percentage of useful workload is spent in 
those? Tiny fraction of a percent at best? Even if we don't get accurate stack 
trace in such cases it doesn't matter in the grand scheme of things.

As for hand-written assembly functions. I looked at strcmp, memcpy and similar 
very frequent and hot functions in glibc. Yes, they don't save %rbp and don't 
maintain frame pointers. But they also don't use %rbp register at all, which 
means **they don't clobber it**. So if we take stack trace while such function 
is executing, we'll still get non-broken stack trace all the way up to the root 
parent function, we just won't have leaf-level function in stack traces. That's 
much better and more useful than completely broken stack and allows to reason 
very well about application performance. Also, almost all fpu-related routines 
under sysdeps/x86_64/fpu/multiarch/svml*.S that are using %rbp, are also 
properly maintaining frame pointers:

There is a very good reason why Meta enabled -fno-omit-frame-pointers across 
all its internal software 5 years ago and never looked back. We rely on frame 
pointers being available across millions of machines to drive performance 
improvements and investigations saving millions of dollars of real 
non-hypothetical savings. Google, Netflix, Apple, etc -- all enable frame 
pointers due to sheer usefulness of them in practice, either for performance 
profiling and/or better real-time introspection of their workloads.

So much for the "nonsense". I realize that not everyone have experience with 
tracing production workloads and generally working in profiling area, but I 
expect people to keep an open mind and not use double standards when thinking 
about implications of system-wide changes like this one or frame pointers one.

> The only reliable way to get backtraces is DWARF info or something derived
> from it, that is for code emitted by the compilers (with the default
> -fasynchronous-unwind-tables) accurate for all instructions and for
> hand written assembly one has at least a way to describe that through
> .cfi_* directives.
> 
> As has been written multiple times, for profiling there doesn't need to be
> full DWARF unwinder in the kernel, it is enough that there is something
> that can handle the wast majority of cases and punt (copy to userland full
> stack) in case of anything unexpected as long as it is rare.
> Say on x86-64 and various other targets, the stack pointer, CFA (how to get
> caller's stack pointer), frame pointer if any or return address is very
> rarely stored anywhere but on the stack, so it should be enough to consider
> CFA, sp, bp, ip during unwinding and ignore everything else and from the
> harder DW_CFA_* ops (those that need DWARF expression evaluation)
> perhaps only pattern recognize the most common ones (say PLT slot, signal
> frame).

You clearly have never tried to do this in practice. You'd know about the price 
of discovering and pre-computing such look up tables, both in terms of memory 
usage, CPU usage, complexity and reliability. And then you'd also realize the 
problem of tracing short-lived processes that started after profiling started 
and were discovered upfront. As for the approach with capturing stack and 
sending it for post-processing. Beyond just the overhead of capturing and 
sending so much data for post-processing, you'd also stop and wonder what is 
the right size of stack to capture to handle deep stacks/recursion or functions 
with large stack size usage.

DWARF is not a panacea. For production workloads it doesn't even come close as 
an satisfactory solution, if employed in isolation.

> 
> Frame pointers will never result in anything reliable though, it results
> purely in severe performance degradation and false feeling they can be
> relied on.

THIS is nonsense, both on degradation and reliability points. Performance 
engineers and just typical applications owners laugh at your