Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-25 Thread Kees Cook
On Tue, Jul 25, 2017 at 10:58 AM, Josh Poimboeuf wrote:
> [ Adding Kees to CC for the hardened usercopy discussion. ]
>
> Kees, FYI: frame pointers may be disabled by default on x86 relatively
> soon (presumably weeks or months) in favor of the ORC unwinder.  So the
> hardened usercopy stack walk will no longer work as advertised.
>
> Using the ORC unwinder for hardened usercopy would probably be pretty
> bad performance-wise.  I'm not sure what else could be done.  Ingo did
> have a few ideas for sanity checks:
>
> On Tue, Jul 25, 2017 at 11:09:44AM +0200, Ingo Molnar wrote:
>> > > > Well, on x86, hardened usercopy relies on frame pointers, but not the
>> > > > unwinder.  It does the frame pointer walk manually to avoid the full
>> > > > unwinder overhead.  See arch_within_stack_frames().
>>
>> BTW., I think this aspect of the hardened user-copy is crazy stuff - there
>> can be many stack frames, and this adds a serious amount of overhead even
>> with frame pointers...
>>
>> I think the current behavior is fine: if frame pointers are disabled then
>> arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity
>> checks: we do know the kernel stack range and we could check alignment as 
>> well.
>
> I believe it checks the kernel stack range already in
> check_stack_object() before deciding whether to call
> arch_within_stack_frames().  It also has an overlapping stack check.

Right, pointers starting in the stack are already checked to not go
beyond the stack.

As far as dropping inter-frame overflow checking goes, I'd prefer to
keep it, but its benefit in my mind is already pretty minimal since it
already doesn't protect/exclude the stack canary. And since this is a
check for a linear overflow (i.e. a contiguous access) we're mostly
protected by the existing stack canary for writes. For reads, we do
risk allowing return addresses to get exposed, though without the
frame pointer, we've got even less to expose in the first place.

So, mainly, I'm fine with this. I'm slightly sad, but it's not a huge
loss. The main benefit of usercopy hardening is the slab cache object
size protections...

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-25 Thread Josh Poimboeuf
[ Adding Kees to CC for the hardened usercopy discussion. ]

Kees, FYI: frame pointers may be disabled by default on x86 relatively
soon (presumably weeks or months) in favor of the ORC unwinder.  So the
hardened usercopy stack walk will no longer work as advertised.

Using the ORC unwinder for hardened usercopy would probably be pretty
bad performance-wise.  I'm not sure what else could be done.  Ingo did
have a few ideas for sanity checks:

On Tue, Jul 25, 2017 at 11:09:44AM +0200, Ingo Molnar wrote:
> > > > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > > > unwinder.  It does the frame pointer walk manually to avoid the full
> > > > unwinder overhead.  See arch_within_stack_frames().
> 
> BTW., I think this aspect of the hardened user-copy is crazy stuff - there
> can be many stack frames, and this adds a serious amount of overhead even
> with frame pointers...
> 
> I think the current behavior is fine: if frame pointers are disabled then 
> arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity 
> checks: we do know the kernel stack range and we could check alignment as 
> well.

I believe it checks the kernel stack range already in
check_stack_object() before deciding whether to call
arch_within_stack_frames().  It also has an overlapping stack check.
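
For reference, the checks described above can be modelled roughly as follows. This is a user-space Python sketch, not the kernel's C code: addresses are plain integers, the `frames` list stands in for the chain of saved frame pointers, and the numeric return values are made up for illustration. Only the two function names and the NOT_STACK result come from this thread; the other result names follow my reading of mm/usercopy.c and should be checked against the tree.

```python
# Illustrative result codes (values are assumptions, not the kernel's).
NOT_STACK, GOOD_FRAME, GOOD_STACK, BAD_STACK = 0, 1, 2, -1
WORD = 8  # bytes per saved pointer on x86-64


def within_stack_frames(stack, stackend, frames, obj, length):
    """Model of the manual frame-pointer walk.

    `frames` lists the addresses of saved frame pointers, starting
    frame first and then outwards (ascending addresses, since the stack
    grows down).  An empty list models a build without frame pointers.
    """
    if not frames:
        return NOT_STACK
    oldframe = frames[0]
    for frame in frames[1:]:
        if not (stack <= frame < stackend):
            break
        if obj + length <= frame:
            # The object ends below this frame's saved bp; it must also
            # start above the previous saved bp and return address.
            return GOOD_FRAME if obj >= oldframe + 2 * WORD else BAD_STACK
        oldframe = frame
    return BAD_STACK


def check_stack_object(stack, stackend, frames, obj, length):
    # Object is not on the stack at all.
    if obj + length <= stack or stackend <= obj:
        return NOT_STACK
    # Object partially overlaps the stack: one end is outside it.
    if obj < stack or stackend < obj + length:
        return BAD_STACK
    # Per-frame check; NOT_STACK (0) here means "no opinion".
    ret = within_stack_frames(stack, stackend, frames, obj, length)
    return ret if ret else GOOD_STACK
```

With an empty `frames` list the per-frame step has no opinion, so an object that lies fully inside the stack is still accepted and one that runs off the end is still rejected. Only the inter-frame check is lost.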

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-25 Thread Ingo Molnar

* Josh Poimboeuf wrote:

> On Wed, Jul 12, 2017 at 09:27:50PM +0200, Ingo Molnar wrote:
> > Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig
> > interface a bit nicer:
> > 
> >   CONFIG_UNWINDER_FRAME_POINTER
> >   CONFIG_UNWINDER_ORC
> >   CONFIG_UNWINDER_GUESS
> > 
> > ... or so?
> 
> So far I haven't been able to figure out how to make the above three
> options into a multiple choice selection, such that allnoconfig selects
> CONFIG_UNWINDER_GUESS and alldefconfig selects
> CONFIG_UNWINDER_FRAME_POINTER.

I don't think that's a problem: the scheduler preemption model Kconfig setup
has similar behavior - allyesconfig does not enable CONFIG_PREEMPT=y.

The new x86 default will eventually be the Orc unwinder, but not initially.

> > I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to
> > the non-trivial speedup it offers - but maybe folks would object?
> 
> Personally I wouldn't have an objection to making ORC the default, though we 
> should probably wait to give it some burn-in time first.

Sure, that's what testing is for.

> If we *do* decide to eventually make it the default, we could flip the switch
> at the same time we introduced the multiple-choice config and rename above.
> That way, users of "make oldconfig" would see the change and would be
> encouraged to switch to ORC.

I disagree, as the current Kconfig layout actively hinders the 'more testing'
part: you can only enable Orc if you know how to do it, and 99% of our testers
won't bother. In practice that's a testing coverage that is close to not
testing it at all ...

> > > > CONFIG_FRAME_POINTERS et al would be left for architectures where it
> > > > has a meaning beyond backtrace generation. (Not sure whether there's any
> > > > such architectures.)
> > > 
> > > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > > unwinder.  It does the frame pointer walk manually to avoid the full
> > > unwinder overhead.  See arch_within_stack_frames().

BTW., I think this aspect of the hardened user-copy is crazy stuff - there can
be many stack frames, and this adds a serious amount of overhead even with
frame pointers...

I think the current behavior is fine: if frame pointers are disabled then 
arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity 
checks: we do know the kernel stack range and we could check alignment as well.

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-14 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 09:27:50PM +0200, Ingo Molnar wrote:
> Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig
> interface a bit nicer:
> 
>   CONFIG_UNWINDER_FRAME_POINTER
>   CONFIG_UNWINDER_ORC
>   CONFIG_UNWINDER_GUESS
> 
> ... or so?

So far I haven't been able to figure out how to make the above three
options into a multiple choice selection, such that allnoconfig selects
CONFIG_UNWINDER_GUESS and alldefconfig selects
CONFIG_UNWINDER_FRAME_POINTER.

I *think* I should be able to do it by setting the choice default to
FRAME_POINTER, and setting the 'allnoconfig_y' option for
UNWINDER_GUESS.  But kconfig apparently doesn't support 'allnoconfig_y'
for choice selections yet.  I may need to modify kconfig for that.

But IMO, this change can come later, and the current patches should be
fine to merge as-is.  And it might make sense to delay such a patch
anyway, see below.

> Default would be the historic FRAME_POINTER, at least initially, I think.
> 
> I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to the 
> non-trivial speedup it offers - but maybe folks would object?

Personally I wouldn't have an objection to making ORC the default,
though we should probably wait to give it some burn-in time first.

If we *do* decide to eventually make it the default, we could flip the
switch at the same time we introduced the multiple-choice config and
rename above.  That way, users of "make oldconfig" would see the change
and would be encouraged to switch to ORC.

> > > CONFIG_FRAME_POINTERS et al would be left for architectures where it has
> > > a meaning beyond backtrace generation. (Not sure whether there's any such
> > > architectures.)
> > 
> > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > unwinder.  It does the frame pointer walk manually to avoid the full
> > unwinder overhead.  See arch_within_stack_frames().
> 
> Oh well...
> 
> > Ok, how about:
> > 
> >   "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
> >   kernel) than DWARF eh_frame tables."
> > 
> > (My previous 1MB number was from my distro-based config, and it also
> > forgot to take into account the fast lookup table (".orc_lookup")).
> 
> Sounds good to me!

Ok, I'll post a v3.1 of patch 9 with the changed wording.

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-14 Thread Ingo Molnar

* Josh Poimboeuf wrote:

> > > The results wouldn't be 100% accurate, but they could end up being useful 
> > > over time.
> > 
> > And to expound further on the bad idea, maybe the "bad" addresses could be 
> > filtered out somehow in post-processing (insert lots of hand waving).
> 
> And some details on the post-processing: in most cases it should be possible
> to determine which of the found stack addresses are valid by looking at the
> call instructions immediately preceding the stack text addresses, and making
> sure the call target points to the same function as the previously found
> address.  But of course that wouldn't work for indirect calls.

I believe this is similar to how OProfile did graph/dwarf profiling, by saving
a copy of the stack and post-processing it.

By my best recollection (but I haven't used OProfile that much) it was a
performance nightmare, was limited (because it only saved a part of the stack),
and was rather fragile as well, because it depended on the task VM being
post-processable.

I think the highest quality implementation is to generate the call trace either
in hardware (LBR), or as close to the event as possible: generate the kernel
call chain in the PMI context, and the user-space call chain before user-space
executes again (at the latest). Call chain generation should be roughly
O(chain_depth), which both FP and ORC ensure.

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-14 Thread Ingo Molnar

* Josh Poimboeuf wrote:

> For user space stack unwinding, the kernel could emulate what the kernel
> 'guess' unwinder does by scanning the user space stack and returning all
> the text addresses it finds.

User-space stacks tend to be much larger than kernel stacks, so the cost of
doing such a full scan on every PMI would kill a lot of profiling workloads.

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 09:29:17PM -0700, Andi Kleen wrote:
> On Wed, Jul 12, 2017 at 05:47:59PM -0500, Josh Poimboeuf wrote:
> > On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > > > Josh Poimboeuf writes:
> > > >
> > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > >
> > > Can we have an option to just use dwarf instead? For people
> > > who don't want to waste a MB+ to solve a problem that doesn't
> > > exist (as proven by many years of opensuse kernel experience)
> > > 
> > > As far as I can tell this whole thing has only downsides compared
> > > to the dwarf unwinder that was earlier proposed. I don't see
> > > a single advantage.
> > 
> > Improved speed, reliability, maintainability.  Are those not advantages?
> 
> Ok. We'll see how it works out.
> 
> The memory overhead is quite bad though. You're basically undoing many
> years of efforts to shrink kernel text. I hope this can be still
> done better.

If we're talking *text*, this further shrinks text size by 3% because
frame pointers can be disabled.

As far as the data size goes, is anyone *truly* impacted by that extra
1MB or so?  If you're enabling a DWARF/ORC unwinder, you're already
signing up for a few extra megs anyway.

I do have a vague idea about how to reduce the data size, if/when the
size becomes a problem.  Basically there's a *lot* of duplication in the
ORC data:

  $ tools/objtool/objtool orc dump vmlinux | wc -l
  311095

  $ tools/objtool/objtool orc dump vmlinux |cut -d' ' -f2- |sort |uniq |wc -l
  345

So that's over 300,000 6-byte entries, only 345 of which are unique.
There should be a way to compress that.  However, it will probably
require sacrificing some combination of speed and simplicity.
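
To put rough numbers on that duplication, here is a back-of-the-envelope sketch in Python. It assumes the simplest scheme I can think of: a table of the unique 6-byte entries plus a fixed-width index per original entry. The 2-byte index width and the function names are assumptions for illustration only, nothing like this is in the patches, and the per-entry instruction addresses (stripped by the `cut` above) are not counted.

```python
ENTRY_BYTES = 6   # size of one ORC entry, as quoted above
INDEX_BYTES = 2   # assumed index width; 345 unique values fit easily


def dictionary_encode(entries):
    """Split entries into a table of unique values plus one index each."""
    table = []
    index_of = {}
    indices = []
    for entry in entries:
        if entry not in index_of:
            index_of[entry] = len(table)
            table.append(entry)
        indices.append(index_of[entry])
    return table, indices


def raw_size(n_entries):
    """Bytes used when every entry is stored in full."""
    return n_entries * ENTRY_BYTES


def encoded_size(n_entries, n_unique):
    """Bytes used by the index array plus the table of unique entries."""
    return n_entries * INDEX_BYTES + n_unique * ENTRY_BYTES


print(raw_size(311095))           # 1866570 bytes stored in full
print(encoded_size(311095, 345))  # 624260 bytes as index + table
```

Under those assumptions the entry data would drop to about a third of its size, at the cost of one extra indirection on every lookup, which is the speed/simplicity trade-off mentioned above.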

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Josh Poimboeuf
On Thu, Jul 13, 2017 at 07:21:15AM -0500, Josh Poimboeuf wrote:
> On Thu, Jul 13, 2017 at 07:17:55AM -0500, Josh Poimboeuf wrote:
> > BTW, while we're throwing out ideas for this, here's another idea,
> > though it's almost certainly not a good one :-)
> > 
> > For user space stack unwinding, the kernel could emulate what the kernel
> > 'guess' unwinder does by scanning the user space stack and returning all
> > the text addresses it finds.

To clarify, text address would mean any address in a VMA with the
executable bit set.

> > The results wouldn't be 100% accurate, but they could end up being
> > useful over time.
> 
> And to expound further on the bad idea, maybe the "bad" addresses could
> be filtered out somehow in post-processing (insert lots of hand waving).

And some details on the post-processing: in most cases it should be
possible to determine which of the found stack addresses are valid by
looking at the call instructions immediately preceding the stack text
addresses, and making sure the call target points to the same function
as the previously found address.  But of course that wouldn't work for
indirect calls.
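
A minimal sketch of that filter, in Python and under several assumptions: only the 5-byte x86 direct near call (opcode 0xE8 followed by a signed 32-bit displacement) is recognized, the text is one flat byte buffer, and function boundaries come from a sorted list of start addresses. All helper names are hypothetical.

```python
import struct
from bisect import bisect_right


def direct_call_target(text, text_base, ret_addr):
    """Target of a `call rel32` that ends exactly at ret_addr, else None."""
    off = ret_addr - text_base - 5
    if off < 0 or off + 5 > len(text) or text[off] != 0xE8:
        return None
    (rel32,) = struct.unpack_from("<i", text, off + 1)
    return ret_addr + rel32  # displacement is relative to the next insn


def function_start(func_starts, addr):
    """Start of the function containing addr (func_starts is sorted)."""
    i = bisect_right(func_starts, addr) - 1
    return func_starts[i] if i >= 0 else None


def filter_candidates(text, text_base, func_starts, candidates):
    """Keep addresses whose preceding call targets the previous function.

    `candidates` are the text addresses found on the stack, innermost
    first.  The first one is accepted unchecked, which is itself a guess.
    """
    kept = []
    for addr in candidates:
        if not kept:
            kept.append(addr)
            continue
        target = direct_call_target(text, text_base, addr)
        if target is None:
            continue  # not preceded by a direct call (or an indirect one)
        if function_start(func_starts, target) == function_start(func_starts, kept[-1]):
            kept.append(addr)
    return kept
```

As written, every return address that follows an indirect call is dropped, along with everything above it that depended on it, which is the limitation noted above.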

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Josh Poimboeuf
On Thu, Jul 13, 2017 at 07:17:55AM -0500, Josh Poimboeuf wrote:
> BTW, while we're throwing out ideas for this, here's another idea,
> though it's almost certainly not a good one :-)
> 
> For user space stack unwinding, the kernel could emulate what the kernel
> 'guess' unwinder does by scanning the user space stack and returning all
> the text addresses it finds.
> 
> The results wouldn't be 100% accurate, but they could end up being
> useful over time.

And to expound further on the bad idea, maybe the "bad" addresses could
be filtered out somehow in post-processing (insert lots of hand waving).

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Josh Poimboeuf
On Thu, Jul 13, 2017 at 11:19:11AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra wrote:
> 
> > > One gloriously ugly hack would be to delay the userspace unwind to
> > > return-to-userspace, at which point we have a schedulable context and can
> > > take faults.
> 
> I don't think it's ugly, and it has various advantages:
> 
> > > Of course, then you have to somehow identify this later unwind sample
> > > with all relevant prior samples and stitch the whole thing back together,
> > > but that should be doable.
> > >
> > > In fact, it would not be at all hard to do, just queue a task_work from
> > > the NMI and have that do the EH based unwind.
> 
> This would have a couple of advantages:
> 
>  - as you mention, being able to fault in debug info and generally do
>    IO/scheduling,
>
>  - profiling overhead would be accounted to the task context that generates
>    it, not the NMI context,
>
>  - there would be a natural batching/coalescing optimization if multiple
>    events hit the same system call: the user-space backtrace would only have
>    to be looked up once for all samples that got collected.
> 
> This could be done by separating the user-space backtrace into a separate
> event, and perf tooling would then apply the same user-space backtrace to all
> prior kernel samples.
> 
> I.e. the ring-buffer would have trace entries like:
> 
>  [ kernel sample #1, with kernel backtrace #1 ]
>  [ kernel sample #2, with kernel backtrace #2 ]
>  [ kernel sample #3, with kernel backtrace #3 ]
>  [ user-space backtrace #1 at syscall return ]
>  ...
> 
> Note how the three kernel samples didn't have to do any user-space unwinding
> at all, so the user-space unwinding overhead got reduced by a factor of 3.
>
> Tooling would know that 'user-space backtrace #1' applies to the previous
> three kernel samples.
> 
> Or so?
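
On the tooling side, the stitching in the quoted proposal could look roughly like this. It is a sketch in Python that assumes a single task and a simple stream of `(kind, trace)` tuples; the "kernel" and "user" kinds are made-up labels, not perf's actual record format.

```python
def stitch(events):
    """Attach each user-space backtrace to the kernel samples before it.

    `events` is an iterable of (kind, trace) tuples in ring-buffer order,
    where kind is "kernel" or "user".  Returns the stitched
    (kernel_trace, user_trace) pairs plus any kernel samples still
    waiting for a user-space backtrace.
    """
    pending = []
    stitched = []
    for kind, trace in events:
        if kind == "kernel":
            pending.append(trace)
        elif kind == "user":
            # One user-space unwind serves every queued kernel sample.
            stitched.extend((ktrace, trace) for ktrace in pending)
            pending = []
    return stitched, pending
```

A real implementation would presumably have to key the pending list by task, and decide what to do with samples whose task exits before returning to user space.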

BTW, while we're throwing out ideas for this, here's another idea,
though it's almost certainly not a good one :-)

For user space stack unwinding, the kernel could emulate what the kernel
'guess' unwinder does by scanning the user space stack and returning all
the text addresses it finds.

The results wouldn't be 100% accurate, but they could end up being
useful over time.
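
Roughly, the scan would amount to something like the following. This is a Python sketch over a copied stack buffer, assuming 8-byte little-endian words and a list of `(start, end)` ranges for the executable mappings; both the function name and the data layout are illustrative.

```python
import struct


def scan_stack_for_text(stack_bytes, exec_ranges):
    """Return every aligned 8-byte stack value that lands in an
    executable range, in stack order (lowest address first)."""
    found = []
    for off in range(0, len(stack_bytes) - 7, 8):
        (value,) = struct.unpack_from("<Q", stack_bytes, off)
        if any(lo <= value < hi for lo, hi in exec_ranges):
            found.append(value)
    return found
```

Every word of the stack is visited, so the cost grows with stack size rather than with call-chain depth, and stale return addresses left behind by earlier calls are reported alongside the live ones.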

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Jiri Kosina
On Wed, 12 Jul 2017, Andi Kleen wrote:

> It's somewhat surprising. It would be good to understand why that 
> happens. Is it icache misses, data cache misses for the stack, or simply 
> more instructions executed, or worse tail calls?

http://lkml.kernel.org/r/20170602104048.jkkzssljsompj...@suse.de

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > Josh Poimboeuf  writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> > 
> > As far as I can tell this whole thing has only downsides compared
> > to the dwarf unwinder that was earlier proposed. I don't see
> > a single advantage.
> 
> Improved speed, reliability, maintainability.  Are those not advantages?

Exactly, and all these advantages of the ORC debuginfo over DWARF debuginfo are
enabled by an unwinding-optimized data format that the kernel project generates,
controls and is able to trust inherently.

DWARF generated by external tooling can just never reach that level of trust,
without insane amounts of formal verification.

Even if ORC was _slower_ its reliability would be reason enough to merge. The
fact that it's 20-40 times faster than the DWARF unwinder is really just icing
on the cake.

BTW., as a side note, (and I hope my optimism isn't premature), I believe the
ORC unwinder is a prime example of where Linus's stubbornness resisting poor
concepts paid off in the long run: had we merged the DWARF unwinder years ago
we'd never have gained the ORC unwinder. We quite literally had to wait over a
decade, but good things happened in the end.

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> > One gloriously ugly hack would be to delay the userspace unwind to
> > return-to-userspace, at which point we have a schedulable context and can
> > take faults.

I don't think it's ugly, and it has various advantages:

> > Of course, then you have to somehow identify this later unwind sample with
> > all relevant prior samples and stitch the whole thing back together, but
> > that should be doable.
> > 
> > In fact, it would not be at all hard to do, just queue a task_work from the
> > NMI and have that do the EH based unwind.

This would have a couple of advantages:

 - as you mention, being able to fault in debug info and generally do 
   IO/scheduling,

 - profiling overhead would be accounted to the task context that generates it,
   not the NMI context,

 - there would be a natural batching/coalescing optimization if multiple events
   hit the same system call: the user-space backtrace would only have to be
   looked up once for all samples that got collected.

This could be done by separating the user-space backtrace into a separate
event, and perf tooling would then apply the same user-space backtrace to all
prior kernel samples.

I.e. the ring-buffer would have trace entries like:

 [ kernel sample #1, with kernel backtrace #1 ]
 [ kernel sample #2, with kernel backtrace #2 ]
 [ kernel sample #3, with kernel backtrace #3 ]
 [ user-space backtrace #1 at syscall return ]
 ...

Note how the three kernel samples didn't have to do any user-space unwinding at 
all, so the user-space unwinding overhead got reduced by a factor of 3.

Tooling would know that 'user-space backtrace #1' applies to the previous three 
kernel samples.

Or so?
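
On the tooling side, the stitching described above could look roughly like
the following. This is only a sketch in Python: the event layout (the "kind",
"tid", "id" and "stack" keys) and the function name stitch are made up for
illustration and do not correspond to any existing perf record format.

```python
from collections import defaultdict


def stitch(events):
    """Attach each deferred user-space backtrace to the kernel samples
    of the same task that were recorded before it.

    Returns (stitched, leftover): stitched is a list of (sample id, full
    stack) pairs, leftover maps tid -> kernel samples that never got a
    user-space backtrace (e.g. the trace ended first).
    """
    pending = defaultdict(list)  # tid -> kernel samples awaiting user frames
    stitched = []
    for ev in events:
        if ev["kind"] == "kernel":
            pending[ev["tid"]].append(ev)
        elif ev["kind"] == "user":
            # One user-space unwind serves every queued sample of that task.
            for sample in pending.pop(ev["tid"], []):
                stitched.append((sample["id"], sample["stack"] + ev["stack"]))
    return stitched, dict(pending)


# Hypothetical ring-buffer contents matching the layout above.
events = [
    {"kind": "kernel", "tid": 1, "id": 1, "stack": ["vfs_read", "sys_read"]},
    {"kind": "kernel", "tid": 1, "id": 2, "stack": ["copy_to_user", "sys_read"]},
    {"kind": "kernel", "tid": 1, "id": 3, "stack": ["sys_read"]},
    {"kind": "user", "tid": 1, "stack": ["read", "main"]},
]
stitched, leftover = stitch(events)
```

Keying the pending samples by task is an assumption of this sketch; it keeps
a backtrace taken at one task's return to user space from being applied to
another task's samples.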

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Peter Zijlstra
On Thu, Jul 13, 2017 at 10:50:15AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 13, 2017 at 09:12:53AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> > > If you want perf to be able to use ORC instead of DWARF for user space
> > > binaries, that's not currently possible, though I don't see any
> > > technical blockers for doing so.  Perf would need to be taught to read
> > > ORC data.
> > 
> > So the problem with userspace stuff is that the unwind data isn't
> > readily available from NMI context.
> > 
> > So the kernel unwinder will trigger a fault and abort.
> > 
> > The very best we can hope for is using the EH [*] stuff that all
> > binaries actually have _and_ map. The only problem is that most programs
> > > don't actually use the EH stuff much so while it's mapped, it's not
> > actually paged in, so we're still stuck.
> 
> One gloriously ugly hack would be to delay the userspace unwind to
> return-to-userspace, at which point we have a schedulable context and
> can take faults.
> 
> Of course, then you have to somehow identify this later unwind sample
> with all relevant prior samples and stitch the whole thing back
> together, but that should be doable.
> 
> In fact, it would be at all hard to do, just queue a task_work from the

+not

> NMI and have that do the EH based unwind.


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Peter Zijlstra
On Thu, Jul 13, 2017 at 09:12:53AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> > If you want perf to be able to use ORC instead of DWARF for user space
> > binaries, that's not currently possible, though I don't see any
> > technical blockers for doing so.  Perf would need to be taught to read
> > ORC data.
> 
> So the problem with userspace stuff is that the unwind data isn't
> readily available from NMI context.
> 
> So the kernel unwinder will trigger a fault and abort.
> 
> The very best we can hope for is using the EH [*] stuff that all
> binaries actually have _and_ map. The only problem is that most programs
> don't actually use the EH stuff much so while it's mapped, it's not
> actually paged in, so we're still stuck.

One gloriously ugly hack would be to delay the userspace unwind to
return-to-userspace, at which point we have a schedulable context and
can take faults.

Of course, then you have to somehow identify this later unwind sample
with all relevant prior samples and stitch the whole thing back
together, but that should be doable.

In fact, it would be at all hard to do, just queue a task_work from the
NMI and have that do the EH based unwind.


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-13 Thread Peter Zijlstra
On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> If you want perf to be able to use ORC instead of DWARF for user space
> binaries, that's not currently possible, though I don't see any
> technical blockers for doing so.  Perf would need to be taught to read
> ORC data.

So the problem with userspace stuff is that the unwind data isn't
readily available from NMI context.

So the kernel unwinder will trigger a fault and abort.

The very best we can hope for is using the EH [*] stuff that all
binaries actually have _and_ map. The only problem is that most programs
don't actually use the EH stuff much so while it's mapped, it's not
actually paged in, so we're still stuck.

[*] C++ ABI requires EH bits for stack unwinding for exception handling
and the like, and because C++ can unwind through C code, C ABI also
mandates EH bits be present.


ORC doesn't much change this. What is currently an option is for perf to
simply copy out the top n-Kb of the stack for each sample (talk about
expensive) and then have userspace unwind it. And for userspace
unwinding in userspace, libunwind and the like are fine, I see absolutely
no reason to use ORC bits here.
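
To put a rough number on "expensive": the copy-out cost scales with sample
rate, dump size and CPU count. The sketch below is plain arithmetic in
Python; the function name and the example values (1000 samples per second,
an 8 KiB dump, 4 CPUs) are illustrative assumptions, not measurements.

```python
def stack_copy_bytes_per_sec(sample_hz, dump_bytes, cpus=1):
    """Bytes of raw user stack copied into the ring buffer per second,
    assuming every sample on every CPU dumps the full dump_bytes."""
    return sample_hz * dump_bytes * cpus


# Assumed workload: 1000 samples/s per CPU, 8 KiB of stack per sample, 4 CPUs.
rate = stack_copy_bytes_per_sec(sample_hz=1000, dump_bytes=8192, cpus=4)
rate_mib = rate / (1024 * 1024)  # 31.25 MiB/s of stack data alone
```

That volume is written at sample time and then has to be read and unwound
again by the user-space tool afterwards.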


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Mike Galbraith
On Wed, 2017-07-12 at 21:40 -0700, Andi Kleen wrote:
> On Thu, Jul 13, 2017 at 06:28:43AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> > > On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > > > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > > > Josh Poimboeuf  writes:
> > > > > >
> > > > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > > > >
> > > > > Can we have an option to just use dwarf instead? For people
> > > > > who don't want to waste a MB+ to solve a problem that doesn't
> > > > > exist (as proven by many years of opensuse kernel experience)
> > > > 
> > > > Sure the dwarf unwinder works well for crashes, but at the price of
> > > > demolishing ftrace/perf utility.
> > > 
> > > You mean the unwind performance?
> > 
> > Yeah, it hurts.. massively, has even been known to kill big boxen.
> 
> Why was that?

Presuming you mean the big box bit, danged if I know, I haven't
personally met that, only the massive overhead.

> > > That's a valid concern, but neither ORC nor dwarf are likely
> > > to address it. However most usages of ftrace/perf shouldn't be that
> > > depending on unwind performance -- just lower the frequency of your
> > > events. 
> > > 
> > > The only possible win is if the win from not using FP code is
> > > significant enough. On the x86 side the only modern CPUs that should really
> > > care about this are Atoms.
> > 
> > Nope, they all care.  Measure performance delta of fast/light stuff.
> 
> Well if your test cares that much about function overhead you may want to try
> LTO. It can get rid of a lot of functions by doing cross file
> inlining.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=lto-411-2
> 
> > Maybe I'm expecting too much good stuff to follow, but don't spoil it
> > for me, I think I'm looking at a real winner :)
> 
> It's somewhat surprising. It would be good to understand why that
> happens. Is it icache misses, data cache misses for the stack, or
> simply more instructions executed, or worse tail calls?

No idea.  It was speculated that it was register loss, but I played
with that, saw nearly zero delta until I stole too many.

-Mike


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andi Kleen
On Thu, Jul 13, 2017 at 06:28:43AM +0200, Mike Galbraith wrote:
> On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> > On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > > Josh Poimboeuf  writes:
> > > > >
> > > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > > >
> > > > Can we have an option to just use dwarf instead? For people
> > > > who don't want to waste a MB+ to solve a problem that doesn't
> > > > exist (as proven by many years of opensuse kernel experience)
> > > 
> > > Sure the dwarf unwinder works well for crashes, but at the price of
> > > demolishing ftrace/perf utility.
> > 
> > You mean the unwind performance?
> 
> Yeah, it hurts.. massively, has even been known to kill big boxen.

Why was that? 

> 
> > That's a valid concern, but neither ORC nor dwarf are likely
> > to address it. However most usages of ftrace/perf shouldn't be that
> > depending on unwind performance -- just lower the frequency of your
> > events. 
> > 
> > The only possible win is if the win from not using FP code is
> > significant enough. On the x86 side the only modern CPUs that should really
> > care about this are Atoms.
> 
> Nope, they all care.  Measure performance delta of fast/light stuff.

Well if your test cares that much about function overhead you may want to try
LTO. It can get rid of a lot of functions by doing cross file
inlining.

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=lto-411-2

> Maybe I'm expecting too much good stuff to follow, but don't spoil it
> for me, I think I'm looking at a real winner :)

It's somewhat surprising. It would be good to understand why that
happens. Is it icache misses, data cache misses for the stack, or
simply more instructions executed, or worse tail calls?

-Andi


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Mike Galbraith
On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > Josh Poimboeuf  writes:
> > > >
> > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > >
> > > Can we have an option to just use dwarf instead? For people
> > > who don't want to waste a MB+ to solve a problem that doesn't
> > > exist (as proven by many years of opensuse kernel experience)
> > 
> > Sure the dwarf unwinder works well for crashes, but at the price of
> > demolishing ftrace/perf utility.
> 
> You mean the unwind performance?

Yeah, it hurts.. massively, has even been known to kill big boxen.

> That's a valid concern, but neither ORC nor dwarf are likely
> to address it. However most usages of ftrace/perf shouldn't be that
> depending on unwind performance -- just lower the frequency of your
> events. 
> 
> The only possible win is if the win from not using FP code is
> significant enough. On the x86 side the only modern CPUs that should really
> care about this are Atoms.

Nope, they all care.  Measure performance delta of fast/light stuff.

Maybe I'm expecting too much good stuff to follow, but don't spoil it
for me, I think I'm looking at a real winner :)

-Mike


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andi Kleen
On Wed, Jul 12, 2017 at 05:47:59PM -0500, Josh Poimboeuf wrote:
> On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > Josh Poimboeuf  writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> > 
> > As far as I can tell this whole thing has only downsides compared
> > to the dwarf unwinder that was earlier proposed. I don't see
> > a single advantage.
> 
> Improved speed, reliability, maintainability.  Are those not advantages?

Ok. We'll see how it works out.

The memory overhead is quite bad though. You're basically undoing many
years of efforts to shrink kernel text. I hope this can still be
done better.

-Andi



Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andi Kleen
On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > Josh Poimboeuf  writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> 
> Sure the dwarf unwinder works well for crashes, but at the price of
> demolishing ftrace/perf utility.

You mean the unwind performance?

That's a valid concern, but neither ORC nor dwarf is likely
to address it. However, most uses of ftrace/perf shouldn't be that
dependent on unwind performance -- just lower the frequency of your
events.

The only possible win is if the win from not using FP code is
significant enough. On the x86 side the only modern CPUs that should really
care about this are Atoms.

-Andi



Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Mike Galbraith
On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> Josh Poimboeuf  writes:
> >
> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> >
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)

Sure the dwarf unwinder works well for crashes, but at the price of
demolishing ftrace/perf utility.

-Mike


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andy Lutomirski
On Wed, Jul 12, 2017 at 3:30 PM, Andi Kleen  wrote:
> Josh Poimboeuf  writes:
>>
>> The ORC data format does have a few downsides compared to DWARF.  The
>> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
>>
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)
>
> As far as I can tell this whole thing has only downsides compared
> to the dwarf unwinder that was earlier proposed. I don't see
> a single advantage.
>

If someone wanted to write an in-kernel DWARF parser that hooked into
the same machinery that Josh is using and comes with a complete formal
verification package, I might not object, with the caveat that it's
likely to be *much* slower than ORC.  By complete formal verification,
I mean that a user tool running the exact same code, compiled with
strong sanitization, should decode the tables for every single kernel
IP and confirm that (a) the output is sane, (b) the output matches
what objtool says it should do and (c) doesn't crash.

I'm not sure I see the point, though.  I also think that Linus would
object, since I asked him quite recently and he said he'd object.


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andres Freund
On 2017-07-12 17:40:45 -0500, Josh Poimboeuf wrote:
> On Wed, Jul 12, 2017 at 03:36:05PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> > > If you want perf to be able to use ORC instead of DWARF for user space
> > > binaries, that's not currently possible, though I don't see any
> > > technical blockers for doing so.  Perf would need to be taught to read
> > > ORC data.
> > 
> > Right, that's what I was hoping for.
> > 
> > 
> > > And I think it should be possible to convert DWARF to ORC, assuming the
> > > DWARF data is trusted.  We could probably add an objtool subcommand for
> > > that.
> > 
> > That'd be pretty helpful.
> 
> Can I ask why?  Is DWARF too slow, or is it something else?

Both. Dwarf is really slow and uses a lot of space - on a bigger machine
it's often nearly unusable. Secondly dwarf isn't available for BPF based
stuff, IIUC because the kernel has to create a full backtrace there
(rather than saving enough data that userland can do so). Which wasn't
"allowed" to be done in-kernel w/ dwarf, just fp so far.

- Andres


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> Josh Poimboeuf  writes:
> >
> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> >
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)
> 
> As far as I can tell this whole thing has only downsides compared
> to the dwarf unwinder that was earlier proposed. I don't see
> a single advantage.

Improved speed, reliability, maintainability.  Are those not advantages?

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 03:36:05PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> > If you want perf to be able to use ORC instead of DWARF for user space
> > binaries, that's not currently possible, though I don't see any
> > technical blockers for doing so.  Perf would need to be taught to read
> > ORC data.
> 
> Right, that's what I was hoping for.
> 
> 
> > And I think it should be possible to convert DWARF to ORC, assuming the
> > DWARF data is trusted.  We could probably add an objtool subcommand for
> > that.
> 
> That'd be pretty helpful.

Can I ask why?  Is DWARF too slow, or is it something else?

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andres Freund
Hi,

On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> If you want perf to be able to use ORC instead of DWARF for user space
> binaries, that's not currently possible, though I don't see any
> technical blockers for doing so.  Perf would need to be taught to read
> ORC data.

Right, that's what I was hoping for.


> And I think it should be possible to convert DWARF to ORC, assuming the
> DWARF data is trusted.  We could probably add an objtool subcommand for
> that.

That'd be pretty helpful.

Greetings,

Andres Freund


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 02:49:20PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-07-11 10:33:37 -0500, Josh Poimboeuf wrote:
> > The simpler debuginfo format also enables the unwinder to be much faster
> > than DWARF, which is important for perf and lockdep.
> 
> Is this going to be usable for userland call-graphs as well? If one
> converts dwarf to that, I mean? Because right now with perf dwarf is
> often the only thing that works properly through libc, as libc isn't
> compiled with fps and has hardcoded asm not preserving fp. lbr isn't
> available for many events, and often not at all available in VMs etc.

Just to clarify, these patches are completely separate from DWARF and
shouldn't break any existing DWARF-based functionality for user space
tooling.  So perf can still use DWARF for user space binaries just fine.

(Also, tools which rely on CONFIG_DEBUG_INFO for kernel debugging, like
gdb and crash, will continue to work.)

If you want perf to be able to use ORC instead of DWARF for user space
binaries, that's not currently possible, though I don't see any
technical blockers for doing so.  Perf would need to be taught to read
ORC data.

And I think it should be possible to convert DWARF to ORC, assuming the
DWARF data is trusted.  We could probably add an objtool subcommand for
that.

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andi Kleen
Josh Poimboeuf  writes:
>
> The ORC data format does have a few downsides compared to DWARF.  The
> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
>
Can we have an option to just use dwarf instead? For people
who don't want to waste a MB+ to solve a problem that doesn't
exist (as proven by many years of opensuse kernel experience)

As far as I can tell this whole thing has only downsides compared
to the dwarf unwinder that was earlier proposed. I don't see
a single advantage.

-Andi


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Andres Freund
Hi,

On 2017-07-11 10:33:37 -0500, Josh Poimboeuf wrote:
> The simpler debuginfo format also enables the unwinder to be much faster
> than DWARF, which is important for perf and lockdep.

Is this going to be usable for userland call-graphs as well? If one
converts dwarf to that, I mean? Because right now with perf dwarf is
often the only thing that works properly through libc, as libc isn't
compiled with fps and has hardcoded asm not preserving fp. lbr isn't
available for many events, and often not at all available in VMs etc.

Regards,

Andres


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> On Wed, Jul 12, 2017 at 10:27:10AM +0200, Ingo Molnar wrote:
> > > Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> > > into the x86 unwinder framework.  Objtool is used to generate the ORC
> > > debuginfo.  The ORC debuginfo format is basically a simplified version
> > > of DWARF CFI.  More details below.
> > 
> > BTW., we should perhaps consolidate our unwinder related Kconfig space, 
> > hierarchically:
> > 
> > CONFIG_UNWINDER
> > CONFIG_UNWINDER_ORC
> > CONFIG_UNWINDER_FRAME_POINTERS
> > 
> > Note that as a side effect it would be a valid small systems build option
> > to have no unwinder at all, if CONFIG_EXPERT=y is set and such:
> > !CONFIG_UNWINDER=n would be a sibling to !CONFIG_BUG.
> 
> So is the idea that CONFIG_UNWINDER=n means "use the 'guess' unwinder"?
> Or should it mean that the unwind API isn't available?
> 
> Without frame pointers and orc, it defaults to the 'guess' unwinder, for
> which the only overhead is a tiny amount of code.  It's still
> technically considered an unwinder because it plugs into the unwind
> interfaces (unwind_start(), unwind_next_frame(), etc) and is used for
> things like /proc/<pid>/stack.
> 
> So I'm not really sure CONFIG_UNWINDER=n would make sense.  Maybe there
> should just be a multiple-choice where you have to choose one of
> CONFIG_UNWINDER_{ORC,FRAME_POINTER,GUESS}.

Ok, you are right.

Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig
interface a bit nicer:

  CONFIG_UNWINDER_FRAME_POINTER
  CONFIG_UNWINDER_ORC
  CONFIG_UNWINDER_GUESS

... or so?

Default would be the historic FRAME_POINTER, at least initially, I think.

I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to the 
non-trivial speedup it offers - but maybe folks would object?

> > CONFIG_FRAME_POINTERS et al would be left for architectures where it has a
> > meaning beyond backtrace generation. (Not sure whether there's any such
> > architectures.)
> 
> Well, on x86, hardened usercopy relies on frame pointers, but not the
> unwinder.  It does the frame pointer walk manually to avoid the full
> unwinder overhead.  See arch_within_stack_frames().

Oh well...

> Ok, how about:
> 
>   "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
>   kernel) than DWARF eh_frame tables."
> 
> (My previous 1MB number was from my distro-based config, and it also
> forgot to take into account the fast lookup table (".orc_lookup")).

Sounds good to me!

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Josh Poimboeuf
On Wed, Jul 12, 2017 at 10:27:10AM +0200, Ingo Molnar wrote:
> > Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> > into the x86 unwinder framework.  Objtool is used to generate the ORC
> > debuginfo.  The ORC debuginfo format is basically a simplified version
> > of DWARF CFI.  More details below.
> 
> BTW., we should perhaps consolidate our unwinder related Kconfig space, 
> hierarchically:
> 
>   CONFIG_UNWINDER
>   CONFIG_UNWINDER_ORC
>   CONFIG_UNWINDER_FRAME_POINTERS
> 
> Note that as a side effect it would be a valid small systems build option to
> have no unwinder at all, if CONFIG_EXPERT=y is set and such:
> !CONFIG_UNWINDER=n would be a sibling to !CONFIG_BUG.

So is the idea that CONFIG_UNWINDER=n means "use the 'guess' unwinder"?
Or should it mean that the unwind API isn't available?

Without frame pointers and orc, it defaults to the 'guess' unwinder, for
which the only overhead is a tiny amount of code.  It's still
technically considered an unwinder because it plugs into the unwind
interfaces (unwind_start(), unwind_next_frame(), etc) and is used for
things like /proc/<pid>/stack.

So I'm not really sure CONFIG_UNWINDER=n would make sense.  Maybe there
should just be a multiple-choice where you have to choose one of
CONFIG_UNWINDER_{ORC,FRAME_POINTER,GUESS}.
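
To make the "tiny amount of code" point concrete, here is a minimal
user-space sketch in C of a guess-style unwinder behind a
start/done/next-frame interface.  It is an illustration under stated
assumptions, not the kernel code: every sketch_* name is hypothetical, the
stack is a plain array of words, and "looks like a text address" is reduced
to a range check against assumed text bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for this sketch; not the kernel's struct unwind_state. */
struct sketch_unwind_state {
	const uintptr_t *sp;	/* next stack word to examine */
	const uintptr_t *end;	/* one past the last stack word */
	uintptr_t text_start;	/* assumed bounds of the text section */
	uintptr_t text_end;
};

static bool sketch_is_text(const struct sketch_unwind_state *s, uintptr_t addr)
{
	return addr >= s->text_start && addr < s->text_end;
}

bool sketch_unwind_done(const struct sketch_unwind_state *s)
{
	return s->sp >= s->end;
}

/* Advance sp until it points at a text-looking word or the stack ends. */
static void sketch_skip_non_text(struct sketch_unwind_state *s)
{
	while (s->sp < s->end && !sketch_is_text(s, *s->sp))
		s->sp++;
}

void sketch_unwind_start(struct sketch_unwind_state *s,
			 const uintptr_t *stack, size_t nwords,
			 uintptr_t text_start, uintptr_t text_end)
{
	assert(s != NULL && stack != NULL);
	s->sp = stack;
	s->end = stack + nwords;
	s->text_start = text_start;
	s->text_end = text_end;
	sketch_skip_non_text(s);
}

/* The word currently under sp, or 0 once the walk is finished. */
uintptr_t sketch_unwind_get_return_address(const struct sketch_unwind_state *s)
{
	return sketch_unwind_done(s) ? 0 : *s->sp;
}

void sketch_unwind_next_frame(struct sketch_unwind_state *s)
{
	if (sketch_unwind_done(s))
		return;
	s->sp++;
	sketch_skip_non_text(s);
}
```

A caller would loop with sketch_unwind_start(), sketch_unwind_done() and
sketch_unwind_next_frame(), reading each candidate with
sketch_unwind_get_return_address().  Because any stack word that happens to
fall inside the text range is reported, the results are guesses rather than
a verified call chain.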

> CONFIG_FRAME_POINTERS et al would be left for architectures where it has a
> meaning beyond backtrace generation. (Not sure whether there's any such
> architectures.)

Well, on x86, hardened usercopy relies on frame pointers, but not the
unwinder.  It does the frame pointer walk manually to avoid the full
unwinder overhead.  See arch_within_stack_frames().
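
For readers unfamiliar with that kind of manual walk, the following is a
rough user-space sketch in C of the idea: follow the chain of saved frame
pointers and accept an object only if it fits entirely between two frame
records.  It is modelled loosely on the behaviour described above and is
not the kernel function itself.  The sketch_* names are hypothetical, the
stack is an array of words in which each frame record is
[saved frame pointer][return address], and the innermost frame pointer is
passed in explicitly instead of being obtained from the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_frame_result {
	SKETCH_BAD_STACK = -1,	/* object overlaps a frame record or leaves the chain */
	SKETCH_GOOD_FRAME = 1,	/* object fits inside a single frame */
};

/*
 * low address ------------------------------------------> high address
 * [saved fp][ret addr][args / locals ...][saved fp][ret addr]
 *                     ^-----------------^
 *                 only objects in here pass
 */
int sketch_within_stack_frames(const uintptr_t *stack,
			       const uintptr_t *stackend,
			       const uintptr_t *innermost_fp,
			       const void *obj, size_t len)
{
	uintptr_t lo = (uintptr_t)stack;
	uintptr_t hi = (uintptr_t)stackend;
	uintptr_t start = (uintptr_t)obj;
	uintptr_t oldframe, frame;

	assert(stack != NULL && stackend != NULL && innermost_fp != NULL);

	oldframe = (uintptr_t)innermost_fp;
	frame = innermost_fp[0];	/* saved frame pointer of the caller */

	/* A zero or out-of-range saved frame pointer ends the walk. */
	while (lo <= frame && frame < hi) {
		if (start + len <= frame) {
			/* Ends below this record: must start above the previous one. */
			return start >= oldframe + 2 * sizeof(uintptr_t) ?
				SKETCH_GOOD_FRAME : SKETCH_BAD_STACK;
		}
		oldframe = frame;
		frame = *(const uintptr_t *)frame;
	}
	return SKETCH_BAD_STACK;
}
```

The loop touches one saved frame pointer per frame, which is why it is
cheaper than running a full unwinder, and also why it has nothing to walk
once frame pointers are compiled out.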

> > The unwinder works well in my testing.  It unwinds through interrupts,
> > exceptions, and preemption, with and without frame pointers, across
> > aligned stacks and dynamically allocated stacks.  If something goes
> > wrong during an oops, it successfully falls back to printing the '?'
> > entries just like the frame pointer unwinder.
> 
> Ok, I'll start applying your patches after -rc1, unless anyone objects.

Thank you Ingo!

> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> 
> Could we also write this in percentage, not absolute RAM size - i.e. ORC
> unwind tables take 30% more RAM (+0.7 MB on an x86 defconfig kernel) than
> DWARF eh_frame tables.

Ok, how about:

  "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
  kernel) than DWARF eh_frame tables."

(My previous 1MB number was from my distro-based config, and it also
forgot to take into account the fast lookup table (".orc_lookup")).
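
As a rough illustration of why a fast lookup table is worth its share of
that memory, here is a minimal user-space sketch in C of a two-level
lookup: a coarse table maps each fixed-size block of text to a position in
the sorted address table, and a binary search then runs only over the few
entries belonging to that block.  This is a sketch under assumptions, not
the actual layout: the sketch_* names, the single-field entry and the
block size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry; the real ORC entry carries more fields. */
struct sketch_orc_entry {
	int16_t sp_offset;
};

struct sketch_orc_table {
	const uintptr_t *ips;			/* sorted start address of each entry */
	const struct sketch_orc_entry *entries;	/* entries[i] applies from ips[i] on */
	size_t num_entries;
	const uint32_t *lookup;			/* num_blocks + 1 indexes into ips[] */
	size_t num_blocks;
	uintptr_t text_start;
	uintptr_t block_size;			/* assumed lookup granularity */
};

/* Last index in [first, last] whose start address is <= ip, or -1 if none. */
static long sketch_search(const uintptr_t *ips, long first, long last,
			  uintptr_t ip)
{
	long found = -1;

	while (first <= last) {
		long mid = first + (last - first) / 2;

		if (ips[mid] <= ip) {
			found = mid;
			first = mid + 1;
		} else {
			last = mid - 1;
		}
	}
	return found;
}

const struct sketch_orc_entry *
sketch_orc_find(const struct sketch_orc_table *t, uintptr_t ip)
{
	size_t block;
	long idx;

	assert(t != NULL && t->block_size != 0);

	if (t->num_entries == 0 || ip < t->text_start)
		return NULL;

	block = (ip - t->text_start) / t->block_size;
	if (block >= t->num_blocks)
		return NULL;

	/* The lookup table narrows the search to the entries for one block. */
	idx = sketch_search(t->ips, (long)t->lookup[block],
			    (long)t->lookup[block + 1], ip);
	return idx < 0 ? NULL : &t->entries[idx];
}
```

In this arrangement lookup[b] holds the index of the entry covering the
first address of block b, so the search for any address in that block is
bounded by lookup[b] and lookup[b + 1].  The cost is one extra index per
block of text.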

-- 
Josh


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> The biggest change is that undwarf was renamed to ORC.  Here's the
> relevant explanation from the docs:
> 
>   Etymology
>   ---------
>   
>   Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
>   enemies.  Similarly, the ORC unwinder was created in opposition to the
>   complexity and slowness of DWARF.
>   
>   "Although Orcs rarely consider multiple solutions to a problem, they do
>   excel at getting things done because they are creatures of action, not
>   thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
>   veracious ORC unwinder wastes no time or siloconic effort decoding
>   variable-length zero-extended unsigned-integer byte-coded
>   state-machine-based debug information entries.
>   
>   Similar to how Orcs frequently unravel the well-intentioned plans of
>   their adversaries, the ORC unwinder frequently unravels stacks with
>   brutal, unyielding efficiency.
>   
>   ORC stands for Oops Rewind Capability.

Perfect naming!

(ORC might also stand for "Optimized Rewind Capability".)
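
For context on the "variable-length zero-extended unsigned-integer
byte-coded" jab in the quoted etymology: DWARF call frame information
encodes many operands as unsigned LEB128, where each byte carries seven
payload bits and the high bit says whether another byte follows.  A minimal
decoder sketch in C is shown below; the function name is hypothetical and
the overflow handling is deliberately simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one unsigned LEB128 value from buf (at most len bytes).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or the encoding is longer than ten bytes.  Excess high bits in a tenth
 * byte are silently dropped in this sketch.
 */
size_t sketch_uleb128_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
	uint64_t value = 0;
	unsigned int shift = 0;
	size_t i;

	assert(buf != NULL && out != NULL);

	for (i = 0; i < len; i++) {
		uint8_t byte = buf[i];

		if (shift >= 64)
			return 0;	/* too many continuation bytes */

		value |= (uint64_t)(byte & 0x7f) << shift;
		shift += 7;

		if (!(byte & 0x80)) {	/* high bit clear: last byte */
			*out = value;
			return i + 1;
		}
	}
	return 0;	/* ran out of input before the final byte */
}
```

Every operand decoded this way costs a data-dependent loop, and the
decoder cannot know an operand's length without reading it.  A table of
fixed-size entries avoids that work at unwind time, at the price of the
larger tables discussed in this thread.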

> Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> into the x86 unwinder framework.  Objtool is used to generate the ORC
> debuginfo.  The ORC debuginfo format is basically a simplified version
> of DWARF CFI.  More details below.

BTW., we should perhaps consolidate our unwinder related Kconfig space, 
hierarchically:

CONFIG_UNWINDER
CONFIG_UNWINDER_ORC
CONFIG_UNWINDER_FRAME_POINTERS

Note that as a side effect it would be a valid small systems build option to
have no unwinder at all, if CONFIG_EXPERT=y is set and such: !CONFIG_UNWINDER=n
would be a sibling to !CONFIG_BUG.

CONFIG_FRAME_POINTERS et al would be left for architectures where it has a
meaning beyond backtrace generation. (Not sure whether there's any such
architectures.)

> The unwinder works well in my testing.  It unwinds through interrupts,
> exceptions, and preemption, with and without frame pointers, across
> aligned stacks and dynamically allocated stacks.  If something goes
> wrong during an oops, it successfully falls back to printing the '?'
> entries just like the frame pointer unwinder.

Ok, I'll start applying your patches after -rc1, unless anyone objects.

> The ORC data format does have a few downsides compared to DWARF.  The
> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.

Could we also write this in percentage, not absolute RAM size - i.e. ORC unwind
tables take 30% more RAM (+0.7 MB on an x86 defconfig kernel) than DWARF
eh_frame tables.

Thanks,

Ingo


Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)

2017-07-12 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> The biggest change is that undwarf was renamed to ORC.  Here's the
> relevant explanation from the docs:
> 
>   Etymology
>   ---------
>   
>   Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
>   enemies.  Similarly, the ORC unwinder was created in opposition to the
>   complexity and slowness of DWARF.
>   
>   "Although Orcs rarely consider multiple solutions to a problem, they do
>   excel at getting things done because they are creatures of action, not
>   thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
>   veracious ORC unwinder wastes no time or siloconic effort decoding
>   variable-length zero-extended unsigned-integer byte-coded
>   state-machine-based debug information entries.
>   
>   Similar to how Orcs frequently unravel the well-intentioned plans of
>   their adversaries, the ORC unwinder frequently unravels stacks with
>   brutal, unyielding efficiency.
>   
>   ORC stands for Oops Rewind Capability.

Perfect naming!

(ORC might also stand for "Optimized Rewind Capability".)

> Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> into the x86 unwinder framework.  Objtool is used to generate the ORC
> debuginfo.  The ORC debuginfo format is basically a simplified version
> of DWARF CFI.  More details below.

BTW., we should perhaps consolidate our unwinder related Kconfig space, 
hierarchically:

CONFIG_UNWINDER
CONFIG_UNWINDER_ORC
CONFIG_UNWINDER_FRAME_POINTERS

Note that as a side effect it would be a valid small systems build option to 
have 
no unwinder at all, if CONFIG_EXPERT=y is set and such: !CONFIG_UNWINDER=n 
would 
be a sibling to !CONFIG_BUG.

CONFIG_FRAME_POINTERS et al would be left for architectures where it has a 
meaning 
beyond backtrace generation. (Not sure whether there's any such architectures.)

> The unwinder works well in my testing.  It unwinds through interrupts,
> exceptions, and preemption, with and without frame pointers, across
> aligned stacks and dynamically allocated stacks.  If something goes
> wrong during an oops, it successfully falls back to printing the '?'
> entries just like the frame pointer unwinder.

Ok, I'll start applying your patches after -rc1, unless anyone objects.

> The ORC data format does have a few downsides compared to DWARF.  The
> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.

Could we also write this in percentage, not absolute RAM size - i.e. ORC unwind 
tables take 30% more RAM (+0.7 MB on an x86 defconfig kernel) than DWARF 
eh_frame 
tables.

Thanks,

Ingo