Re: linux-next: add utrace tree

2010-02-08 Thread Avi Kivity

On 01/27/2010 01:05 PM, Ananth N Mavinakayanahalli wrote:

We don't need to write one. I don't know how easy it is to make the kvm
emulator less kvm-centric (vcpus, kvm_context, etc). Avi?
   


It's a lot of mindless work but not too difficult; replacing hardcoded 
accessors with function pointers.
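
A rough sketch of the shape that takes - an ops table supplied by the caller, so the emulator core never touches kvm-specific state directly. Names here are illustrative, not the actual kvm interfaces:

struct emu_ops {
	int (*read_mem)(void *priv, unsigned long addr, void *val, unsigned len);
	int (*write_mem)(void *priv, unsigned long addr, const void *val, unsigned len);
	unsigned long (*get_reg)(void *priv, int reg);
	void (*set_reg)(void *priv, int reg, unsigned long val);
};

struct emu_ctxt {
	const struct emu_ops *ops;
	void *priv;	/* kvm would pass a vcpu here, uprobes a task */
};

/* The emulator core then calls ctxt->ops->read_mem(ctxt->priv, ...)
 * instead of dereferencing a struct kvm_vcpu directly. */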


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-02-07 Thread Avi Kivity

On 01/27/2010 12:23 PM, Ingo Molnar wrote:

* Avi Kivity <a...@redhat.com> wrote:
   


(back from vacation)


If so then you ignore the obvious solution to _that_ problem: don't use
INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks.
It's _MUCH_ faster than _any_ breakpoint based solution - literally just
the cost of a function call (or not even that - I've written very fast
inlined tracers - they do rock when it comes to performance). Problem
solved and none of the INT3 details matters at all.
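
For reference, a minimal sketch of the compiled-in callback approach described here, with all names hypothetical - the probe site costs a test-and-branch when disabled and a plain function call when enabled:

typedef void (*trace_hook_t)(const char *event, unsigned long arg);

static trace_hook_t trace_hook;	/* stays NULL until a tracer registers itself */

static inline void trace_point(const char *event, unsigned long arg)
{
	if (trace_hook)
		trace_hook(event, arg);
}

void do_request(unsigned long req)
{
	trace_point("do_request:entry", req);
	/* ... actual work ... */
}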
   

However did I not think of that?  Yes, and let's rip out kprobes tracing
from the kernel; we can always rebuild it.

Well, I'm observing an issue in a production system now.  I may not want to
take it down, or if I take it down I may not be able to observe it again as
the problem takes a couple of days to show up, or I may not have the full
source, or it takes 10 minutes to build and so an iterative edit/build/run
cycle can stretch for hours.
 

You have somewhat misconstrued my argument. What I said above is that _if_ you
need extreme levels of performance you always have the option to go even
faster via specialized tracing solutions. I did not promote it as a
replacement solution. Specialization obviously brings in a new set of
problems: inflexibility and non-transparency, an example of which you gave
above.

Your proposed solution brings in precisely such kinds of issues, on a
different level, just to improve performance at the cost of transparency and
at the cost of features and robustness.
   


We just disagree on the intrusiveness, then.  IMO it will be a very rare 
application that really suffers from a vma injection, since most apps 
don't manage their vmas directly but leave it to the kernel and ld.so.



It's btw rather ironic as your arguments are somewhat similar to the Xen vs.
KVM argument just turned around: KVM started out slower by relying on hardware
implementation for virtualization while Xen relied on a clever but limiting
hack. With each CPU generation the hardware got faster, while the various
design limitations of Xen are hurting it and KVM is winning that race.

A (partially) similar situation exists here: INT3 into ring 0 and handling it
there in a protected environment might be more expensive, but _if_ it matters
to performance it sure could be made faster in hardware (and in fact it will
become faster with every new generation of hardware).
   


Not at all.  For kvm hardware eliminates exits completely where pv Xen 
tries to reduce their cost, but an INT3 will be forever much more 
expensive than a jump.


You are right however that we should favour hardware support where 
available, and for high bandwidth tracing, it is available: branch trace 
store.  With that, it is easy to know how many times the processor 
passed through some code point as well as to reconstruct the entire call 
chain, basically what the function tracer does for the kernel.


Do we have facilities for exposing that to userspace?  It can also be 
very useful for the kernel.


It will still be slower if we only trace a few points, and it can't 
trace register and memory values, but it's a good tool to have IMO.
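
As it happens, later kernels did grow a userspace interface to hardware branch records through perf_event_open(). A hedged sketch of opening such an event, shown only to illustrate the direction - this API did not exist at the time of this thread:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr.sample_period = 10000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY | PERF_SAMPLE_BRANCH_USER;
	attr.exclude_kernel = 1;

	/* Sample branch records for the calling thread on any CPU;
	 * the records themselves are read back via the perf mmap ring. */
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		perror("perf_event_open");
	else
		close(fd);
	return 0;
}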



Both Peter and I are telling you that we consider your solution too
specialized, at the cost of flexibility, features and robustness.
   


We'll agree to disagree on that then.

--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-27 Thread Avi Kivity

On 01/27/2010 10:24 AM, Ingo Molnar wrote:




Not to mention that that process could wreck the trace data rendering it
utterly unreliable.
   

It could, but it also might not.  Are we going to deny high performance
tracing to users just because it doesn't work in all cases?
 

Tracing and monitoring are foremost about being able to trust the instrument,
then about performance and usability. That's one of the big things about
ftrace and perf.

By proposing 'user space tracing' you are missing two big aspects:

  - That self-contained, kernel-driven tracing can be replicated in user-space.
It cannot. Sharing and global state is much harder to maintain reliably,
but the bigger problem is that user-space can stomp on its own tracing
state and can make it unreliable. Tracing is often used to figure out bugs,
and tracers will be trusted less if they can stomp on themselves.

  - That somehow it's much faster and that this edge matters. It isn't and it
doesn't matter. The few places that need very, very fast tracing won't use any
of these facilities - they will use something specialized.

So you are creating a solution for special cases that don't need it, and you
are also ignoring prime qualities of a good tracing framework.
   


I see it exactly the opposite.  Only a very small minority of cases will 
have such severe memory corruption that tracing will fall apart because 
of random writes to memory, especially on 64-bit where the address space 
is sparse.  On the other hand, knowing that the cost is a few dozen 
cycles rather than a thousand or so means that you can trace production 
servers running full loads without worrying about whether tracing will 
affect whatever it is you're trying to observe.


I'm not against slow reliable tracing, but we shouldn't ignore the need 
for speed.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-20 Thread Avi Kivity

On 01/19/2010 07:47 PM, Jim Keniston wrote:



This is still with a kernel entry, yes?
 

Yes, this involves setting a breakpoint and trapping into the kernel
when it's hit.  The 6-7x figure is with the current 2-trap approach
(breakpoint, single-step).  Boosting could presumably make that more
like 12-14x.
   


A trap is IIRC ~1000 cycles; we can reduce this to ~50 (totally 
negligible from the executed code's point of view).



Do you have plans for a variant
that's completely in userspace?
 

I don't know of any such plans, but I'd be interested to read more of
your thoughts here.  As I understand it, you've suggested replacing the
probed instruction with a jump into an instrumentation vma (the XOL
area, or something similar).  Masami has demonstrated -- through his
djprobes enhancement to kprobes -- that this can be done for many x86
instructions.

What does the code in the jumped-to vma do?


1. Write a trace entry into shared memory, trap into the kernel on overflow.
2. Trap if a condition is satisfied (fast watchpoint implementation).
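
A hedged sketch of item 1 - the jumped-to stub appends a record to a per-thread buffer mapped into the process and only traps into the kernel when the buffer fills; the layout and names are assumptions, not a proposed uprobes ABI:

#include <stdint.h>

struct trace_buf {
	uint64_t head;		/* next free slot */
	uint64_t size;		/* capacity in slots */
	uint64_t entry[];	/* packed records: ip, timestamp, ... */
};

static inline void trace_emit(struct trace_buf *tb, uint64_t ip, uint64_t tsc)
{
	if (tb->head + 2 > tb->size) {
		__asm__ volatile("int3");	/* overflow: let the kernel drain it */
		tb->head = 0;
	}
	tb->entry[tb->head] = ip;
	tb->entry[tb->head + 1] = tsc;
	tb->head += 2;
}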


Is the instrumentation code
that corresponds to the uprobe handlers encoded in an ad hoc .so?
   


Looks like a good idea, but it doesn't matter much to me.

--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-20 Thread Avi Kivity

On 01/20/2010 11:57 AM, Peter Zijlstra wrote:

On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote:
   

1. Write a trace entry into shared memory, trap into the kernel on overflow.
2. Trap if a condition is satisfied (fast watchpoint implementation).
 

So now you want to consume more of a process' address space to store
trace data as well?


Yes.  I know I'm bad.


Not to mention that that process could wreck the
trace data rendering it utterly unreliable.
   


It could, but it also might not.  Are we going to deny high performance 
tracing to users just because it doesn't work in all cases?


Note this applies to any kind of monitoring or debugging technology.  A 
process can be influenced by the debugger and render any debug info you 
get out of it unreliable.  One non-timing example is a process using a 
checksum of its text as an input to some algorithm.
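
To make that example concrete, a toy program that hashes its own code bytes; patching an int3 (0xCC) into probed_fn changes the result, so any breakpoint-based tool perturbs this input (function and variable names here are made up):

#include <stdint.h>
#include <stdio.h>

__attribute__((noinline)) static int probed_fn(int x)
{
	return x * 2 + 1;
}

static uint32_t text_checksum(const uint8_t *p, unsigned n)
{
	uint32_t sum = 0;

	while (n--)
		sum = sum * 31 + *p++;
	return sum;
}

int main(void)
{
	/* Reading code bytes through a function pointer is implementation-
	 * defined, but works on Linux/x86 where .text is readable. */
	printf("%d, text checksum %08x\n", probed_fn(20),
	       text_checksum((const uint8_t *)probed_fn, 64));
	return 0;
}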


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-20 Thread Avi Kivity

On 01/20/2010 12:45 PM, Srikar Dronamraju wrote:

What does the code in the jumped-to vma do?
   

1. Write a trace entry into shared memory, trap into the kernel on overflow.
2. Trap if a condition is satisfied (fast watchpoint implementation).
 

That looks to be a nice idea. We should certainly look into this
possibility. However can we look at this option probably a little later?

Our plan was to do one step at a time i.e have the basic uprobes in
first and target the booster (i.e jump to the next instruction without
the need for single-stepping next).

We could look at this option of using jump instead of int3 after we are
done with the booster.  Hope that's okay.
   


I'm all for incremental development and merging, as long as we keep the 
interfaces flexible enough for the future.


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-19 Thread Avi Kivity

On 01/19/2010 12:15 AM, Jim Keniston wrote:



I don't like the idea but if the performance benefits are real (are
they?),
 

Based on what seems to be the closest thing to an apples-to-apples
comparison -- counting the number of calls to a specified function --
uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c.
And of course, uprobes provides much, much more flexibility, appears to
scale better, and works with multithreaded apps.

Likewise, FWIW, utrace is more than 10x faster than strace -c in
counting system calls.

   


This is still with a kernel entry, yes?  Do you have plans for a variant 
that's completely in userspace?


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 09:45 AM, Peter Zijlstra wrote:



This is debugging.  We're playing with registers, we're playing with the
cpu, we're playing with memory contents.  Why not the address space as well?
 

Because you want things to be as transparent as possible in order to
avoid heisenbugs. Sure we cannot avoid everything, but we should avoid
everything we possibly can.
   


If we reserve some address space, you don't add any heisenbugs (at 
least, not any additional ones over emulation).  Even if we don't, 
address space layout randomization means we're not keeping the address 
space layout constant between runs anyway.



Also, aside from the VDSO, we simply do not force-map things into address
spaces (and as I said before, I think the VDSO stinks for doing that),
and I think we don't want to create (more) precedents in this case.
   


You've made it clear that you don't like it, but not why.

The kernel already manages the user's address space (except for 
MAP_FIXED which is unreliable unless you've already reserved the address 
space).  I don't see why adding a vma for debugging is so horrible.


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 01:44 PM, Peter Zijlstra wrote:

On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
   

You've made it clear that you don't like it, but not why.

The kernel already manages the user's address space (except for
MAP_FIXED which is unreliable unless you've already reserved the address
space).  I don't see why adding a vma for debugging is so horrible.
 

Well, the kernel only does what the user (and loader) tell it through
mmap().


What I meant was that the kernel chooses the addresses (unless you go 
the MAP_FIXED way).  From the user's point of view, there is no change 
in behaviour: the kernel picks an address.  If the constraints have 
changed (because we reserve a range), that doesn't affect the user.



Other than that we never (except this VDSO thing) inject vmas,
and I see no reason to start doing that now.
   


Maybe you place no value on uprobes.  But people who debug userspace 
likely will see a reason.


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 02:06 PM, Peter Zijlstra wrote:

On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
   

Maybe you place no value on uprobes.  But people who debug userspace
likely will see a reason.
 

I do see value in uprobes, I just don't like it mucking about with the
address space. Nor does it appear required.
   


Well, the alternatives are very unappealing.  Emulation and 
single-stepping are going to be very slow compared to a couple of jumps.


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 02:13 PM, Pekka Enberg wrote:

So how big a chunk of the address space are we talking about here for uprobes?
   


That's for the authors to answer, but at a guess, 32 bytes per probe 
(largest x86 instruction is 15 bytes), so 32 MB will give you a million 
probes.  That's a piece of cake for x86-64, probably harder to justify 
for i386.
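
The arithmetic behind that guess, with an assumed slot layout (not the actual uprobes XOL format):

#define MAX_INSN_BYTES	15	/* longest legal x86 instruction */
#define JMP_BACK_BYTES	5	/* rel32 jmp appended for boosting */
#define XOL_SLOT_BYTES	32	/* 15 + 5, rounded up to a power of two */

/* 1,048,576 probes * 32 bytes = 33,554,432 bytes = 32 MB of XOL area. */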


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 02:51 PM, Pekka Enberg wrote:


And how many probes do we expect to be live at the same time in
real-world scenarios? I guess Avi's one million is more than enough?
   


I don't think a user will ever come close to a million, but we can 
expect some inflation from inlined functions (I don't know if uprobes 
replicates such probes, but if it doesn't, it should).


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 02:57 PM, Pekka Enberg wrote:

On 01/18/2010 02:51 PM, Pekka Enberg wrote:

And how many probes do we expect to be live at the same time in
real-world scenarios? I guess Avi's one million is more than enough?


Avi Kivity wrote:
I don't think a user will ever come close to a million, but we can 
expect some inflation from inlined functions (I don't know if uprobes 
replicates such probes, but if it doesn't, it should).


Right. I guess we're looking at a few megabytes of the address space for 
normal scenarios, which doesn't seem too excessive.


However, as Peter pointed out, the bigger problem is that now we're 
opening the door for other features to steal chunks of the address 
space. And I think it's a legitimate worry that it's going to cause 
problems for 32-bit in the future.


I don't like the idea but if the performance benefits are real (are 
they?), maybe it's a worthwhile trade-off. Dunno.


If uprobes can trace to buffer memory in the process address space, I 
think the win can be dramatic.  Incidentally it will require injecting 
even more vmas into a process.


Basically it means very low cost tracing, like the kernel tracers.

--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 03:15 PM, Peter Zijlstra wrote:

On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote:
   

On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
 
   

Well, the alternatives are very unappealing.  Emulation and
single-stepping are going to be very slow compared to a couple of jumps.

 

With CPL2 or RPL on user segments the protection issue seems to be
manageable for running the instructions from kernel space.

   

CPL2 gives unrestricted access to the kernel address space; and RPL does
not affect page level protection.  Segment limits don't work on x86-64.
But perhaps I missed something - these things are tricky.
 

So setting RPL to 3 on the user segments allows access to kernel pages
just fine? How useful.. :/
   


The further we stay away from segmentation, the better.  Thankfully AMD 
removed hardware task switching from x86-64 so we can't even think about 
that.



It should be possible to translate the instruction into an address space
check, followed by the action, but that's still slower due to privilege
level switches.
 

Well, if you manage to do the address validation you don't need the priv
level switch anymore, right?
   


Right.


Are the ins encodings sane enough to recognize mem parameters without
needing to know the actual ins?
   


No.  You need to know whether the instruction accesses memory or not.

Look at the tables at the beginning of arch/x86/kvm/emulate.c.  Opcodes 
marked with ModRM, BitOp, MemAbs, String, Stack are all different styles 
of memory instructions.  You need to know the operand size for the edge 
cases.  And there are probably a few special cases in the code.
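
A rough illustration of what that table-driven check involves; the flag bits and the tiny table below are made up for the sketch and are not the real emulate.c encodings:

#include <stdbool.h>
#include <stdint.h>

#define F_MODRM		(1u << 0)	/* has a ModRM byte, may name a memory operand */
#define F_MEMABS	(1u << 1)	/* absolute memory operand (e.g. mov moffs) */
#define F_STRING	(1u << 2)	/* string op, implicit [rsi]/[rdi] access */
#define F_STACK		(1u << 3)	/* implicit stack access (push/pop/call/ret) */

static const uint16_t opcode_flags[256] = {
	[0x50 ... 0x57] = F_STACK,		/* push reg */
	[0x88] = F_MODRM, [0x89] = F_MODRM,	/* mov r/m, reg */
	[0xa4] = F_STRING, [0xa5] = F_STRING,	/* movs */
	/* ... the real tables cover the full one- and two-byte opcode maps ... */
};

static bool may_access_memory(uint8_t opcode, uint8_t modrm)
{
	uint16_t f = opcode_flags[opcode];

	if (f & (F_MEMABS | F_STRING | F_STACK))
		return true;
	/* With ModRM, mod == 3 selects a register operand, not memory. */
	return (f & F_MODRM) && ((modrm >> 6) != 3);
}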



How about using a hw-breakpoint to close the gap for the inline single
step? You could even re-insert the int3 lazily when you need the
hw-breakpoint again. It would consume one hw-breakpoint register for
each task/cpu that has probes though..
   


If you have more than four threads, it breaks, no?  And you need an IPI 
each time you hit the breakpoint.


Ultimately I'd like to see the breakpoint avoided as well, use a jump to 
the XOL area and trace in ~20 cycles instead of ~1000.


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-18 Thread Avi Kivity

On 01/18/2010 05:43 PM, Ananth N Mavinakayanahalli wrote:



Well, the alternatives are very unappealing.  Emulation and single-stepping
are going to be very slow compared to a couple of jumps.
   

So how big a chunk of the address space are we talking about here for uprobes?
 

As Srikar mentioned, the least we start with is 1 page. Though you can
have as many probes as you want, there are certain optimizations we can
do, depending on the most common usecases.

For example, if you'd consider the start of a routine to be the most
commonly traced location, most routines in a binary would generally
start with the same instruction (say push %ebp), and we can refcount a
slot with that instruction to be used for all probes of the same
instruction.
   


But then you can't follow the instruction with a jump back to the code...

--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-17 Thread Avi Kivity

On 01/16/2010 02:58 AM, Jim Keniston wrote:


I hear (er, read) you.  Emulation may turn out to be the answer for some
architectures.  But here are some things to keep in mind about the
various approaches:

1. Single-stepping inline is easiest: you need to know very little about
the instruction set you're probing.  But it's inadequate for
multithreaded apps.
2. Single-stepping out of line solves the multithreading issue (as do #3
and #4), but requires more knowledge of the instruction set.  (In
particular, calls, jumps, and returns need special care; as do
rip-relative instructions in x86_64.)  I count 9 architectures that
support kprobes.  I think most of these do SSOL.
3. Boosted probes (where an appended jump instruction removes the need
for the single-step trap on many instructions) require even more
knowledge of the instruction set, and like SSOL, require XOL slots.
Right now, as far as I know, x86 is the only architecture with boosted
kprobes.
4. Emulation removes the need for the XOL area, but requires pretty much
total knowledge of the instruction set.  It's also a performance win for
architectures that can't do #3.  I see kvm implemented on 4
architectures (ia64, powerpc, s390, x86).  Coincidentally, those are the
architectures to which uprobes (old uprobes, with ubp and xol bundled
in) has already been ported (though Intel hasn't been maintaining their
ia64 port).  So it sort of comes down to how objectionable the XOL vma
(or page) really is.
   


The kvm emulator emulates only a subset of the x86 instruction set 
(basically mmio instructions and commonly-used page-table manipulation 
instructions, as well as some privileged instructions).  It would take a 
lot of work to expand it to be completely generic; and even then it will 
fail if userspace uses an instruction set extension the kernel is not 
aware of.


To me, boosted probes with a fallback to single-stepping seems to be the 
better option by far.
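
For concreteness, a sketch of what a boosted XOL slot amounts to: the displaced instruction is copied into the slot and followed by a jump back to the instruction after the probe point, so the single-step trap disappears. Sizes and layout here are assumptions, not the actual uprobes/kprobes code:

#include <stdint.h>
#include <string.h>

#define XOL_SLOT_SIZE	32

/* Assumes the slot lies within +/-2 GB of the probed code, as a rel32
 * jump requires. */
static void fill_boosted_slot(uint8_t *slot, const uint8_t *orig_insn,
			      unsigned insn_len, unsigned long return_addr)
{
	int32_t rel;

	memcpy(slot, orig_insn, insn_len);	/* the displaced instruction */
	slot[insn_len] = 0xe9;			/* jmp rel32 */
	rel = (int32_t)(return_addr - ((unsigned long)slot + insn_len + 5));
	memcpy(slot + insn_len + 1, &rel, sizeof(rel));
}

/* Control flow: int3 at the probe site -> handler redirects to the slot;
 * the copied instruction executes, then the jmp returns to return_addr.
 * Calls, jumps and rip-relative instructions still need the special
 * handling Jim lists above. */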


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

2010-01-17 Thread Avi Kivity

On 01/17/2010 05:03 PM, Peter Zijlstra wrote:



btw, an alternative is to require the caller to provide the address
space for this.  If the caller is in another process, we need to allow
it to play with the target's address space (i.e. mmap_process()).  I
don't think uprobes justifies this by itself, but mmap_process() can be
very useful for sandboxing with seccomp.
 

mmap_process() sounds utterly gross, one process playing with another
process's address space.. yuck!
   


This is debugging.  We're playing with registers, we're playing with the 
cpu, we're playing with memory contents.  Why not the address space as well?


For seccomp, this really should be generalized.  Run a system call on 
behalf of another process, but don't let that process do anything to 
affect it.  I think Google is doing something clever with one thread in 
seccomp mode and another unconstrained, but that's very hacky - you have 
to stop the constrained thread so it can't interfere with the live one.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: x86: do_debug PTRACE_SINGLESTEP broken by 08d68323d1f0c34452e614263b212ca556dae47f

2009-12-20 Thread Avi Kivity

On 12/19/2009 01:15 AM, Frederic Weisbecker wrote:



Apparently it does.  You should hack some printks into do_debug() and see
how kvm is differing from real hardware.  (Actually you can probably do
this with a notifier added by a module, not that you are shy about
recompiling!)

Probably kvm's emulation of the hardware behavior wrt the DR6 bits is not
sufficiently faithful.  Conceivably, kvm is being consistent with some
older hardware and we have encoded assumptions that only newer hardware
meets.  But I'd guess it's just a plain kvm bug.
 
   


A kvm bug is most likely.


It looks like in kvm, before entering the guest, we restore its
debug registers:

vcpu_enter_guest():
if (unlikely(vcpu->arch.switch_db_regs)) {
	set_debugreg(0, 7);
	set_debugreg(vcpu->arch.eff_db[0], 0);
	set_debugreg(vcpu->arch.eff_db[1], 1);
	set_debugreg(vcpu->arch.eff_db[2], 2);
	set_debugreg(vcpu->arch.eff_db[3], 3);
}


But what happens to dr6, I don't know.
   


That's done later, in vmx.c:vmx_vcpu_run():

if (vcpu->arch.switch_db_regs)
	set_debugreg(vcpu->arch.dr6, 6);

Can you describe the failure?  I'll try to construct a test case 
reproducer and work with Jan to fix it.


--
error compiling committee.c: too many arguments to function