[PATCH] some optimizations for Virtual Machines
Roland (and the utrace-devel community),

I have just completed, together with Andrea Gasparini, a first implementation of a kernel module based on utrace that provides fast support for our virtualization environment (view-os/umview). The module is named "kmview" (kernel-mode view-os), and the user-level tool will have the same name. We will release both the module and the user-level program under the GPL as soon as possible.

utrace is a wonderful and well-designed tool. However, IMHO, while implementing kmview we found some improvements (which we have already implemented) that would give better support for virtual machines. Here are some comments. I hope you will share our ideas and merge our improvements into utrace's mainline code soon.

1- Order of callbacks

You say: "Engines are called in the order they attached." This is meaningful for kernel-generated events, but unfortunately it does not give useful semantics for engine nesting when applied to report_syscall_entry. When several tracing/virtual machine tools are involved, the report_syscall_entry callbacks must be evaluated in reverse order.

As an example, I tried to use strace on a view-os-like virtual machine (some syscalls get virtualized). strace works, but being the last engine it shows the modified calls together with the return values of the original calls.

Wrong order:
syscall enter: call -> VM (modification) -> strace -> kernel
syscall exit:  call -> VM (restore) -> strace -> kernel

Right order:
syscall enter: call -> strace -> VM (modification) -> kernel
syscall exit:  call -> VM (restore) -> strace -> kernel

Reversing the traversal of the attached-engine list for syscall_entry solves the problem.

2- Access to traced process vm.

Your interface provides the call utrace_access_process_vm: it allows tracer processes to use /proc/*/mem. Unfortunately write access is denied (as stated in fs/proc/base.c):

> #define mem_write NULL
> #ifndef mem_write
> /* This is a security hazard */

The /proc/*/mem way of accessing a process's vm would be useless anyway. When I write virtual machine support for hundreds of processes, I cannot keep hundreds of files open. On the other hand, I cannot open and close a file for each memory access: we need fast access!

I propose a new call:

int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr,
                             char __user *ubuf, int len, int write, int string);

which gives I/O access to the memory of the process. It has about the same interface as access_process_vm (mm/memory.c), with an extra "string" option (significant only when write==0). Sometimes a read buffer is much larger than the string it is meant to receive. If string==1, the transfer terminates at '\0', avoiding the memory error that could arise from unallocated memory after the string (and giving a slight performance gain).

Before giving access to the process vm, utrace_access_process_vm checks the right to do so using utrace_allow_access_process_vm (it has the same degree of protection as your access to /proc/*/mem).

3- In the patch I have also implemented support for PTRACE_MULTI and PTRACE_SYSVM.

These two extra features provide:

-- PTRACE_MULTI: multiple ptrace operations in one call, including the transfer of chunks of memory and registers. (It would speed up many commands: have a look at "strace strace ls" to see how many bursts of ptrace calls could collapse!) I designed this call for virtual machine support.

-- PTRACE_SYSVM: can be used instead of PTRACE_SYSCALL or SYSEMU. At the end of the pre-syscall protocol it is possible to choose among three different behaviors:
   i- call again after the syscall (maybe some parameters were modified by the virtualization), like PTRACE_SYSCALL;
   ii- skip the upcall after the syscall but do perform the syscall (for a non-virtualized call);
   iii- skip both the system call and the second upcall event (for a completely virtualized call).

PTRACE_SYSVM almost halves the number of context switches for virtual machines. (SYSEMU works only for total virtual machines, while SYSVM also works for partial virtual machines.) There is an extensive description of SYSVM in some messages I sent some time ago on KDML.

We had already implemented these features on the vanilla kernel; this version, based on utrace, is architecture independent.

---
The complete patch is here: http://www.cs.unibo.it/~renzo/utrace/

Unfortunately it is against 2.4.22. I have a very slow connection to the Internet here; I'll try to update the patch to the latest kernel as soon as I return home.

ciao
renzo

--
Renzo Davoli             | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE)  | University of Bologna
Tel. +39 051 2094501
Re: [PATCH] some optimizations for Virtual Machines
A bugfix for my ASCII art ;-)

Wrong order (utrace behavior now):
syscall enter: process -> VM (modification) -> strace -> kernel
syscall exit:  kernel -> VM (restore) -> strace -> process

Right order (proposed):
syscall enter: process -> strace -> VM (modification) -> kernel
syscall exit:  kernel -> VM (restore) -> strace -> process

Sorry for this trailing errata message.

renzo
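The effect of the traversal order can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `sim_engine`, `vm_entry`, `tracer_entry` and `sim_report_syscall_entry` are hypothetical names, not utrace API, and the syscall numbers 8 and 5 are merely illustrative. Engines are stored in attach order (VM first, strace-like tracer last), and the `reverse` argument selects the proposed order for syscall entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a utrace engine; not the kernel API. */
struct sim_engine {
	const char *name;
	/* Called on syscall entry; may rewrite the syscall number. */
	void (*on_entry)(struct sim_engine *self, long *nr);
	long seen_nr;		/* what this engine observed on entry */
};

/* strace-like engine: only observes the call. */
static void tracer_entry(struct sim_engine *self, long *nr)
{
	self->seen_nr = *nr;
}

/* VM-like engine: observes, then virtualizes call 8 into call 5. */
static void vm_entry(struct sim_engine *self, long *nr)
{
	self->seen_nr = *nr;
	if (*nr == 8)
		*nr = 5;
}

/*
 * Deliver a syscall-entry event to engines[0..n-1], stored in attach
 * order.  With reverse != 0 the last attached engine is called first.
 * Returns the syscall number that would reach the kernel.
 */
static long sim_report_syscall_entry(struct sim_engine **engines, size_t n,
				     long nr, int reverse)
{
	assert(engines != NULL);
	for (size_t i = 0; i < n; i++) {
		struct sim_engine *e = engines[reverse ? n - 1 - i : i];

		e->on_entry(e, &nr);
	}
	return nr;
}
```

In this model the tracer sees the already modified call when the list is walked in attach order, and the original call when it is walked in reverse, which is the behavior the corrected diagram asks for.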
[PATCH] update: some optimizations for Virtual Machines
Just a quick note to say that I have updated my patch. More precisely, I have refined the nesting of virtualized syscalls. When a task has several engines and report callbacks can change its state, the quiescent state must be managed for each engine.

syscall enter: process -> VM0 -> VM1 (modify) -> VM2 (second modification) -> kernel
syscall exit:  kernel -> VM2 (restore) -> VM1 (restore) -> VM0 -> process

For both syscall enter and exit, the modification by VM2 must take place after VM1 has completed its job. If VM1 requires the quiescent state to compute its modification of the state, VM2's report_syscall_entry has to wait for VM1 to finish. The same holds for restore, in the opposite direction.

One more idea: the entry.S code provides a way to skip the system call (by setting the syscall number to -1). This feature must be available to nested VMs. The new patch provides the correct nesting: if VM1 (in the example) sets the syscall number to -1, VM2's syscall report is skipped as well. In this case syscall exit skips VM2, restores the syscall number, and calls report_syscall_exit for VM1 and then VM0. Maybe it is useless to call VM1 too (we could start from VM0, given that the call has been skipped), but for now I do it this way for the sake of symmetry. The new patch implements this policy.

Maybe the same idea (wait for the quiescent state at each engine) should be applied to all the other report_* callbacks that can change the state (e.g. report_signal?).

renzo
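The skip policy described above can be modelled in a few lines of userspace C. This is a sketch, not the patch: `sim_vm`, `sim_syscall_entry` and `sim_syscall_exit` are hypothetical names, and the chain is indexed from the process side (VM0) towards the kernel. A layer that virtualizes the call sets the number to -1, the layers closer to the kernel are not notified, and the exit path restores the number and walks back from the layer that skipped.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_ENGINES 8
#define SIM_SKIP (-1L)		/* syscall number meaning "do not run the call" */

/* Hypothetical model of one virtualization layer. */
struct sim_vm {
	int skip_on_entry;	/* nonzero: this layer virtualizes the call */
	int entry_calls;	/* how many entry reports it received */
	int exit_calls;		/* how many exit reports it received */
};

/*
 * Walk the chain from the process side (index 0) towards the kernel.
 * Stops as soon as one layer sets the syscall number to SIM_SKIP.
 * Returns the index of the innermost layer that got the entry report.
 */
static size_t sim_syscall_entry(struct sim_vm *chain, size_t n, long *nr)
{
	assert(n > 0 && n <= SIM_MAX_ENGINES);
	for (size_t i = 0; i < n; i++) {
		chain[i].entry_calls++;
		if (chain[i].skip_on_entry) {
			*nr = SIM_SKIP;
			return i;
		}
	}
	return n - 1;
}

/*
 * Restore the saved syscall number, then walk back from the innermost
 * reported layer to the process side (the symmetric choice in the mail).
 */
static void sim_syscall_exit(struct sim_vm *chain, size_t innermost,
			     long *nr, long saved_nr)
{
	*nr = saved_nr;
	for (size_t i = innermost + 1; i-- > 0; )
		chain[i].exit_calls++;
}
```

With three layers where only VM1 skips, VM0 and VM1 each get one entry and one exit report, and VM2 gets none.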
Re: [PATCH] some optimizations for Virtual Machines
We have updated the patch to the latest kernel. It is here:
http://www.cs.unibo.it/~renzo/utrace

> 1- Order of callbacks

This patch is crucial. Without this change, virtual machines based on utrace cannot be nested.

> 2- Access to traced process vm.

This patch is very important: with this change VM hypervisors can access the memory of their processes efficiently.

> I propose a new call:
> int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr,
> char __user *ubuf, int len, int write, int string);
> which give I-O access to the memory of the process.

I have seen that access_process_vm has been exported to modules: this change has been in mainline 2.6.23 since the first rc. So I have exported access_process_vm_user too. access_process_vm_user differs from access_process_vm in two main ways:
- it copies a memory area directly from a process vm to the user space of current and vice versa (it uses an internal one-page buffer and needs no extra buffers or extra code loops);
- it supports the "string" flag for reading: no useless copies of data after the end of the string, no memory errors due to short strings read into large buffers.

utrace_access_process_vm can be kept or not: modules can call it, instead of calling access_process_vm_user directly, when they need to check that the requesting process has the right to access the other process's vm.

> 3- In the patch I have also implemented the support for PTRACE_MULTI
> and PTRACE_SYSVM.

This patch is just useful. We'll use it to compare the performance of umview and kmview. Several ptrace-based applications could benefit from these features (e.g. when they need to load chunks of memory or chunks of registers, a burst of ptrace calls could be sent as a single call, reducing the number of mode switches).

That's all for now.

renzo
Re: [PATCH] some optimizations for Virtual Machines
> > > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr,
> > > char __user *ubuf, int len, int write, int string);
>
> "string" smells like a hack, someone will come up with his favourite
> structure and ask for flag for copying, say, single-linked lists. :(

No system calls take linked lists as parameters. Many take strings, though: all those that take a pathname. Look at this:

    char *s = strdup(x);
    fd = open(s, O_RDONLY);

This is a quiet, safe chunk of user code. What do you do to grab the value of s (from a virtual machine monitor)?

With ptrace you can loop over PTRACE_PEEKDATA, one call per word of memory, and leave the loop when you find a NUL byte. If you are designing a virtual machine, this is just performance suicide.

You could open /proc/<pid>/mem; in this case either you keep one descriptor open for each controlled process, or you open/read/close the proc file on each access. That is either a scalability nightmare or a performance nightmare.

If you need to go fast, something like

    access_process_vm(,address_of_s, PATH_MAX ...)

can fail, because s could be in the unlucky position at the end of an allocated area.

> > - it copies a memory area directly from a process vm to the user space
> > of current and viceversa (it uses an internal one page buffer,
>
> check for allocation failure

You mean that this:

actual> mm = get_task_mm(tsk);
actual> if (!mm)
actual>         return 0;
actual>
actual> buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

must be changed to:

updated> mm = get_task_mm(tsk);
updated> if (!mm)
updated>         return 0;
updated>
updated> buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
updated> if (!buf) {
updated>         mmput(mm);      /* do not leak the mm reference */
updated>         return 0;
updated> }

Okay, you're right, I'll update the patch.

> > > 3- In the patch I have also implemented the support for PTRACE_MULTI
> > > and PTRACE_SYSVM.
>
> PTRACE_MULTI is horrible, it is asking for pain with compat version.

I do not understand. If you do not use PTRACE_MULTI, ptrace works as usual. Old ptrace users do not use PTRACE_MULTI, so you do not need to support PTRACE_MULTI for backward compatibility with old versions.

> Mode switches are fast. This is a reason why read(2) doesn't have
> batched version, and read is called waaay more often than ptrace.

readv does exist. Mode switches are fast, but fewer mode switches are even faster.

> I also wonder if this was tested with list debugging on: iterating over
> RCU protected list backwards when prev pointers are poisoned shouldn't
> work.

I do not know whether the reverse scan of the list can be done better, or whether my implementation is buggy. What I am saying is that we do need the reverse traversal for SYSCALL_ENTRY; otherwise it is not possible to implement nested services based on utrace.

Regarding the 3 points of my original message:
1- order of calls: the patch (or a different patch implementing the same idea) is needed, otherwise support for nested engines becomes meaningless when dealing with system call virtualization.
2- access to process vm: some solution is needed for fast access to the vm of a utraced process.
3- I need this patch for my project, and I feel that it could speed up some other programs, but it is not as crucial as 1 and 2.

renzo
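The argument for the "string" flag can be shown with a userspace model of a copy that goes through a one-page bounce buffer. This is a sketch and not the code of access_process_vm_user: `sim_copy_from_task` and `SIM_PAGE_SIZE` are hypothetical names, `src_mapped` stands in for the end of the mapped area in the traced process, and the model returns -1 where a real implementation would more likely report a fault or a short copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096

/*
 * Copy up to len bytes from src to dst through a one-page bounce buffer.
 * src_mapped is the number of readable bytes at src (a model of where
 * the mapped area of the traced process ends).  If string is nonzero
 * the copy stops right after the first '\0'.
 * Returns the number of bytes copied, or -1 if the copy would have to
 * read past src_mapped.
 */
static long sim_copy_from_task(char *dst, const char *src, size_t src_mapped,
			       size_t len, int string)
{
	char page[SIM_PAGE_SIZE];
	size_t done = 0;

	assert(dst != NULL && src != NULL);
	while (done < len) {
		size_t avail = src_mapped > done ? src_mapped - done : 0;
		size_t chunk = len - done;
		int hit_nul = 0;

		if (chunk > sizeof(page))
			chunk = sizeof(page);
		if (string) {
			/* Never look past the mapped area for the NUL. */
			size_t scan = chunk < avail ? chunk : avail;
			const char *nul = memchr(src + done, '\0', scan);

			if (nul) {
				chunk = (size_t)(nul - (src + done)) + 1;
				hit_nul = 1;
			}
		}
		if (chunk > avail)
			return -1;	/* would fault on unmapped memory */
		memcpy(page, src + done, chunk);	/* task -> bounce buffer */
		memcpy(dst + done, page, chunk);	/* bounce buffer -> caller */
		done += chunk;
		if (hit_nul)
			break;
	}
	return (long)done;
}
```

Reading a 12-byte pathname into a 64-byte buffer fails without the flag when only 12 bytes are mapped, and succeeds with it, which is the PATH_MAX case discussed in the reply.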
Is PTRACE_SINGLEBLOCK buggy?
Hi Roland, hi everybody,

I have finished teaching my spring term, so I am back working on utrace. I am porting my virtualsquare kmview code to the new kernel versions.

I ran into something that seems to be a bug in PTRACE_SINGLEBLOCK. The source code enclosed below says "OKAY" on a standard 2.6.25.4, while it causes a kernel panic on 2.6.25.4 +
http://people.redhat.com/roland/utrace/2.6-current/linux-2.6-utrace.patch

Is this a bug? (I think so: no combination of syscall parameters should ever cause a kernel panic ;)
Is this a known bug? (e.g. because PTRACE_SINGLEBLOCK is still a WIP with utrace and you are already working on it...)

ciao
renzo
---
#define _GNU_SOURCE             /* for clone() */
#include <stdio.h>
#include <signal.h>
#include <sched.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Child: ask to be traced, then stop so that the parent can act. */
static int child(void *arg)
{
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) {
                perror("ptrace traceme");
        }
        kill(getpid(), SIGSTOP);
        return 0;
}

int main()
{
        int pid, status;
        static char stack[1024];

        /* The stack grows down, so pass an address near the top. */
        if ((pid = clone(child, &stack[1020], SIGCHLD, NULL)) < 0) {
                perror("clone");
                return 0;
        }
        if ((pid = waitpid(pid, &status, WUNTRACED)) < 0) {
                perror("Waiting for stop");
                return 0;
        }
        ptrace(33, pid, 0, 0); /* PTRACE_SINGLEBLOCK */
        printf("OKAY\n");
        return 0;
}
Re: Is PTRACE_SINGLEBLOCK buggy?
Jan Kratochvil has just sent me an e-mail saying that it seems to be a kvm bug (or a bug caused by kvm). He is right: using qemu/kqemu instead of kvm, it does not panic.

Anyway, I am puzzled. Under kvm, PTRACE_SINGLEBLOCK should have the same effect on 2.6.25.4 and on 2.6.25.4+utrace.

2.6.25.4:
  ptrace_resume (kernel/ptrace.c)
    -> user_enable_block_step

2.6.25.4+utrace:
  ptrace_common (kernel/ptrace.c) sets UTRACE_ACTION_BLOCKSTEP
    -> utrace_quiescent (kernel/utrace.c) tests UTRACE_ACTION_BLOCKSTEP
    -> user_enable_block_step

I wonder where the difference is... Anyway, let us wait for the kvm people to fix it.

I want to thank Jan for his quick feedback.

renzo
3- utrace module nesting (again)
(again, because we have already discussed this point)

As a matter of fact, utrace is a very useful and powerful tool for supporting virtualization (it is not just for debugging!). When dealing with nested virtualization, i.e. nested utrace modules registered to track one process, there is a problem.

Almost all the events managed by utrace are notifications of changes made by the kernel, so it is clearly consistent that all the modules are informed in the order in which they registered. Each module can change the perception of the event, and the next utrace module receives a modified event (or none, with UTRACE_ACTION_HIDE). This is the case for:

_UTRACE_EVENT_QUIESCE,      /* Tracing requests stop. */
_UTRACE_EVENT_REAP,         /* Zombie reaped, no more tracing possible. */
_UTRACE_EVENT_CLONE,        /* Successful clone/fork/vfork just done. */
_UTRACE_EVENT_VFORK_DONE,   /* vfork woke from waiting for child. */
_UTRACE_EVENT_EXEC,         /* Successful execve just completed. */
_UTRACE_EVENT_EXIT,         /* Thread exit in progress. */
_UTRACE_EVENT_DEATH,        /* Thread has died. */
_UTRACE_EVENT_SYSCALL_EXIT, /* Returning to user after system call. */
_UTRACE_EVENT_SIGNAL(*),    /* Signal delivery will run a user handler. */
_UTRACE_EVENT_JCTL,         /* Job control stop or continue completed. */

This is the sole exception I see:

_UTRACE_EVENT_SYSCALL_ENTRY, /* User entered kernel for system call. */

When utrace manages a syscall request (an event generated by the process), the notification must be sent in reverse order, i.e. starting from the last registered module and moving towards the first one. Each module's report_syscall_entry can change the parameters and the call itself, or even short-circuit the call (in this latter case no further module should manage the event). If the system call (maybe a different system call) survives the chain, it is submitted to the kernel, and the exit event (kernel generated) is managed using the standard sequence.

Using the same sequence for _UTRACE_EVENT_SYSCALL_ENTRY and _UTRACE_EVENT_SYSCALL_EXIT is inconsistent: it simply forbids virtualization nesting.

IMHO, the sequence for _UTRACE_EVENT_SYSCALL_ENTRY must be the one proposed here and no other. Virtualization can be used for protection (the sandbox effect); providing a way to change the processing order of calls could be used to create threats.

I have updated my previous patch (x86_32 only); you can get it from the view-os svn:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/kmview-kernel-module/kernel_patches/

renzo
Some ideas/proposals on utrace
Dear Roland and dear utrace developers,

I am using utrace in View-OS. kmview is a partial virtual machine engine based on utrace. kmview uses a kernel module, and it is a flexible, performant and transparent replacement for umview (see wiki.virtualsquare.org).

I am sending some messages about issues I found with utrace. I am ready to submit code implementing the fixes I propose, but I would first like to discuss them with you and agree on goals and methods.

Three messages follow, with subjects:
1- TIF_SYSCALL_EMU is useless.
2- "skip syscall" management.
3- utrace module nesting (again)

renzo

--
Renzo Davoli             | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE)  | University of Bologna
Tel. +39 051 2094501     | Mura Anteo Zamboni, 7
Fax. +39 051 2094510     | I-40127 Bologna ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB 2E93 1A01 F646 30EA B487
2- "skip syscall" management.
arch/x86/kernel/entry_32.S provides two ways to skip the call:

> syscall_trace_entry:
>         movl $-ENOSYS,PT_EAX(%esp)
>         movl %esp, %eax
>         xorl %edx,%edx
>         call do_syscall_trace
>         cmpl $0, %eax
*** this:
>         jne resume_userspace    # ret != 0 -> running under PTRACE_SYSEMU,
>                                 # so must skip actual syscall
>         movl PT_ORIG_EAX(%esp), %eax
>         cmpl $(nr_syscalls), %eax
*** or this:
>         jnae syscall_call
>         jmp syscall_exit

Old ptrace used a non-zero return value from do_syscall_trace to skip the call (also skipping the second do_syscall_trace on exit). If orig_eax (the syscall number) is -1, the jnae is not taken, as -1 is seen as the largest unsigned number.

PTRACE_SYSEMU is now implemented in kernel/ptrace.c using the latter method. IMHO the former is better.

In all architectures the code uses the following layers:
1- assembly code layer (entry_*.S for x86)
2- arch/*/kernel/ptrace.c
3- kernel/utrace.c
4- utrace module
or
4- kernel/ptrace.c when backward ptrace compatibility is required

Syscall skipping is a useful feature that many utrace modules may require. Thus my proposal is to use a return value through all the interfaces to skip the call. More precisely:
- interface 1-2 is already in place for x86_32: when do_syscall_trace returns non-zero, the syscall is skipped. Similar handling should be coded for the other architectures. I have already written the fix for ppc, ppc64 and (untested) x86_64 (I needed this for my PTRACE_SYSVM patch).
- interface 2-3: tracehook_report_syscall_entry should return an integer; the call is skipped when it is non-zero.
- interface 3-4: I propose to add an action flag to skip the call. report_syscall_entry can have one extra ACTION_FLAG, say:
  #define UTRACE_SYSCALL_SKIP 0x0100
  It is also possible to ask the lower level to abort the syscall; the arch-dependent part of the kernel decides how to implement it:
  #define UTRACE_SYSCALL_ENOSYS 0x0200

My proposal has some pros:
- SYSEMU management becomes architecture independent. Statements like these can be eliminated:
  unsigned long *scno = &regs->orig_ax; /* XXX */
  unsigned long *retval = &regs->ax;    /* XXX */
- The boundary between the arch-independent and arch-dependent sections of the kernel is more consistent.
- It can be ported to different architectures. kernel/ptrace.c becomes independent of strange syscall and return value encodings.

(BTW: I continue to say that my PTRACE_SYSVM is more flexible than PTRACE_SYSEMU and at least as performant. With PTRACE_SYSEMU the next system call is always virtualized (skipped); with PTRACE_SYSVM it is possible to process the system call parameters and decide on the fly whether the call has to be virtualized or not. PTRACE_SYSEMU supports only global virtualization (like User-Mode Linux), while PTRACE_SYSVM *also* supports partial virtualization (like my umview/kmview).)

renzo
1- TIF_SYSCALL_EMU is useless.
This flag was used by the old ptrace. PTRACE_SYSEMU is now managed by kernel/ptrace.c.

In fact, TIF_SYSCALL_EMU is cleared in ptrace_disable (arch/x86/kernel/ptrace.c) and tested, but never set.

renzo
Re: Tracing Syscalls under Fedora 9
On Fri, Jun 06, 2008 at 04:38:34PM +0200, Martin Süßkraut wrote:
> has the tracing of system calls changed in utrace between Fedora 8 and 9?
>
> My module works fine under Fedora 8, but under Fedora 9 the callbacks
> report_syscall_entry and report_syscall_exit seam not to be invoked
> any more.

I had the same problem. For some reason, the only way to trace syscalls is to also trace UTRACE_EVENT(SIGNAL_TERM) or CORE.

I added an empty report_signal function and now it works.

This behavior is caused by this statement in arch/x86/kernel/ptrace.c:

> if (!tracehook_consider_fatal_signal(current, SIGTRAP, SIG_DFL))
>         goto out;

and by this one in include/linux/tracehook.h:

> static inline int tracehook_consider_fatal_signal(struct task_struct *task,
>                                                   int sig,
>                                                   void __user *handler)
> {
>         return (tsk_utrace_flags(task) & (UTRACE_EVENT(SIGNAL_TERM) |
>                                           UTRACE_EVENT(SIGNAL_CORE)));
> }

So if neither SIGNAL_TERM nor SIGNAL_CORE is caught, syscalls cannot be traced.

Roland, is this a feature or a bug?

renzo
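The interaction between the two quoted fragments can be reproduced with plain bit masks. This is a userspace sketch: the `SIM_*` names and bit positions are hypothetical (the real values come from the utrace headers), and `sim_syscall_report_reached` only models the early exit quoted from arch/x86/kernel/ptrace.c.

```c
#include <assert.h>

/* Hypothetical event numbering; only the structure of the test matters. */
enum sim_event {
	SIM_SYSCALL_ENTRY,
	SIM_SYSCALL_EXIT,
	SIM_SIGNAL_TERM,
	SIM_SIGNAL_CORE,
	SIM_NEVENTS
};
#define SIM_EVENT(e) (1UL << (e))

/* Mirrors the quoted tracehook_consider_fatal_signal() test. */
static int sim_consider_fatal_signal(unsigned long utrace_flags)
{
	return (utrace_flags & (SIM_EVENT(SIM_SIGNAL_TERM) |
				SIM_EVENT(SIM_SIGNAL_CORE))) != 0;
}

/*
 * Mirrors the quoted early exit: the syscall report is reached only if
 * the fatal-signal test passes first.  Returns 1 if a syscall report
 * would be delivered for an engine set with these flags.
 */
static int sim_syscall_report_reached(unsigned long utrace_flags)
{
	/* Only the events modelled above are valid in this sketch. */
	assert((utrace_flags >> SIM_NEVENTS) == 0);

	if (!sim_consider_fatal_signal(utrace_flags))
		return 0;	/* "goto out": the report is never made */
	return (utrace_flags & (SIM_EVENT(SIM_SYSCALL_ENTRY) |
				SIM_EVENT(SIM_SYSCALL_EXIT))) != 0;
}
```

In this model an engine that asks only for the two syscall events never gets a report, while adding either signal event makes the reports appear, which matches the workaround of registering an empty report_signal.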
Re: Tracing Syscalls under Fedora 9
> On Fri, Jun 06, 2008 at 04:38:34PM +0200, Martin Süßkraut wrote:
> > has the tracing of system calls changed in utrace between Fedora 8 and 9?
> For some reason the only way to trace the syscall is to trace also
> UTRACE_EVENT(SIGNAL_TERM)
> or CORE.
>
> I added an empty report_signal function and now it works.

Martin told me by e-mail that the change proposed above solved his problem. I am posting this for the people on the ML who run into the same trouble.

renzo
Utrace and process (partial) virtualization
Dear Roland and dear utrace developers,

I am still having some problems with utrace, and more specifically with the utrace interface for (partial) virtual machines and (again) with the support for nesting utrace engines. I am writing my point of view here for a general discussion.

This is the summary:
1- Virtual machines may need to change the system call
2- UTRACE_SYSCALL_ABORT: is it really useful as a return value for report_syscall_entry?
3- Nesting: is it really useful to run all the reports in a row and (eventually) stop at the end, waiting for all the engines?
4- The evaluation order of report_syscall_entry engines should be reversed

1- This is the simplest suggestion/request. Sometimes virtual machine engines need to change the system call (e.g. the process calls "creat" and the kernel must run "open" instead). I suggest adding some useful inline functions to arch/*/include/asm/syscall.h:

syscall_set_nr    // to set the system call number
syscall_get_pc    // to get/set the program counter
syscall_set_pc
syscall_get_sp    // to get/set the stack pointer
syscall_set_sp

These inline calls would help in writing architecture-independent virtual machine engines.

Now the "hard" part:

2- What is the scenario for virtual machines based on utrace? In my mind there are two or three actors.
K- At the lowest layer there is the kernel, providing utrace.
M- There is a module which uses utrace and virtualizes something. M can do all the virtualization at kernel level, but it may also use:
U- A userland virtual machine monitor.

So we have K, M and U. When a virtualized process makes a syscall, K calls the report_syscall_entry function of M. If M is entirely at kernel level, it can decide whether to abort the syscall (setting UTRACE_SYSCALL_ABORT) or not, but there is no (clean) way to forward the request to U and wait for U's decision about the syscall.

SYSEMU can be implemented with the current utrace interface because it aborts *all* the syscalls. View-OS cannot use it. In fact kmview is a userland VM which needs to decide which system calls must be skipped and which executed. This is not just about View-OS: whoever tries to implement similar features will run into the same problem. Maybe even VMMs implemented entirely in the kernel module need to delay the decision about the action.

I think UTRACE_STOP has exactly this meaning: in Roland's ptrace implementation UTRACE_STOP is used in this way. User-mode Linux running on ptrace does change the registers of the process while the process is in the STOP state.

I am currently trying to implement a new kmview module using UTRACE_STOP. When I need to skip the syscall, I change the syscall number (orig_ax on x86) to -1 while the process is stopped. utrace believes that the syscall is *not* aborted, so it passes orig_ax (return ret ?: regs->orig_ax; in arch/x86/kernel/ptrace.c) to the "entry_{32/64}.S" layer, causing the syscall to be skipped. This is a dirty workaround.

I think that the specific actions (for syscalls, signals) should be accepted by utrace_control(..., UTRACE_RESUME). In this way:
** K calls report_syscall_entry
** M sends the request to U and returns UTRACE_STOP. (M can then process requests for many other processes and many userland VMMs.)
** U receives the request and decides whether to abort or execute the syscall
** U sends its reply to M
** M calls utrace_control UTRACE_RESUME, setting the action flag needed (e.g. UTRACE_SYSCALL_ABORT).

The same scenario applies to userland management of signals: the VMM or debugger may need to delay the decision among the UTRACE_SIGNAL* cases, and it is hard to keep the monitor inside the report_signal upcall waiting to return a value. It would need yet another implementation of some kind of process stop/quiescence inside the module.

3- Following the KMU schema above, let us now depict a scenario where there are multiple M engines and multiple U VMMs on the same process.

If I have understood the code correctly, the current implementation runs all the report upcalls in a row. If some of the report upcalls return UTRACE_STOP, utrace waits for all the stopped engines to send a UTRACE_RESUME. (From utrace.c: "If another engine is keeping @target stopped, then it remains stopped until all engines let it resume.")

All the M engines may try to change the state of the process concurrently, as each engine thinks the process has been stopped for its own management.

Maybe we have two different ideas of the STOP state and of process virtualization. For me, a process in the STOP state is blocked for inspection. During the STOP state a module M can change the process state. By "virtualized process" I mean a process that "sees" an environment different from the one provided by the hosting kernel. A user-mode linux process is a virtualized process. In my mind, several engines working on a process implement several layers of virtualization. The first engine provides the process with a modified virtual world. If a second engine gets loaded on the same process
utrace@FOSDEM
I am at FOSDEM in Brussels. (I'll give a talk tomorrow at 11:00, not directly related to utrace.)

If there are other utrace developers around here, we can meet in person for some brainstorming.

renzo
UTRACE_STOP race condition?
Dear Roland and dear utrace developers,

please help me. Either I have not understood the meaning of UTRACE_STOP, or it is completely useless due to a race condition.

There are always two entities in a utrace interaction: the traced process and the tracing module. When a traced event occurs in the traced process, the corresponding report function gets called in the module. If the report function returns UTRACE_STOP, the traced process stays in a quiescent state and the module wakes it up with a utrace_control(...,UTRACE_RESUME) call *later*.

This *later* is the problem. If the module wakes the traced process too quickly, utrace has not yet put it into a "stopped" state, so the UTRACE_RESUME gets lost. As a consequence, execution is blocked. IMHO, given the current utrace code, there is no way to set up some kind of synchronization in the module to prevent this error.

---

For the sake of simplicity, let us assume one engine attached to the traced process (the problem is the same with more engines).

The point is: when a report function returns UTRACE_STOP and the module later calls utrace_control(...,UTRACE_RESUME), the traced process must not stay stopped.

t=0: Before the report-function calling loop:
       utrace->stopped = 0;   (in start_report: BUG_ON(utrace->stopped);)
t=1: REPORT FUNCTION CALL (no lock!)
t=2: When the report function returns UTRACE_STOP, in finish_callback:
t=3:   spin_lock(&utrace->lock);
       mark_engine_wants_stop(engine);
       spin_unlock(&utrace->lock);
t=4: In utrace_stop(..):
       spin_lock(&utrace->lock);
       utrace->stopped = 1;
       __set_current_state(TASK_TRACED);
       spin_unlock(&utrace->lock);
       schedule();   --> now the traced process is blocked.

The module has "decided" UTRACE_STOP at t=1, so the module can call utrace_control(...,UTRACE_RESUME) at any t>1. If the resume call takes place before t=4, the request is lost and the race condition causes the traced process to stop anyway.

In fact, for 1<t<4, utrace_control still finds utrace->stopped == 0, and therefore it does nothing:

        /*
         * Let the thread resume running.  If it's not stopped now,
         * there is nothing more we need to do.
         */
        if (resume)
                utrace_reset(target, utrace, NULL);
        else
                spin_unlock(&utrace->lock);

-

There are two solutions:

1- (slow & dirty) some sort of synchronization: no utrace_control (or utrace_set_events) should take place during the whole sequence from the report function call to utrace->stopped=1.

2- (the nice one) add another flag named ENGINE_RESUME (like ENGINE_STOP).

That flag must be cleared before calling the report function:

t=0.5: clear_engine_wants_resume(engine);

utrace_control(...,UTRACE_RESUME) should set the flag:

        spin_lock(&utrace->lock);
        mark_engine_wants_resume(engine);
        spin_unlock(&utrace->lock);

utrace_stop at t=4 (inside the lock) must check whether the traced process has already been resumed:

        spin_lock(&utrace->lock);
        spin_lock_irq(&task->sighand->siglock);
        /* final check: is it really necessary to stop? */
        list_for_each_entry_safe(engine, next, &utrace->attached, entry) {
                if ((engine->ops != &utrace_detached_ops) &&
                    engine_wants_stop(engine)) {
                        if (engine_wants_resume(engine))
                                clear_engine_wants_stop(engine);
                        else
                                utrace->stopped = 1;
                }
        }
        if (unlikely(!utrace->stopped)) {
                spin_unlock_irq(&task->sighand->siglock);
                spin_unlock(&utrace->lock);
                return false;
        }

In this way the race condition should be eliminated (it was eliminated in my proof-of-concept patched utrace implementation). If utrace_stop discovers that a resume request is already pending, the traced process is not blocked.

-

Ptrace on utrace works because there is a workaround: the notification to the ptracer is sent from within the utrace_stop function *after utrace->stopped has been set*. Otherwise ptrace would suffer from the same race condition.

I am looking forward to hearing some comments on this. From what I see, kmview cannot be implemented on the current utrace implementation.

renzo
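The lost wakeup and the ENGINE_RESUME idea can be replayed deterministically in userspace, without threads, by calling the steps in the order of the timeline. This is a model under stated assumptions, not utrace code: `sim_utrace` and the `sim_*` functions are hypothetical, there is a single engine, and the `with_fix` argument only selects whether the pending-resume check from solution 2 is performed.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_ENGINE_STOP		(1UL << 0)
#define SIM_ENGINE_RESUME	(1UL << 1)

/* Hypothetical model of the per-task state, with one attached engine. */
struct sim_utrace {
	unsigned long engine_flags;
	bool stopped;
};

/* t=0.5: forget any stale resume before calling the report function. */
static void sim_start_report(struct sim_utrace *u)
{
	assert(!u->stopped);
	u->engine_flags &= ~SIM_ENGINE_RESUME;
}

/* t=3: the report function has returned UTRACE_STOP. */
static void sim_finish_callback_stop(struct sim_utrace *u)
{
	u->engine_flags |= SIM_ENGINE_STOP;
}

/* utrace_control(..., UTRACE_RESUME) analogue: may run at any t > 1. */
static void sim_control_resume(struct sim_utrace *u)
{
	u->engine_flags |= SIM_ENGINE_RESUME;
	if (u->stopped) {	/* already blocked: wake it up */
		u->engine_flags &= ~(SIM_ENGINE_STOP | SIM_ENGINE_RESUME);
		u->stopped = false;
	}
}

/*
 * t=4: utrace_stop analogue.  Returns true if the task really blocks.
 * Without the fix a resume that arrived early is simply ignored.
 */
static bool sim_utrace_stop(struct sim_utrace *u, bool with_fix)
{
	if (with_fix && (u->engine_flags & SIM_ENGINE_STOP) &&
	    (u->engine_flags & SIM_ENGINE_RESUME)) {
		/* A resume is already pending: do not block. */
		u->engine_flags &= ~(SIM_ENGINE_STOP | SIM_ENGINE_RESUME);
		return false;
	}
	if (!(u->engine_flags & SIM_ENGINE_STOP))
		return false;
	u->stopped = true;
	return true;
}
```

Replaying "resume before t=3" blocks the task forever without the check and lets it continue with the check, while a resume that arrives after t=4 works in both variants.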
Re: UTRACE_STOP race condition?
On Wed, Feb 11, 2009 at 09:45:15AM -0500, Frank Ch. Eigler wrote:
> This may not answer your question, but I believe it is not proper to
> to make this call at any time t>1, only once you receive the quiesce
> callback.

Maybe I am wrong, but the quiesce callback gets called *before* the other report_* callbacks (say syscall_entry). So when I capture UTRACE_QUIESCE, I get the report call before t=1.

Some communication from utrace to the module should happen *after* utrace->stopped is set to 1 (something similar to the code Roland added for ptrace).

Even if it worked this way (i.e. return STOP and wait for report_quiesce; I think the race condition would be there in any case), the interface to the module would be horrible. When the module receives a report callback, it returns UTRACE_STOP, and then it needs some data structure to wait for a report_quiesce before restarting the traced process.

With the idea of the patch included in my previous mail, there is no need for such complexity.

Thank you for taking part in this discussion.

renzo
[PATCH] UTRACE_STOP race condition?
Dear Roland, dear utrace developers, I have now a complete patch that seems to be quite stable. At least Kmview have passed through the tests without getting stuck randomly for the race condition. All the other comments about utrace&virtualization (see my message of Feb 04) are already pending 1- Virtual Machines may need to change the system call 2- UTRACE_SYSCALL_ABORT: is it really useful as a return value for report_syscall_entry? 3- Nesting, is it really useful to run all the reports in a row and (eventually) stop and the end waiting for all the engines? 4- report_syscall_entry engines evaluation order should be reversed ciao renzo --- linux-2.6.29-rc4-utrace/kernel/utrace.c.mcgrath 2009-02-13 18:28:25.0 +0100 +++ linux-2.6.29-rc4-utrace/kernel/utrace.c 2009-02-13 19:14:18.0 +0100 @@ -491,6 +491,13 @@ #define DEAD_FLAGS_MASK(UTRACE_EVENT(REAP)) #define LIVE_FLAGS_MASK(~0UL) +static void mark_engine_wants_stop(struct utrace_attached_engine *engine); +static void clear_engine_wants_stop(struct utrace_attached_engine *engine); +static bool engine_wants_stop(struct utrace_attached_engine *engine); +static void mark_engine_wants_resume(struct utrace_attached_engine *engine); +static void clear_engine_wants_resume(struct utrace_attached_engine *engine); +static bool engine_wants_resume(struct utrace_attached_engine *engine); + /* * Perform %UTRACE_STOP, i.e. block in TASK_TRACED until woken up. * @task == current, @utrace == current->utrace, which is not locked. @@ -500,6 +507,7 @@ static bool utrace_stop(struct task_struct *task, struct utrace *utrace) { bool killed; + struct utrace_attached_engine *engine, *next; /* * @utrace->stopped is the flag that says we are safely @@ -521,6 +529,23 @@ return true; } + /* final check: it is really needed to stop? 
*/ + list_for_each_entry_safe(engine, next, &utrace->attached, entry) { + if ((engine->ops != &utrace_detached_ops) && engine_wants_stop(engine)) { + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } + else + utrace->stopped = 1; + } + } + if (unlikely(!utrace->stopped)) { + spin_unlock_irq(&task->sighand->siglock); + spin_unlock(&utrace->lock); + return false; + } + utrace->stopped = 1; __set_current_state(TASK_TRACED); @@ -784,6 +809,7 @@ * to record whether the engine is keeping the target thread stopped. */ #define ENGINE_STOP(1UL << _UTRACE_NEVENTS) +#define ENGINE_RESUME (1UL << (_UTRACE_NEVENTS+1)) static void mark_engine_wants_stop(struct utrace_attached_engine *engine) { @@ -800,6 +826,21 @@ return (engine->flags & ENGINE_STOP) != 0; } +static void mark_engine_wants_resume(struct utrace_attached_engine *engine) +{ + engine->flags |= ENGINE_RESUME; +} + +static void clear_engine_wants_resume(struct utrace_attached_engine *engine) +{ + engine->flags &= ~ENGINE_RESUME; +} + +static bool engine_wants_resume(struct utrace_attached_engine *engine) +{ + return (engine->flags & ENGINE_RESUME) != 0; +} + /** * utrace_set_events - choose which event reports a tracing engine gets * @target:thread to affect @@ -1050,6 +1091,10 @@ list_move(&engine->entry, &detached); } else { flags |= engine->flags | UTRACE_EVENT(REAP); + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } wake = wake && !engine_wants_stop(engine); } } @@ -1282,6 +1327,7 @@ * There might not be another report before it just * resumes, so make sure single-step is not left set. */ + mark_engine_wants_resume(engine); if (likely(resume)) user_disable_single_step(target); break;
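A small userspace model of the two-flag handshake in this patch may make the intent easier to follow. It is only a sketch, not kernel code: every `model_*` name is hypothetical, the bit positions are arbitrary (the patch uses `_UTRACE_NEVENTS` and `_UTRACE_NEVENTS+1`), and locking is left out because the model is single-threaded. It assumes, as the hunks above suggest, that a resume request sets ENGINE_RESUME and that the final check in utrace_stop consumes it. Unlike the patch, the model consumes a pending ENGINE_RESUME even when ENGINE_STOP is not set, so that no stale flag survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real patch derives them from _UTRACE_NEVENTS. */
#define ENGINE_STOP   (1UL << 0)
#define ENGINE_RESUME (1UL << 1)

struct model_engine {
	unsigned long flags;
};

/* Tracee side: the report callback returned UTRACE_STOP. */
static void model_mark_stop(struct model_engine *e)
{
	assert(e != NULL);
	e->flags |= ENGINE_STOP;
}

/* Tracer side, two-bit scheme: remember the resume request even if the
 * tracee has not recorded its wish to stop yet. */
static void model_resume_request(struct model_engine *e)
{
	assert(e != NULL);
	e->flags &= ~ENGINE_STOP;
	e->flags |= ENGINE_RESUME;
}

/* Tracer side, one-bit scheme: clearing a flag that is not set yet leaves
 * no trace, so an early request is lost. */
static void model_resume_request_one_bit(struct model_engine *e)
{
	assert(e != NULL);
	e->flags &= ~ENGINE_STOP;
}

/* Tracee side: final check before sleeping. Returns true if at least one
 * engine still wants the task stopped. */
static bool model_must_block(struct model_engine *engines, size_t n)
{
	bool stopped = false;

	assert(engines != NULL);
	for (size_t i = 0; i < n; i++) {
		struct model_engine *e = &engines[i];

		if (e->flags & ENGINE_RESUME)
			e->flags &= ~(ENGINE_STOP | ENGINE_RESUME);
		else if (e->flags & ENGINE_STOP)
			stopped = true;
	}
	return stopped;
}
```

Calling `model_resume_request()` before `model_mark_stop()` reproduces the early-resume ordering: the final check then lets the task run, whereas the one-bit variant leaves it blocked.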
[PATCH] #2 UTRACE_STOP race condition & nesting
Dear Roland, dear utrace developers,

This is an updated patch. It solves the race condition and gives a quick (a bit dirty) solution to issues 3 and 4.

3- Nesting: is it really useful to run all the reports in a row and (eventually) stop at the end, waiting for all the engines?
The patch waits for each engine to resume before notifying the next registered engine.

4- report_syscall_entry engine evaluation order should be reversed
The REPORT macros have an extra "reverse" argument. The macros append this string to the list_for_each_entry_safe function name. All the macro calls leave this argument empty except the one in report_syscall_entry, where it is set to _reverse.

With this patch it is possible to run nested kmview machines, and ptrace works inside the virtual machines.

This patch is "a bit dirty" because the variables and sections of code needed to count and test the stopped engines are useless here: a task can be kept stopped by at most one engine at a time. This patch is a proof of concept to show what I meant in my previous message.

As for 1 and 2 (not included in this patch):

1- Virtual Machines may need to change the system call
This is just to simplify the implementation of architecture-independent virtual machines. I have kept the definitions of the missing functions in the kmview module code.

2- UTRACE_SYSCALL_ABORT: is it really useful as a return value for report_syscall_entry?
It is useless for kmview, as the decision to abort the system call is taken while the process is stopped. I am currently setting the syscall number to -1 to skip the syscall.

For the sake of completeness, there is another way to implement the partial virtual machine stuff: introducing another "quiescence" state inside the report upcalls. I mean: when utrace calls a report function (say, for example, report_syscall_entry), the function in the module puts the process in a stopped state (maybe TASK_TRACED) and calls schedule().
From utrace's point of view the report function does not return until all the changes in the task state have been completed and the decision UTRACE_RESUME/UTRACE_SYSCALL_ABORT has been taken. In this way UTRACE_STOP is never used because the module has to implement another feature similar to UTRACE_STOP on its own. So what is UTRACE_STOP for?

ciao
renzo

--- linux-2.6.29-rc4-utrace/kernel/utrace.c.mcgrath 2009-02-13 18:28:25.0 +0100 +++ linux-2.6.29-rc4-utrace/kernel/utrace.c 2009-02-14 09:17:31.0 +0100 @@ -491,6 +491,13 @@ #define DEAD_FLAGS_MASK(UTRACE_EVENT(REAP)) #define LIVE_FLAGS_MASK(~0UL) +static void mark_engine_wants_stop(struct utrace_attached_engine *engine); +static void clear_engine_wants_stop(struct utrace_attached_engine *engine); +static bool engine_wants_stop(struct utrace_attached_engine *engine); +static void mark_engine_wants_resume(struct utrace_attached_engine *engine); +static void clear_engine_wants_resume(struct utrace_attached_engine *engine); +static bool engine_wants_resume(struct utrace_attached_engine *engine); + /* * Perform %UTRACE_STOP, i.e. block in TASK_TRACED until woken up. * @task == current, @utrace == current->utrace, which is not locked. @@ -500,6 +507,7 @@ static bool utrace_stop(struct task_struct *task, struct utrace *utrace) { bool killed; + struct utrace_attached_engine *engine, *next; /* * @utrace->stopped is the flag that says we are safely @@ -521,6 +529,23 @@ return true; } + /* final check: is really needed to stop? 
*/ + list_for_each_entry_safe(engine, next, &utrace->attached, entry) { + if ((engine->ops != &utrace_detached_ops) && engine_wants_stop(engine)) { + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } + else + utrace->stopped = 1; + } + } + if (unlikely(!utrace->stopped)) { + spin_unlock_irq(&task->sighand->siglock); + spin_unlock(&utrace->lock); + return false; + } + utrace->stopped = 1; __set_current_state(TASK_TRACED); @@ -784,6 +809,7 @@ * to record whether the engine is keeping the target thread stopped. */ #define ENGINE_STOP(1UL << _UTRACE_NEVENTS) +#define ENGINE_RESUME (1UL << (_UTRACE_NEVENTS+1)) static void mark_engine_wants_stop(struct utrace_attached_engine *engine) { @@ -800,6 +826,21 @@ return (engine->flags & ENGINE_STOP) != 0; } +static void mark_engine_wants_resume(struct utrace_attached_engine *engine) +{ + engine->flags |= ENGINE_RESUME; +} + +static void clear_engine_wants_resume(struct utrace_attached_engine *engine) +{ + engine->flags &= ~ENGINE_RESUME;
Re: [PATCH] UTRACE_STOP race condition?
Dear Roland, dear utrace developers, I have updated my patch #1 (it solves the race condition on utrace_stop but not the nesting issue) for the latest version of utrace. renzo On Fri, Feb 13, 2009 at 09:29:25PM +0100, Renzo Davoli wrote: > I have now a complete patch that seems to be quite stable. > At least Kmview have passed through the tests without getting stuck randomly > for the race condition. > --- --- kernel/utrace.c.mcgrath 2009-03-05 15:09:57.0 +0100 +++ kernel/utrace.c 2009-03-06 11:20:48.0 +0100 @@ -369,6 +369,13 @@ return killed; } +static void mark_engine_wants_stop(struct utrace_engine *engine); +static void clear_engine_wants_stop(struct utrace_engine *engine); +static bool engine_wants_stop(struct utrace_engine *engine); +static void mark_engine_wants_resume(struct utrace_engine *engine); +static void clear_engine_wants_resume(struct utrace_engine *engine); +static bool engine_wants_resume(struct utrace_engine *engine); + /* * Perform %UTRACE_STOP, i.e. block in TASK_TRACED until woken up. * @task == current, @utrace == current->utrace, which is not locked. @@ -378,6 +385,7 @@ static bool utrace_stop(struct task_struct *task, struct utrace *utrace) { bool killed; + struct utrace_engine *engine, *next; /* * @utrace->stopped is the flag that says we are safely @@ -399,7 +407,23 @@ return true; } - utrace->stopped = 1; + /* final check: it is really needed to stop? */ + list_for_each_entry_safe(engine, next, &utrace->attached, entry) { + if ((engine->ops != &utrace_detached_ops) && engine_wants_stop(engine)) { + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } + else + utrace->stopped = 1; + } + } + if (unlikely(!utrace->stopped)) { + spin_unlock_irq(&task->sighand->siglock); + spin_unlock(&utrace->lock); + return false; + } + __set_current_state(TASK_TRACED); /* @@ -625,6 +649,7 @@ * to record whether the engine is keeping the target thread stopped. 
*/ #define ENGINE_STOP(1UL << _UTRACE_NEVENTS) +#define ENGINE_RESUME (1UL << (_UTRACE_NEVENTS+1)) static void mark_engine_wants_stop(struct utrace_engine *engine) { @@ -641,6 +666,21 @@ return (engine->flags & ENGINE_STOP) != 0; } +static void mark_engine_wants_resume(struct utrace_engine *engine) +{ + engine->flags |= ENGINE_RESUME; +} + +static void clear_engine_wants_resume(struct utrace_engine *engine) +{ + engine->flags &= ~ENGINE_RESUME; +} + +static bool engine_wants_resume(struct utrace_engine *engine) +{ + return (engine->flags & ENGINE_RESUME) != 0; +} + /** * utrace_set_events - choose which event reports a tracing engine gets * @target:thread to affect @@ -891,6 +931,10 @@ list_move(&engine->entry, &detached); } else { flags |= engine->flags | UTRACE_EVENT(REAP); + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } wake = wake && !engine_wants_stop(engine); } } @@ -1110,6 +1154,7 @@ * There might not be another report before it just * resumes, so make sure single-step is not left set. */ + mark_engine_wants_resume(engine); if (likely(resume)) user_disable_single_step(target); break;
Re: [PATCH] #2 UTRACE_STOP race condition & nesting
Dear Roland, dear utrace developers,

I have also updated the second patch (which includes the first). This patch fixes the utrace_stop race condition and implements a consistent model of tracing engine nesting.

renzo

On Sat, Feb 14, 2009 at 10:11:55AM +0100, Renzo Davoli wrote:
>
> This is an updated patch. It solves the race condition + it gives a quick (a bit dirty)
> solution to issues 3&4.
> 3- Nesting, is it really useful to run all the reports in a row and
> (eventually) stop and the end waiting for all the engines?
> The patch waits for each engine to resume before notifying the next registered engine.
> 4- report_syscall_entry engines evaluation order should be reversed
> REPORT macros have an extra "reverse" argument. The macros append this string to the
> list_for_each_entry_safe function name. All the macro calls skip this argument except
> the one in report_syscall_entry where it is set to _reverse.
>
> With this patch it is possible to run nested kmview machines and ptrace works inside
> the virtual machines.
>
> This patch is "a bit dirty" because variables and sections of code needed to count and test
> the stopped engines are useless here: a task can be kept stopped for at most one engine at
> a time.
>
> This patch is a proof-of concept to show what I meant in my previous message.
>
> For what concerns 1&2 (not included in this patch):
> 1- Virtual Machines may need to change the system call
> THis is just to simplify the implementation of arch. independent virtual machine.
> I have kept the definition of missing functions in the kmview module code.
> 2- UTRACE_SYSCALL_ABORT: is it really useful as a return value for
> report_syscall_entry?
> It is useless for kmview as the decision of aborting the system call is taken while
> the process is stopped, I am currently setting the syscall number to -1 to
> skip the syscall.
> > For the sake of completeness there is another way to implement the partial > virtual machine > stuff by introducing another "quiescence" state inside the report upcalls. > I mean: when utrace calls a report function (say for example > report_syscall_entry), the function > in the module puts the process in a stopped state (maybe its TASK_TRACED and > calls the schedule). > >From utrace's point of view the report function does not return until all > >the changes in > the task state have been completed and the decision > UTRACE_RESUME/UTRACE_SYSCALL_ABORT has been taken. > In this way UTRACE_STOP is never used because the module has to implement > another feature > similar to UTRACE_STOP on its own. So what is UTRACE_STOP for? > > ciao > renzo --- --- kernel/utrace.c.mcgrath 2009-03-05 15:09:57.0 +0100 +++ kernel/utrace.c 2009-03-06 11:49:15.0 +0100 @@ -369,6 +369,13 @@ return killed; } +static void mark_engine_wants_stop(struct utrace_engine *engine); +static void clear_engine_wants_stop(struct utrace_engine *engine); +static bool engine_wants_stop(struct utrace_engine *engine); +static void mark_engine_wants_resume(struct utrace_engine *engine); +static void clear_engine_wants_resume(struct utrace_engine *engine); +static bool engine_wants_resume(struct utrace_engine *engine); + /* * Perform %UTRACE_STOP, i.e. block in TASK_TRACED until woken up. * @task == current, @utrace == current->utrace, which is not locked. @@ -378,6 +385,7 @@ static bool utrace_stop(struct task_struct *task, struct utrace *utrace) { bool killed; + struct utrace_engine *engine, *next; /* * @utrace->stopped is the flag that says we are safely @@ -399,7 +407,23 @@ return true; } - utrace->stopped = 1; + /* final check: is really needed to stop? 
*/ + list_for_each_entry_safe(engine, next, &utrace->attached, entry) { + if ((engine->ops != &utrace_detached_ops) && engine_wants_stop(engine)) { + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } + else + utrace->stopped = 1; + } + } + if (unlikely(!utrace->stopped)) { + spin_unlock_irq(&task->sighand->siglock); + spin_unlock(&utrace->lock); + return false; + } + __set_current_state(TASK_TRACED); /* @@ -625,6 +649,7 @@ * to record whether the engine is keeping the target thread stopped. */ #define ENGINE_STOP(1UL << _UTRACE_NEVENTS
[PATCH 1/2] UTRACE_STOP race condition (updated)
Dear Roland, dear utrace developers,

I have updated my patch #1 (it solves the race condition in utrace_stop but not the nesting issue) for the latest version of utrace. I am trying to keep the patches up to date by downloading, compiling and testing the fixes every week or so... Things would be easier if these patches could be merged into the mainstream ;-)

renzo

diff -Naur linux-2.6.29-rc7-git5-utrace/kernel/utrace.c linux-2.6.29-rc7-git5-utrace-p1/kernel/utrace.c --- linux-2.6.29-rc7-git5-utrace/kernel/utrace.c 2009-03-12 11:00:09.0 +0100 +++ linux-2.6.29-rc7-git5-utrace-p1/kernel/utrace.c 2009-03-12 11:05:50.0 +0100 @@ -376,6 +376,13 @@ return killed; } +static void mark_engine_wants_stop(struct utrace_engine *engine); +static void clear_engine_wants_stop(struct utrace_engine *engine); +static bool engine_wants_stop(struct utrace_engine *engine); +static void mark_engine_wants_resume(struct utrace_engine *engine); +static void clear_engine_wants_resume(struct utrace_engine *engine); +static bool engine_wants_resume(struct utrace_engine *engine); + /* * Perform %UTRACE_STOP, i.e. block in TASK_TRACED until woken up. * @task == current, @utrace == current->utrace, which is not locked. @@ -385,6 +392,7 @@ static bool utrace_stop(struct task_struct *task, struct utrace *utrace) { bool killed; + struct utrace_engine *engine, *next; /* * @utrace->stopped is the flag that says we are safely @@ -406,7 +414,23 @@ return true; } - utrace->stopped = 1; + /* final check: it is really needed to stop? 
*/ + list_for_each_entry_safe(engine, next, &utrace->attached, entry) { + if ((engine->ops != &utrace_detached_ops) && engine_wants_stop(engine)) { + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } + else + utrace->stopped = 1; + } + } + if (unlikely(!utrace->stopped)) { + spin_unlock_irq(&task->sighand->siglock); + spin_unlock(&utrace->lock); + return false; + } + __set_current_state(TASK_TRACED); /* @@ -632,6 +656,7 @@ * to record whether the engine is keeping the target thread stopped. */ #define ENGINE_STOP(1UL << _UTRACE_NEVENTS) +#define ENGINE_RESUME (1UL << (_UTRACE_NEVENTS+1)) static void mark_engine_wants_stop(struct utrace_engine *engine) { @@ -648,6 +673,21 @@ return (engine->flags & ENGINE_STOP) != 0; } +static void mark_engine_wants_resume(struct utrace_engine *engine) +{ + engine->flags |= ENGINE_RESUME; +} + +static void clear_engine_wants_resume(struct utrace_engine *engine) +{ + engine->flags &= ~ENGINE_RESUME; +} + +static bool engine_wants_resume(struct utrace_engine *engine) +{ + return (engine->flags & ENGINE_RESUME) != 0; +} + /** * utrace_set_events - choose which event reports a tracing engine gets * @target:thread to affect @@ -906,6 +946,10 @@ list_move(&engine->entry, &detached); } else { flags |= engine->flags | UTRACE_EVENT(REAP); + if (engine_wants_resume(engine)) { + clear_engine_wants_stop(engine); + clear_engine_wants_resume(engine); + } wake = wake && !engine_wants_stop(engine); } } @@ -1133,6 +1177,7 @@ * There might not be another report before it just * resumes, so make sure single-step is not left set. */ + mark_engine_wants_resume(engine); if (likely(resume)) user_disable_single_step(target); break;
[PATCH 2/2] UTRACE_STOP: nesting engine management (updated)
Dear Roland, dear utrace developers,

I have also updated the second patch. Please note that this patch must now be applied after the first one. This patch implements a consistent nesting model for utrace machines. (There is a full description in the messages I sent on Feb. 14 and Mar. 6.)

renzo

--- diff -Naur linux-2.6.29-rc7-git5-utrace-p1/kernel/utrace.c linux-2.6.29-rc7-git5-utrace-p2/kernel/utrace.c --- linux-2.6.29-rc7-git5-utrace-p1/kernel/utrace.c 2009-03-12 11:05:50.0 +0100 +++ linux-2.6.29-rc7-git5-utrace-p2/kernel/utrace.c 2009-03-12 13:37:27.0 +0100 @@ -1405,6 +1405,7 @@ static bool finish_callback(struct utrace *utrace, struct utrace_report *report, struct utrace_engine *engine, + struct task_struct *task, u32 ret) { enum utrace_resume_action action = utrace_resume_action(ret); @@ -1426,6 +1427,7 @@ spin_lock(&utrace->lock); mark_engine_wants_stop(engine); spin_unlock(&utrace->lock); + utrace_stop(task, utrace); } } else if (engine_wants_stop(engine)) { spin_lock(&utrace->lock); @@ -1492,7 +1494,7 @@ ops = engine->ops; if (want & UTRACE_EVENT(QUIESCE)) { - if (finish_callback(utrace, report, engine, + if (finish_callback(utrace, report, engine, task, (*ops->report_quiesce)(report->action, engine, task, event))) @@ -1526,24 +1528,24 @@ * @callback is the name of the member in the ops vector, and remaining * args are the extras it takes after the standard three args. */ -#define REPORT(task, utrace, report, event, callback, ...) \ +#define REPORT(reverse, task, utrace, report, event, callback, ...) \ do { \ start_report(utrace); \ - REPORT_CALLBACKS(task, utrace, report, event, callback, \ + REPORT_CALLBACKS(reverse, task, utrace, report, event, callback, \ (report)->action, engine, current, \ ## __VA_ARGS__); \ finish_report(report, task, utrace); \ } while (0) -#define REPORT_CALLBACKS(task, utrace, report, event, callback, ...) \ +#define REPORT_CALLBACKS(reverse, task, utrace, report, event, callback, ...) 
\ do { \ struct utrace_engine *engine; \ const struct utrace_engine_ops *ops; \ - list_for_each_entry(engine, &utrace->attached, entry) { \ + list_for_each_entry ## reverse(engine, &utrace->attached, entry) {\ ops = start_callback(utrace, report, engine, task,\ event); \ if (!ops) \ continue; \ - finish_callback(utrace, report, engine, \ + finish_callback(utrace, report, engine, task, \ (*ops->callback)(__VA_ARGS__)); \ } \ } while (0) @@ -1558,7 +1560,7 @@ struct utrace *utrace = task_utrace_struct(task); INIT_REPORT(report); - REPORT(task, utrace, &report, UTRACE_EVENT(EXEC), + REPORT(, task, utrace, &report, UTRACE_EVENT(EXEC), report_exec, fmt, bprm, regs); } @@ -1573,7 +1575,7 @@ INIT_REPORT(report); start_report(utrace); - REPORT_CALLBACKS(task, utrace, &report, UTRACE_EVENT(SYSCALL_ENTRY), + REPORT_CALLBACKS(_reverse, task, utrace, &report, UTRACE_EVENT(SYSCALL_ENTRY), report_syscall_entry, report.result | report.action, engine, current, regs); finish_report(&report, task, utrace); @@ -1615,7 +1617,7 @@ struct utrace *utrace = task_utrace_struct(task); INIT_REPORT(report); - REPORT(task, utrace, &report, UTRACE_EVENT(SYSCALL_EXIT), + REPORT(, task, utrace, &report, UTRACE_EVENT(SYSCALL_EXIT), report_syscall_exit, regs); } @@ -1640,7 +1642,7 @@ start_re
Re: [PATCH 2/2] UTRACE_STOP: nesting engine management (updated)
> Again, we need Roland's opinion, but could you explain why it would
> be better to use _reverse in utrace_report_syscall_entry() ?

I refer to this posting: http://www.mail-archive.com/utrace-devel@redhat.com/msg00579.html
Item #4 explains why it is *needed* to reverse the order in utrace_report_syscall_entry to have a consistent implementation of nested virtualization.

> I don't think this is safe. If we do utrace_stop() here, the next engine
> can be detached before we return (UTRACE_DETACH assumes it is safe to
> unlink the engine when the target is stopped). This means we can't
> continue list_for_each_entry(engine, &utrace->attached, entry) after
> return from finish_callback().

Maybe this is not the best patch; maybe we can solve the problem in a better way. The point is explained in item #3 of the posting cited above. When the report function of an engine returns UTRACE_STOP, it means (or may mean) that the engine wants to change the state of the process before resuming it. VM monitors often change the state, and sometimes debugger users want to set some variables too.

IMHO, utrace should stop the process *before* calling the report function of the next engine; otherwise we need to set up another structure to synchronize the engines (which may not even know about each other). If there is a tracer/debugger among the engines, it is not even possible to know which snapshot it gets: the one before or the one after the modification made by the VM monitor?

With these patches it is possible to run nested virtual machines based on utrace. It is also possible to run strace (i.e. use ptrace) on processes running inside a VM.

renzo
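The effect of the traversal order argued for in this message can be illustrated with a tiny userspace model. This is a sketch under stated assumptions, not utrace code: the `model_*` names are hypothetical, the VM engine is assumed to have attached before the tracer, and the VM engine simply replaces the syscall number with -1 (the way kmview is said to skip a call). The function returns the syscall number the tracer observes at syscall entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the registers an engine may inspect or change. */
struct model_regs {
	long syscall_nr;
};

typedef void (*model_entry_fn)(struct model_regs *regs, long *seen);

/* VM engine: virtualizes the call by asking the kernel to skip it. */
static void model_vm_entry(struct model_regs *regs, long *seen)
{
	(void)seen;
	regs->syscall_nr = -1;
}

/* Tracer engine: records the syscall number it observes. */
static void model_tracer_entry(struct model_regs *regs, long *seen)
{
	*seen = regs->syscall_nr;
}

/* Runs the entry callbacks in attach order, or in reverse attach order,
 * and returns what the tracer saw. */
static long model_report_syscall_entry(long nr, bool reverse)
{
	/* Attach order: the VM first, the tracer (started inside the VM) last. */
	static const model_entry_fn attached[] = {
		model_vm_entry,
		model_tracer_entry,
	};
	const int n = (int)(sizeof(attached) / sizeof(attached[0]));
	struct model_regs regs = { nr };
	long seen = 0;

	assert(nr >= 0);
	for (int i = 0; i < n; i++)
		attached[reverse ? n - 1 - i : i](&regs, &seen);
	return seen;
}
```

With `reverse` set to false the tracer sees the already modified call (-1); with `reverse` set to true it sees the number the traced process actually issued.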
Re: [PATCH 2/3] utrace core
Tracing does not mean only debugging. Some tracing facilities can be used for virtualization; for example, User-Mode Linux is based on ptrace.

I have a prototype of a kernel module for virtualization (kmview) based on utrace. Using kmview (module + VMM), a user (not root) can mount a filesystem just for one process (or a hierarchy of processes), and some processes can use different networking stacks or virtual devices. It is something like user-mode containers. kmview provides the same features as umview, which is based on ptrace, but much faster. (umview is in Debian lenny, squeeze and sid if you want to test it.)

*Utrace is really what I wanted* to support kmview (apart from some minor issues about the support of nested virtualization). Other virtualizations now based on ptrace could move part of their implementation to kernel level through utrace, and several speedups become possible.

For example, kmview is a partial virtual machine monitor: some system calls are forwarded to the kernel, others are virtualized. When a user mounts a filesystem, all the system calls that use pathnames inside the mountpoint subtree get virtualized, while the others are forwarded to the kernel. With utrace, the kmview kernel module handles many system calls at kernel level. I mean, if an "open" system call was sent to the kernel because the path is outside the virtualized part of the file system, all the subsequent system calls on the same file descriptor can be forwarded to the kernel without any request to the VMM at user level. This is just one example of a speedup; several others are possible. Other virtualizations like User-Mode Linux or fakeroot-ng could use utrace to speed up their virtualization, too.

As far as I have seen, systemtap is a wonderful tool for debugging, especially for kernel debugging, but it has not been designed for virtualization. Ptrace provides a standard set of features, and all the implementations of VMMs must be in userland.
Utrace provides the flexibility to split a VMM and move part of it into a kernel module. Utrace provides a unified interface to kernel modules for tracing/virtualization. kmview can be implemented either as a client of utrace or by spreading code around the kernel; like kmview, other virtualizations based on ptrace may need to move some of their logic into the kernel to speed up their execution. With utrace, these VMMs can use utrace-based modules instead of kernel patches.

renzo

On Sat, Mar 21, 2009 at 01:49:09AM -0700, Andrew Morton wrote:
> I'd be interested in seeing a bit of discussion regarding the overall value
> of utrace - it has been quite a while since it floated past.
>
> I assume that redoing ptrace to be a client of utrace _will_ happen, and
> that this is merely a cleanup exercise with no new user-visible features?
>
> The "prototype utrace-ftrace interface" seems to be more a cool toy rather
> than a serious new kernel feature (yes?)
>
> If so, what are the new killer utrace clients which would justify all these
> changes?
>
> Also, is it still the case that RH are shipping utrace? If so, for what
> reasons and what benefits are users seeing from it?
>
> And I recall that there were real problems wiring up the Feb 2007 version
> of utrace to the ARM architecture. Have those issues been resolved? Are
> any problems expected for any architectures?
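The file-descriptor fast path described in this message (once an "open" is known to fall outside the virtualized subtree, later calls on that descriptor need no round trip to the user-level VMM) could be modelled roughly as follows. This is a hedged userspace sketch, not kmview's actual data structure: the `model_*` names, the 64-descriptor bitmap and the prefix test are all assumptions made for illustration, and the mountpoint is assumed to carry no trailing slash.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_FDS 64

/* Hypothetical per-task state kept by the kernel-side part of a VMM. */
struct model_task {
	unsigned long long virt_fds;	/* bit n set: fd n is handled by the VMM */
	const char *mountpoint;		/* virtualized subtree, no trailing '/' */
};

/* True if path is the mountpoint itself or lies below it. */
static bool model_path_is_virtual(const struct model_task *t, const char *path)
{
	size_t len = strlen(t->mountpoint);

	return strncmp(path, t->mountpoint, len) == 0 &&
	       (path[len] == '/' || path[len] == '\0');
}

/* Record, once per open(), whether the new descriptor is virtualized. */
static void model_note_open(struct model_task *t, const char *path, int fd)
{
	assert(fd >= 0 && fd < MODEL_MAX_FDS);
	if (model_path_is_virtual(t, path))
		t->virt_fds |= 1ULL << fd;
	else
		t->virt_fds &= ~(1ULL << fd);
}

/* Forget a descriptor when it is closed. */
static void model_note_close(struct model_task *t, int fd)
{
	assert(fd >= 0 && fd < MODEL_MAX_FDS);
	t->virt_fds &= ~(1ULL << fd);
}

/* Fast path for read/write/etc.: true means the user-level VMM must be
 * asked, false means the call can go straight to the kernel. */
static bool model_fd_needs_vmm(const struct model_task *t, int fd)
{
	assert(fd >= 0 && fd < MODEL_MAX_FDS);
	return ((t->virt_fds >> fd) & 1ULL) != 0;
}
```

The point of the design is that the path comparison happens once, at open time; every later call on the descriptor costs only a bit test inside the kernel module.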
Re: [PATCH 2/3] utrace core
On Sat, Mar 21, 2009 at 03:34:57PM +0100, Ingo Molnar wrote:
>
> * Renzo Davoli wrote:
>
> > Tracing does not mean only debug. Some tracing facilities can be
> > used for virtualization. For example User-Mode Linux is based on
> > ptrace.
> >
> > I have a prototype of kernel module for virtualization (kmview)
> > based on utrace. [...]
>
> Hm, i cannot find the source code. Can it be downloaded from
> somewhere?

Sure! kmview is not included in our Debian packages yet, as it relies on features that are not (yet) in mainline (utrace), but the code is available in our view-os svn repository. Check out:

svn co https://view-os.svn.sourceforge.net/svnroot/view-os view-os

More specifically, to browse the code and specifications:

The kmview device protocol is here:
http://wiki.virtualsquare.org/index.php/KMview_module_interface_specifications

The kernel module itself is here:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/kmview-kernel-module/

The VMM userland application shares most of its code with umview; the source code for both is here:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/xmview-os/xmview/

The kmview kernel module (current version) needs the following utrace patches:
http://www.mail-archive.com/utrace-devel@redhat.com/msg00654.html
http://www.mail-archive.com/utrace-devel@redhat.com/msg00655.html

I am trying to keep everything up to date, but the whole thing is evolving quite fast. Everything has been released under GPLv2.

renzo
utrace-kmview contract
Dear Roland,

You are right when you say that the interface specification is a contract between utrace and the module writers. My goal is to use utrace for my virtual machines; your goal is to design utrace as a support for a wide range of applications. I hope your "wide range of applications" will include kmview.

In my perception, utrace's support for multiple engines needs further investigation. I do not insist that my patches enter the utrace code, provided there is another fast, clean and easy-to-code way to reach the same results. This is not about kmview alone: I think it is an example of a whole range of virtualization applications based on utrace.

When utrace is used for debugging, the "the faster, the better" invariant holds, but when you are dealing with virtualization the rule changes to "the slower, the more useless!". Debugging is a temporary state of an application, while virtualization must be designed to be used as a standard environment.

Sometimes a picture is worth a thousand words: http://www.cs.unibo.it/~renzo/4roland20090323.pdf
I have drawn some examples. This is actually a simplified view, just to show the problems.

The module unreal is a test module for kmview that virtualizes the /unreal subtree as a "copy" of the file system ("/unreal/x/y/z" is the file "/x/y/z"). I know that such a simple transformation could have been implemented directly inside the report_syscall function, but kmview is a general support for virtualization; unreal is just a simple test for it. kmview is composed of a kernel module and the "agent" in user space.

In the first slide, a user runs kmview and, inside the VM, loads the unreal module and runs a cat command. When cat tries to open "/unreal/etc/passwd", unreal rewrites the path to /etc/passwd: the kernel runs an "open" system call, but with modified arguments. The report_syscall_entry routine must send the path to kmview in userland and wait for the answer. The numbers on the arrows show the sequence of actions.
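The path mapping performed by the unreal test module, as described above, amounts to stripping the "/unreal" prefix. A minimal userspace sketch follows; the function name, return codes and buffer handling are assumptions for illustration, not the module's real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_UNREAL_PREFIX "/unreal"

/*
 * Hypothetical helper: maps "/unreal/x/y/z" to "/x/y/z".
 * Returns 1 if the path was rewritten into out, 0 if the path lies outside
 * the /unreal subtree (out is left untouched), -1 if out is too small.
 */
static int model_unreal_rewrite(const char *path, char *out, size_t outlen)
{
	const size_t plen = sizeof(MODEL_UNREAL_PREFIX) - 1;
	const char *rest;
	size_t need;

	assert(path != NULL && out != NULL);
	if (strncmp(path, MODEL_UNREAL_PREFIX, plen) != 0)
		return 0;

	rest = path + plen;
	if (*rest == '\0')
		rest = "/";		/* "/unreal" itself maps to "/" */
	else if (*rest != '/')
		return 0;		/* e.g. "/unrealistic" is not in the subtree */

	need = strlen(rest) + 1;	/* include the terminating NUL */
	if (need > outlen)
		return -1;
	memcpy(out, rest, need);
	return 1;
}
```

For the example in the slides, `model_unreal_rewrite("/unreal/etc/passwd", buf, sizeof(buf))` leaves "/etc/passwd" in `buf`, while a path such as "/etc/passwd" is reported as outside the subtree and would be forwarded to the kernel unchanged.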
The second slide shows a tracing/debugging tool used together with virtualization. This is an example of multiple engines working on the same process. For report_syscall_entry, strace must read its data before the virtualization takes place. On the contrary, the return value shown by strace must be the one returned by the kmview virtualization engine; thus the order for report_syscall_entry is the reverse of the one used by report_syscall_exit.

Note that if, instead of "strace cat /unreal/etc/passwd", our user had typed "strace -f -o /tmp/xxx kmview bash" as the first command, the order of the engines would have been inverted. In that case strace should show the system calls as they appear "outside the virtualization", as one would expect from the command.

The third slide shows a nested virtualization, and the fourth a debugging tool running inside a nested virtualization. In all these examples I would use UTRACE_STOP.

Now let us discuss the details of the contract ;-)

I have set up two different implementations of the kmview kernel module. In the standard one (#undef KMVIEW_NEWSTOP) the report_syscall function returns UTRACE_STOP and waits for the answer from the kmview application. The new one (#define KMVIEW_NEWSTOP) uses a semaphore to stop the execution inside the report_syscall function, which always returns UTRACE_RESUME.

If you decide that the right implementation is the former (#undef KMVIEW_NEWSTOP):
- Please tell me how to implement the example on page 3 if the handling of syscall_entry for kmview2 does not stop before kmview1 is called. Okay, you say that kmview1's module learns that another engine wants to stop by reading its @action argument, but it needs the state as modified by kmview2.
- I could set up some kind of synchronization among kmview machines, but the solution would be extremely weak. What if kmview runs nested with another virtualization/tracing application based on utrace, e.g. strace?
- You say "use UTRACE_REPORT" to wait until the other machines are done fiddling with it.
The comment you wrote about UTRACE_REPORT says:
 * This is like %UTRACE_RESUME, but also ensures that there will be
 * a @report_quiesce or @report_signal callback made soon. If
 * @target had been stopped, then there will be a callback before it
 * resumes running normally. If another engine is keeping @target
 * stopped, then there might be no callbacks until all engines let
 * it resume.
But if kmview1 and kmview2 have both stopped in report_syscall, no callback will be called until both have finished. Otherwise you may mean that kmview1 returns UTRACE_RESUME and, when kmview1's report_quiesce gets called, it returns UTRACE_STOP. In this way the management of the system call would have to be moved from report_syscall_entry to report_quiesce, but just for kmview1. Which one is the cleaner way to implement a service on utrace, in your opinion? In my opinion the possibility to have the process blocked before calling the next report function leads to s
Re: resuming after stop at syscall_entry
> Enter Renzo Davoli.

Here I am! I have spent my time testing the latest version and trying to figure out how to implement "nested Renzo's engines" with the support you propose.

Comments on the latest version of utrace:

- 1- syscall_entry report reversed. Wonderful, thank you. Now kmview.ko runs on vanilla utrace provided KMVIEW_NEWSTOP is defined. KMVIEW_NEWSTOP stops the process inside the syscall report function, so it is an undesirable workaround, not a solution. Anyway, this can be used as a proof of concept: the problem related to the order of callbacks for syscall_entry is solved.

- 2- utrace_control(.., UTRACE_RESUME) can arrive too early, before ENGINE_STOP is set (in engine->flags by mark_engine_wants_stop). Let us name p the traced process and vm the tracer.
t=10: p reports a system call. During the report function, p communicates with vm; the report function returns UTRACE_STOP. utrace is unlocked during the report function.
t=20: p records its need to stop: (lock) engine->flags |= ENGINE_STOP; (unlock)
Later (at time t' > 10) vm calls utrace_control(p, engine, UTRACE_RESUME). If t' < 20 the request gets lost! In fact:
t=15: utrace_control gets the lock; resume = utrace->stopped IS ZERO! clear_engine_wants_stop clears ENGINE_STOP, which has not been set yet.
At t=20 ENGINE_STOP is set and the task blocks.

There are two "clean", "non-baroque" approaches to solve this problem:
2A- interface approach: a long time ago utrace had a utrace_set_flags call, which could set the ENGINE_STOP flag before p communicates with vm. In this way ENGINE_STOP will always be cleared after it has been set.
2B- implementation approach: use two bits, ENGINE_STOP and ENGINE_RESUME. Before t=10, ENGINE_STOP and ENGINE_RESUME are unset. utrace_control(p, engine, UTRACE_RESUME) must set ENGINE_RESUME and clear ENGINE_STOP. At t=20, p can check whether there has been a fast resume request; in that case ENGINE_STOP is not set.

It is possible to create other workarounds, barriers, fake reports, busy wait loops...
If we want something effective, we must implement solutions, not workarounds. If an engine says UTRACE_STOP and later UTRACE_RESUME, the task must be resumed. The simpler, the better.

My patch in:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/kmview-kernel-module/kernel_patches/linux-2.6.29-patch1?revision=637&view=markup
implements 2B and works with the latest utrace implementation.

--

Comments on the proposal.

Roland, let me say frankly that the repeated report scan for system calls is just a step towards a solution, but I do not like it so much.

Problem #1: when each engine receives the same syscall_entry report several times, each engine must discover whether:
- a previous engine has already stopped this task (utrace_resume_action(action) == UTRACE_STOP)
- this is a repeated scan and the current engine has already processed this report (there is the risk of processing it twice)
- this is a real new report

Maybe I can keep the address of the engine which stopped the task somewhere (say in a task-private variable stopengine). During the repeated scan:
- if stopengine is NULL, this is a fresh call
- else (stopengine != NULL) the current engine has already processed this report
  - if stopengine == this engine, then set stopengine to NULL

A more portable approach follows (*). Each engine records whether it stopped the task. During the repeated scan:
- if !(action & UTRACE_SYSCALL_RESUMED), this is a fresh call
- else the current engine has already processed this report
  - if this engine stopped the task, then clear UTRACE_SYSCALL_RESUMED in the action returned

This is not a nice solution: this "protocol" must be consistently applied by all the modules using utrace, otherwise they cannot interoperate. If a report_syscall_entry does not behave in the same way, it may receive repeated reports or force other engines to skip some reports. All the programmers of utrace modules would always have to agree on these details: not a good interface for long-term interoperability.
Problem #2: syscall exit may need to modify the return value/errno. The need for stop&go at each engine applies not only to syscall_entry.

I really do not understand why it is so unacceptable to have a UTRACE_STOP_NOW tag to stop a process *before* reporting to the next engine. The interface would be clean, and interoperability between tracing and virtualizing guaranteed. It is not a matter of performance. If your engine needs to see the system call that is going to be done by the kernel, as you say:
    if (utrace_resume_action(action) == UTRACE_STOP)
        return UTRACE_REPORT;
it has to wait all the virtualize
Bug: report_reap is never called
Hi Roland & utrace developers,

I am back... I am upgrading my kmview support and I have stumbled on a clear bug in utrace_reap:

--
    list_for_each_entry_safe(engine, next, &utrace->attached, entry) {
        ops = engine->ops;
        engine->ops = NULL;
        engine->flags = 0;
        list_move(&engine->entry, &detached);

        /*
         * If it didn't need a callback, we don't need to drop
         * the lock. Now nothing else refers to this engine.
         */
        if (!(engine->flags & UTRACE_EVENT(REAP)))
            continue;

The code following this 'if' is never executed (i.e. the reap callback is never called). In fact it is impossible for (engine->flags & UTRACE_EVENT(REAP)) to be true, given that a few statements above engine->flags has been set to 0!

To fix the bug, clear all the events but reap:
    engine->flags &= UTRACE_EVENT(REAP);
or save the flags in a temporary variable before clearing them, as you do for engine->ops.

ciao
renzo
Re: linux-next: add utrace tree
Let me add my two euro-cents to this discussion.

Mark Wielaard:
> Unfortunately ptrace does all that magic already (badly). People don't
> just use it for (s)tracing syscalls, but also for tracing signals, for
> single step debugging and poking at memory, register state, for process
> jailing and virtualization (uml) through syscall emulation.
> So when they are talking about these fancy things that is because that
> is what ptrace gives them currently. And they hate it, because the
> ptrace interface is such a pain to work with. And all these things don't
> really work together. You cannot trace, emulate, debug, jail at the same
> time.

I support Mark's words. I don't use ptrace for debugging/tracing and I have experienced severe limitations of the ptrace interface. (I have tried to post some extensions for ptrace to overcome some constraints; see my posts on ptrace_vm or ptrace_multi on LKML.)

Oleg Nesterov, writing to Andrew Morton, said:
> First of all, utrace makes other things possible. gdbstub,
> nondestructive core dump, uprobes, kmview, hopefully more. I didn't
> look at these projects closely, perhaps other people can tell more. As
> for their merge status, until utrace itself is merged it is very hard to
> develop them out of tree.

In the list above there is also kmview, which is a creature of mine.

umview and kmview are partial virtual machines: processes running in a [uk]mview machine can have their own view of the file system, networking support, user-id, system-name, etc. A [uk]mview machine virtualizes just what the user needs: the filesystem or just a subtree/some subtrees, or networking, or one/some virtual devices, etc. The "view" provided by a [uk]mview machine can be a composition of real resources (provided by the Linux kernel) and virtual resources. Each system call request gets hijacked to a module of [uk]mview when it refers to a virtual resource. The request is forwarded to the kernel otherwise.
umview is based on ptrace; kmview uses a kernel module based on utrace. (umview is included in debian lenny (to sid); tutorial and manuals are in wiki.virtualsquare.org.)

IMHO utrace is better than ptrace (or an optimized version of it):

1 - "Frank Ch. Eigler" wrote:
> At least one reason is that ptrace is single-usage-only, so for
> example you cannot concurrently debug & strace the same program.
Exactly. utrace allows multiple tracing engines; this means that kmview machines can be nested (in a natural way, no extra code is needed for this feature). In the same way strace/gdb can run on virtualized processes, too.

2 - The kmview kernel module implements several optimizations to minimize the number of requests forwarded to the kmview process (the virtual machine monitor). kmview is just a module using the utrace interface, whereas prior attempts at an optimized umview required kernel patches. Like kmview, any other service requiring process tracing can include specific optimizations in its own kernel module. On the other hand, all these services could use the standardized utrace interface for their optimizations, instead of asking for messy patches to change code all around the kernel source.

3 - ptrace takes SIGSTOP/SIGCONT for its own management. Strace/gdb and umview cannot be transparent for programs using these signals.

Oleg Nesterov, talking about ptrace, said:
> Of course they can't use other interfaces, we don't have them. And
> without the new abstraction layer we will never have, I think.

I agree.

The following list includes the execution times I got in a recent test (make vde-2, see http://www.cs.unibo.it/~renzo/view-os-lk2009.pdf):
- plain kernel: 22.7s
- kmview (no modules): 23.9s (+5.5%)
- full kmview (modules loaded, all syscalls virtualized): 38.5s (+70%)
- optimized umview: 51.0s (+124%)
- umview on vanilla kernel: 75.7s (+233%)

utrace can be used to speed up virtualization (at least in my case it worked in this way).
Performance can be useful for debugging but it is a main issue for virtualization. The kmview module provides optimizations to select the system call requests depending on the syscall number, the pathnames or the file descriptors:
http://wiki.virtualsquare.org/index.php/KMview_module_interface_specifications

Trying to add all the optimizations needed by different projects to ptrace is a never-ending nightmare: the LKML will continue to receive patch proposals for ptrace... The solution is that everybody can code his/her optimized kernel/user interface for tracing in his/her kernel module, i.e. utrace.

renzo
Re: Tracing with utrace, some questions
On Mon, Oct 11, 2010 at 10:19:40AM +0300, Ali Polatel wrote:
> > Renzo Davoli's umview/kmview is just such an animal.
> > See http://wiki.virtualsquare.org for details.
> Looks like really nice example for me! I'm reading it now :)

And I am here listening on this ML if you need further info on it.

ciao
renzo
Call for utrace survival
My project kmview is based on utrace. Utrace is a wonderful tool to support partial virtualization; I have found no other tool that provides kernel modules with a tracing interface for user processes:
- systemtap, dtrace: are mainly for kernel debugging
- LTTng: creates traces for off-line debugging (ust needs the program to be compiled for tracing; it does not work on existing binaries).

utrace can be a fast and smart replacement for the old/slow ptrace. Now the kernel patches are getting obsolete...

This is a call for utrace users and developers to see whether there are enough (human) resources to continue the project. I am considering forking the subset of the project needed by kmview, but if there are enough other projects and developers interested in utrace's survival, we can work together.

renzo davoli
virtualsquare labs
University of Bologna
Utrace for 2.6.39.1 on View-OS/VirtualSquare
I need utrace for kmview, so I have updated the utrace support for 2.6.39.1.

The code is here:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/utrace/

It seems to work. I have tested kmview on it.

renzo
Re: [RFC v2 00/19] utrace for 3.0 kernel
On Mon, Jul 11, 2011 at 06:19:33PM -0700, Josh Stone wrote:
> On 06/30/2011 05:20 PM, Oleg Nesterov wrote:
> > TODO:
> >
> > - Testing.
>
> I ran the whole systemtap testsuite with a kernel built from your git
> tree, and did not see any utrace-specific issues. Thanks!

I have got the git tree, too. I can confirm that my kmview also works on this version of utrace. Thank you Oleg.

renzo