[RFC][PATCH] kvm: fix a race when closing irq eventfd
While trying to fix a race when closing cgroup eventfd, I took a look at
how kvm deals with this problem, and I found it doesn't.

I may be wrong, as I don't know kvm code, so correct me if I'm wrong.

	/*
	 * Race-free decouple logic (ordering is critical)
	 */
	static void
	irqfd_shutdown(struct work_struct *work)

I don't think it's race-free!

	static int
	irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
	{
		...
		 * We cannot race against the irqfd going away since the
		 * other side is required to acquire wqh->lock, which we hold
		 */
		if (irqfd_is_active(irqfd))
			irqfd_deactivate(irqfd);
	}

In kvm_irqfd_deassign() and kvm_irqfd_release() where irqfds are freed,
wqh->lock is not acquired!

So here is the race:

	CPU0					CPU1
	----					----
	kvm_irqfd_release()
	  spin_lock(kvm->irqfds.lock);
	  ...
	  irqfd_deactivate(irqfd);
	    list_del_init(&irqfd->list);
	  spin_unlock(kvm->irqfd.lock);
	  ...
						close(eventfd)
						  irqfd_wakeup();
						  irqfd_shutdown();
						    remove_waitqueue(irqfd->wait);
						    kfree(irqfd);
	  spin_lock(kvm->irqfd.lock);
	  if (!list_empty(&irqfd->list))
		irqfd_deactivate(irqfd);
		  list_del_init(&irqfd->list);
	  spin_unlock(kvm->irqfd.lock);

Look, we're accessing irqfd though it has already been freed!

Here's a fix I have. Please enlighten me with a better fix.

Signed-off-by: Li Zefan
---
 fs/eventfd.c            | 34 ++
 include/linux/eventfd.h | 17 +
 virt/kvm/eventfd.c      | 45 +++--
 3 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..cda8a4c 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -22,21 +22,6 @@
 #include
 #include
 
-struct eventfd_ctx {
-	struct kref kref;
-	wait_queue_head_t wqh;
-	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
-	 */
-	__u64 count;
-	unsigned int flags;
-};
-
 /**
  * eventfd_signal - Adds @n to the eventfd counter.
  * @ctx: [in] Pointer to the eventfd context.
@@ -153,20 +138,29 @@ static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
  * This is used to atomically remove a wait queue entry from the eventfd wait
  * queue head, and read/reset the counter value.
  */
-int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+int eventfd_ctx_remove_wait_queue_locked(struct eventfd_ctx *ctx, wait_queue_t *wait,
 				  __u64 *cnt)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	eventfd_ctx_do_read(ctx, cnt);
 	__remove_wait_queue(&ctx->wqh, wait);
 	if (*cnt != 0 && waitqueue_active(&ctx->wqh))
 		wake_up_locked_poll(&ctx->wqh, POLLOUT);
-	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return *cnt != 0 ? 0 : -EAGAIN;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue_locked);
+
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+				  __u64 *cnt)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&ctx->wqh.lock, flags);
+	ret = eventfd_ctx_remove_wait_queue_locked(ctx, wait, cnt);
+	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+
+	return ret;
+}
 EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
 
 /**
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..61085ac 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -26,6 +26,21 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+struct eventfd_ctx {
+	struct kref kref;
+	wait_queue_head_t wqh;
+	/*
+	 * Every time that a write(2) is performed on an eventfd, the
+	 * value of the __u64 being written is added to "count"
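The refactoring in the patch above follows a common pattern: split a function into a `_locked` core that assumes the caller already holds the lock, plus a wrapper with the original name that takes and drops the lock itself. A minimal userspace sketch of that pattern, using a pthread mutex as a stand-in for `wqh.lock` (the `struct ctx` and `-1`-for-`-EAGAIN` convention are illustrative assumptions, not the real eventfd API):

```c
#include <assert.h>
#include <pthread.h>

/* Userspace analogue of the eventfd_ctx split: the _locked variant
 * requires the caller to hold the lock; the wrapper preserves the old
 * locking behaviour for existing callers. */
struct ctx {
	pthread_mutex_t lock;
	unsigned long long count;
};

/* Core logic; caller must hold c->lock. */
static int ctx_consume_locked(struct ctx *c, unsigned long long *cnt)
{
	*cnt = c->count;
	c->count = 0;
	return *cnt != 0 ? 0 : -1;	/* -1 stands in for -EAGAIN */
}

/* Thin wrapper: take the lock, call the core, drop the lock. */
static int ctx_consume(struct ctx *c, unsigned long long *cnt)
{
	int ret;

	pthread_mutex_lock(&c->lock);
	ret = ctx_consume_locked(c, cnt);
	pthread_mutex_unlock(&c->lock);
	return ret;
}
```

The point of the split is that a caller who already holds the lock (as kvm would in the proposed fix) can call the `_locked` variant without self-deadlocking, while all existing callers keep using the wrapper unchanged.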
Re: [RFC][PATCH] kvm: fix a race when closing irq eventfd
On 2013/2/18 12:02, Alex Williamson wrote:
> On Mon, 2013-02-18 at 11:13 +0800, Li Zefan wrote:
>> While trying to fix a race when closing cgroup eventfd, I took a look
>> at how kvm deals with this problem, and I found it doesn't.
>>
>> I may be wrong, as I don't know kvm code, so correct me if I'm wrong.
>>
>> 	/*
>> 	 * Race-free decouple logic (ordering is critical)
>> 	 */
>> 	static void
>> 	irqfd_shutdown(struct work_struct *work)
>>
>> I don't think it's race-free!
>>
>> 	static int
>> 	irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> 	{
>> 		...
>> 		 * We cannot race against the irqfd going away since the
>> 		 * other side is required to acquire wqh->lock, which we hold
>> 		 */
>> 		if (irqfd_is_active(irqfd))
>> 			irqfd_deactivate(irqfd);
>> 	}
>>
>> In kvm_irqfd_deassign() and kvm_irqfd_release() where irqfds are freed,
>> wqh->lock is not acquired!
>>
>> So here is the race:
>>
>> 	CPU0					CPU1
>> 	----					----
>> 	kvm_irqfd_release()
>> 	  spin_lock(kvm->irqfds.lock);
>> 	  ...
>> 	  irqfd_deactivate(irqfd);
>> 	    list_del_init(&irqfd->list);
>> 	  spin_unlock(kvm->irqfd.lock);
>> 	  ...
>> 						close(eventfd)
>> 						  irqfd_wakeup();
>
> irqfd_wakeup is assumed to be called with wqh->lock held
>

I'm aware of this.

As I said, kvm_irqfd_deassign() and kvm_irqfd_release() are not acquiring
wqh->lock.

>> 						  irqfd_shutdown();
>
> eventfd_ctx_remove_wait_queue has to acquire wqh->lock to complete or
> else irqfd_shutdown never makes it to the kfree. So in your scenario
> this cpu0 spins here until cpu1 completes.
>
>> 						    remove_waitqueue(irqfd->wait);
>> 						    kfree(irqfd);
>> 	  spin_lock(kvm->irqfd.lock);
>> 	  if (!list_empty(&irqfd->list))
>
> We don't take this branch because we already did list_del_init above,
> which makes irqfd->list empty.
>

It doesn't matter if the list is empty or not.

The point is, irqfd has been kfreed, so the if statement is simply not safe!

>> 		irqfd_deactivate(irqfd);
>> 		  list_del_init(&irqfd->list);
>> 	  spin_unlock(kvm->irqfd.lock);
>>
>> Look, we're accessing irqfd though it has already been freed!
>
> Unless the irqfd_wakeup path isn't acquiring wqh->lock, it looks
> race-free to me. Thanks,
>
> Alex
>
> .
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] kvm: fix a race when closing irq eventfd
On 2013/2/18 12:09, Li Zefan wrote:
> On 2013/2/18 12:02, Alex Williamson wrote:
>> On Mon, 2013-02-18 at 11:13 +0800, Li Zefan wrote:
>>> While trying to fix a race when closing cgroup eventfd, I took a look
>>> at how kvm deals with this problem, and I found it doesn't.
>>>
>>> I may be wrong, as I don't know kvm code, so correct me if I'm wrong.
>>>
>>> 	/*
>>> 	 * Race-free decouple logic (ordering is critical)
>>> 	 */
>>> 	static void
>>> 	irqfd_shutdown(struct work_struct *work)
>>>
>>> I don't think it's race-free!
>>>
>>> 	static int
>>> 	irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>> 	{
>>> 		...
>>> 		 * We cannot race against the irqfd going away since the
>>> 		 * other side is required to acquire wqh->lock, which we hold
>>> 		 */
>>> 		if (irqfd_is_active(irqfd))
>>> 			irqfd_deactivate(irqfd);
>>> 	}
>>>
>>> In kvm_irqfd_deassign() and kvm_irqfd_release() where irqfds are freed,
>>> wqh->lock is not acquired!
>>>
>>> So here is the race:
>>>
>>> 	CPU0					CPU1
>>> 	----					----
>>> 	kvm_irqfd_release()
>>> 	  spin_lock(kvm->irqfds.lock);
>>> 	  ...
>>> 	  irqfd_deactivate(irqfd);
>>> 	    list_del_init(&irqfd->list);
>>> 	  spin_unlock(kvm->irqfd.lock);
>>> 	  ...
>>> 						close(eventfd)
>>> 						  irqfd_wakeup();
>>
>> irqfd_wakeup is assumed to be called with wqh->lock held
>>
>
> I'm aware of this.
>
> As I said, kvm_irqfd_deassign() and kvm_irqfd_release() are not acquiring
> wqh->lock.
>
>>> 						  irqfd_shutdown();
>>
>> eventfd_ctx_remove_wait_queue has to acquire wqh->lock to complete or
>> else irqfd_shutdown never makes it to the kfree. So in your scenario
>> this cpu0 spins here until cpu1 completes.
>>

Oh you're right, this is not obvious. Thanks for the explanation.

Now I'll go to see how to fix cgroup.

>>> 						    remove_waitqueue(irqfd->wait);
>>> 						    kfree(irqfd);
>>> 	  spin_lock(kvm->irqfd.lock);
>>> 	  if (!list_empty(&irqfd->list))
>>
>> We don't take this branch because we already did list_del_init above,
>> which makes irqfd->list empty.
>>
>
> It doesn't matter if the list is empty or not.
>
> The point is, irqfd has been kfreed, so the if statement is simply not safe!
>
>>> 		irqfd_deactivate(irqfd);
>>> 		  list_del_init(&irqfd->list);
>>> 	  spin_unlock(kvm->irqfd.lock);
>>>
>>> Look, we're accessing irqfd though it has already been freed!
>>
>> Unless the irqfd_wakeup path isn't acquiring wqh->lock, it looks
>> race-free to me. Thanks,
>>
>> Alex
>>
>> .
>>
Re: [PATCH -tip -v11 11/11] tracing: Add kprobes event profiling interface
> +Event Profiling
> +---------------
> + You can check the total number of probe hits and probe miss-hits via
> +/sys/kernel/debug/tracing/kprobe_profile.
> + The fist column is event name, the second is the number of probe hits,

s/fist/first

> +the third is the number of probe miss-hits.
> +

...

> +/* Probes profiling interfaces */
> +static int profile_seq_show(struct seq_file *m, void *v)
> +{
> +	struct trace_probe *tp = v;
> +
> +	if (tp == NULL)
> +		return 0;
> +

tp will never be NULL, which is guaranteed by seq_file.

> +	seq_printf(m, "%s", tp->call.name);
> +
> +	seq_printf(m, "\t%8lu %8lu\n", tp->nhits,
> +		   probe_is_return(tp) ? tp->rp.kp.nmissed : tp->kp.nmissed);
> +
> +	return 0;
> +}
Re: [PATCH -tip -v11 08/11] tracing: add kprobe-based event tracer
> +static __kprobes unsigned long fetch_memory(struct pt_regs *regs, void *addr)
> +{
> +	unsigned long retval;

need a space after local variable declarations.

> +	if (probe_kernel_address(addr, retval))
> +		return 0;
> +	return retval;
> +}
> +
> +static __kprobes unsigned long fetch_argument(struct pt_regs *regs, void *num)
> +{
> +	return regs_get_argument_nth(regs, (unsigned int)((unsigned long)num));
> +}
> +
> +static __kprobes unsigned long fetch_retvalue(struct pt_regs *regs,
> +					      void *dummy)
> +{
> +	return regs_return_value(regs);
> +}
> +
> +static __kprobes unsigned long fetch_ip(struct pt_regs *regs, void *dummy)
> +{
> +	return instruction_pointer(regs);
> +}
> +
> +/* Memory fetching by symbol */
> +struct symbol_cache {
> +	char *symbol;
> +	long offset;
> +	unsigned long addr;
> +};
> +
> +static unsigned long update_symbol_cache(struct symbol_cache *sc)
> +{
> +	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
> +	if (sc->addr)
> +		sc->addr += sc->offset;
> +	return sc->addr;
> +}
> +
> +static void free_symbol_cache(struct symbol_cache *sc)
> +{
> +	kfree(sc->symbol);
> +	kfree(sc);
> +}
> +
> +static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
> +{
> +	struct symbol_cache *sc;

ditto. and in some other places.

> +	if (!sym || strlen(sym) == 0)
> +		return NULL;
> +	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
> +	if (!sc)
> +		return NULL;
> +
> +	sc->symbol = kstrdup(sym, GFP_KERNEL);
> +	if (!sc->symbol) {
> +		kfree(sc);
> +		return NULL;
> +	}
> +	sc->offset = offset;
> +
> +	update_symbol_cache(sc);
> +	return sc;
> +}

...

> +static int probes_seq_show(struct seq_file *m, void *v)
> +{
> +	struct trace_probe *tp = v;
> +	int i, ret;
> +	char buf[MAX_ARGSTR_LEN + 1];
> +
> +	if (tp == NULL)
> +		return 0;
> +

redundant check. tp won't be NULL.

> +	seq_printf(m, "%c", probe_is_return(tp) ? 'r' : 'p');
> +	if (tp->call.name)
> +		seq_printf(m, ":%s", tp->call.name);
> +
> +	if (tp->symbol)
> +		seq_printf(m, " %s%+ld", probe_symbol(tp), probe_offset(tp));
> +	else
> +		seq_printf(m, " 0x%p", probe_address(tp));
> +
> +	for (i = 0; i < tp->nr_args; i++) {
> +		ret = trace_arg_string(buf, MAX_ARGSTR_LEN, &tp->args[i]);
> +		if (ret) {
> +			pr_warning("Argument%d is too long.\n", i);
> +			break;
> +		}
> +		seq_printf(m, " %s", buf);
> +	}
> +	seq_printf(m, "\n");
> +	return 0;
> +}

...

> +static ssize_t probes_write(struct file *file, const char __user *buffer,
> +			    size_t count, loff_t *ppos)
> +{
> +	char *kbuf, *tmp;
> +	int ret;
> +	size_t done;
> +	size_t size;
> +
> +	if (!count || count < 0)
> +		return 0;

count is unsigned, so won't < 0. Also I don't think you need to treat
(count == 0) specially.

> +
> +	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);

should be kmalloc(WRITE_BUFSIZE+1), or...

> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	ret = done = 0;
> +	do {
> +		size = count - done;
> +		if (size > WRITE_BUFSIZE)
> +			size = WRITE_BUFSIZE;

	if (size >= WRITE_BUFSIZE)
		size = WRITE_BUFSIZE - 1;

> +		if (copy_from_user(kbuf, buffer + done, size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		kbuf[size] = '\0';
> +		tmp = strchr(kbuf, '\n');
> +		if (!tmp) {
> +			pr_warning("Line length is too long: "
> +				   "Should be less than %d.", WRITE_BUFSIZE);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		*tmp = '\0';
> +		size = tmp - kbuf + 1;
> +		done += size;
> +		/* Remove comments */
> +		tmp = strchr(kbuf, '#');
> +		if (tmp)
> +			*tmp = '\0';
> +
> +		ret = command_trace_probe(kbuf);
> +		if (ret)
> +			goto out;
> +
> +	} while (done < count);
> +	ret = done;
> +out:
> +	kfree(kbuf);
> +	return ret;
> +}

...

> +enum print_line_t
> +print_kretprobe_event(struct trace_iterator *iter, int flags)
> +{
> +	struct kretprobe_trace_entry *field;
> +	struct trace_seq *s = &iter->seq;
> +	int i;
> +
> +	trace_assign_type(field, iter->ent);
> +
> +	if (!seq_print_ip_sym(s, field->ret_ip, flags | TRACE_ITER_SYM_OFFSET))
> +		goto partial;
> +

can't we use %pF?

> +	if (!trace_seq_puts(s, " <- "))
> +		goto partial;
> +
> +	if (!seq_print_ip_sym(s, field
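The reviewer's buffer-size comment is worth seeing concretely: with a `WRITE_BUFSIZE`-byte buffer, writing the terminating NUL at `kbuf[size]` when `size == WRITE_BUFSIZE` lands one byte past the end. A minimal userspace sketch of the suggested fix (`size >= WRITE_BUFSIZE` capped to `WRITE_BUFSIZE - 1`); `next_chunk` and the tiny buffer size are illustrative, and `memcpy` stands in for `copy_from_user()`:

```c
#include <assert.h>
#include <string.h>

#define WRITE_BUFSIZE 8	/* tiny value for illustration; the real one is larger */

/* Copy one bounded chunk of the input into kbuf, NUL-terminated.
 * Capping at WRITE_BUFSIZE - 1 keeps kbuf[size] inside the buffer. */
static size_t next_chunk(char *kbuf, const char *src, size_t count, size_t done)
{
	size_t size = count - done;

	if (size >= WRITE_BUFSIZE)
		size = WRITE_BUFSIZE - 1;	/* leave room for the NUL */
	memcpy(kbuf, src + done, size);		/* stands in for copy_from_user() */
	kbuf[size] = '\0';			/* always in bounds now */
	return size;
}
```

With the original `size > WRITE_BUFSIZE` check, a full-size chunk would place the NUL at index `WRITE_BUFSIZE`, i.e. one byte past the allocation.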
Re: [PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup
04:24, Tejun Heo wrote:
> From: Sridhar Samudrala
>
> Add a new kernel API to attach a task to current task's cgroup
> in all the active hierarchies.
>
> Signed-off-by: Sridhar Samudrala

Acked-by: Li Zefan

btw: you lost the reviewed-by tag given by Paul Menage.
Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers
Tejun Heo wrote:
> Apply the cpumask and cgroup of the initializing task to the created
> vhost poller.
>
> Based on Sridhar Samudrala's patch.
>
> Cc: Michael S. Tsirkin
> Cc: Sridhar Samudrala
> ---
>  drivers/vhost/vhost.c | 36 +++-
>  1 file changed, 31 insertions(+), 5 deletions(-)
>
> Index: work/drivers/vhost/vhost.c
> ===================================================================
> --- work.orig/drivers/vhost/vhost.c
> +++ work/drivers/vhost/vhost.c
> @@ -23,6 +23,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -176,12 +177,30 @@ repeat:
>  long vhost_dev_init(struct vhost_dev *dev,
>  		    struct vhost_virtqueue *vqs, int nvqs)
>  {
> -	struct task_struct *poller;
> -	int i;
> +	struct task_struct *poller = NULL;
> +	cpumask_var_t mask;
> +	int i, ret = -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		goto out;
>

If we "goto out", we will end up calling kthread_stop(poller), but seems
kthread_stop() requires the task_struct pointer != NULL.

> 	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> -	if (IS_ERR(poller))
> -		return PTR_ERR(poller);
> +	if (IS_ERR(poller)) {
> +		ret = PTR_ERR(poller);
> +		goto out;
> +	}
> +
> +	ret = sched_getaffinity(current->pid, mask);
> +	if (ret)
> +		goto out;
> +
> +	ret = sched_setaffinity(poller->pid, mask);
> +	if (ret)
> +		goto out;
> +
> +	ret = cgroup_attach_task_current_cg(poller);
> +	if (ret)
> +		goto out;
>
>  	dev->vqs = vqs;
>  	dev->nvqs = nvqs;
> @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
>  		vhost_poll_init(&dev->vqs[i].poll,
>  				dev->vqs[i].handle_kick, POLLIN, dev);
>  	}
> -	return 0;
> +
> +	wake_up_process(poller);	/* avoid contributing to loadavg */
> +	ret = 0;
> +out:
> +	if (ret)
> +		kthread_stop(poller);
> +	free_cpumask_var(mask);
> +	return ret;
>  }
>
>  /* Caller should have device mutex */
Re: [PATCH UPDATED 3/3] vhost: apply cpumask and cgroup to vhost pollers
> 	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> -	if (IS_ERR(poller))
> -		return PTR_ERR(poller);
> +	if (IS_ERR(poller)) {
> +		ret = PTR_ERR(poller);
> +		goto out;

here...

> +	}
> +
> +	ret = sched_getaffinity(current->pid, mask);
> +	if (ret)
> +		goto out;
> +
> +	ret = sched_setaffinity(poller->pid, mask);
> +	if (ret)
> +		goto out;
> +
> +	ret = cgroup_attach_task_current_cg(poller);
> +	if (ret)
> +		goto out;
>
>  	dev->vqs = vqs;
>  	dev->nvqs = nvqs;
> @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
>  		vhost_poll_init(&dev->vqs[i].poll,
>  				dev->vqs[i].handle_kick, POLLIN, dev);
>  	}
> -	return 0;
> +
> +	wake_up_process(poller);	/* avoid contributing to loadavg */
> +	ret = 0;
> +out:
> +	if (ret && poller)

It's still buggy..

> +		kthread_stop(poller);
> +	free_cpumask_var(mask);
> +	return ret;
>  }
>
>  /* Caller should have device mutex */
>
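The remaining bug the review points at: after `kthread_create()` fails, `poller` holds an ERR_PTR-encoded error, which is non-NULL, so `if (ret && poller)` would still call `kthread_stop()` on it. A sketch of the guard the error path needs; `should_stop_poller` is a hypothetical helper, and `MAX_ERRNO`/`IS_ERR_VALUE` mirror the kernel macros in userspace:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace analogues of the kernel's ERR_PTR machinery: error codes
 * are encoded as pointers in the top MAX_ERRNO bytes of the address
 * space, so they are non-NULL but still not real task pointers. */
#define MAX_ERRNO 4095
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static int is_err(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

/* Returns nonzero iff kthread_stop(poller) would be safe to call:
 * an error occurred AND poller is a real task pointer. */
static int should_stop_poller(int ret, void *poller)
{
	return ret != 0 && poller != NULL && !is_err(poller);
}
```

The `poller != NULL` test alone (as in the updated patch) covers the early `alloc_cpumask_var` failure but not the `kthread_create` failure, which is exactly why the reviewer says it's still buggy.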
Re: [PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide
Zhang, Yanmin wrote:
> From: Zhang, Yanmin
>
> Parameter --pid (or -p) of perf currently means a thread-wide collection.
> For exmaple, if a process whose id is has 10 threads, 'perf top -p '
> just collects the main thread statistics. That's misleading. Users are
> used to attach a whole process when debugging a process by gdb. To follow
> normal usage style, the patch change --pid to process-wide collection and
> add --tid (-t) to mean a thread-wide collection.
>
> Usage example is:
> #perf top -p
> #perf record -p -f sleep 10
> #perf stat -p -f sleep 10
> Above commands collect the statistics of all threads of process .
>
> Signed-off-by: Zhang Yanmin
>

Seems this patch causes seg faults:

# ./perf sched record
Segmentation fault
# ./perf kmem record
Segmentation fault
# ./perf timechart record
Segmentation fault
Re: [PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide
>>> Parameter --pid (or -p) of perf currently means a thread-wide collection.
>>> For exmaple, if a process whose id is has 10 threads, 'perf top -p '
>>> just collects the main thread statistics. That's misleading. Users are
>>> used to attach a whole process when debugging a process by gdb. To follow
>>> normal usage style, the patch change --pid to process-wide collection and
>>> add --tid (-t) to mean a thread-wide collection.
>>>
>>> Usage example is:
>>> #perf top -p
>>> #perf record -p -f sleep 10
>>> #perf stat -p -f sleep 10
>>> Above commands collect the statistics of all threads of process .
>>>
>>> Signed-off-by: Zhang Yanmin
>>>
>> Seems this patch causes seg faults:
>>
>> # ./perf sched record
>> Segmentation fault
>> # ./perf kmem record
>> Segmentation fault
>> # ./perf timechart record
>> Segmentation fault
>
> Thanks for reporting it. Arnaldo, could you pick up below patch?
> Zefan, Could you try it?
>

The fix works. Thanks!

> mmap_array[][][] is not reset to 0 after malloc. Below patch against
> tip/master of March 24th fixes it with a zalloc.
>
> Reported-by: Li Zefan
> Signed-off-by: Zhang Yanmin
>
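The bug class behind this segfault is easy to reproduce in miniature: `malloc()` returns uninitialized memory, so an array of pointers may contain garbage that later code mistakes for valid entries, while `calloc()` (the userspace analogue of the kernel/tools `zalloc`) guarantees zeroed slots. A small sketch; `alloc_ptr_array` is an illustrative helper, not perf code:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an array of n pointers with every slot guaranteed NULL.
 * With plain malloc() the slots would be uninitialized, and code that
 * checks "is this entry set?" can dereference garbage. */
static void **alloc_ptr_array(size_t n)
{
	return calloc(n, sizeof(void *));	/* every slot starts out NULL */
}
```

This is exactly the fix described above: swapping the `mmap_array` allocation from malloc to zalloc so unset entries are reliably NULL.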
Re: [PATCH 1/3] cgroup - removing superfluous rcu_read_lock_held check
On 2010/11/02 03:15, Jiri Olsa wrote:
> hi,

This..

> the rcu_dereference_check is defined as
>
> #define rcu_dereference_check(p, c) \
> 	__rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)
>
> so the caller does not need to specify rcu_read_lock_held() condition.
>
> wbr,
> jirka

and this should be excluded from the changelog.

>
> Signed-off-by: Jiri Olsa

Reviewed-by: Li Zefan

However a nitpick:

> ---
>  include/linux/cgroup.h |  1 -
>  kernel/cgroup.c        |  6 ++
>  2 files changed, 2 insertions(+), 5 deletions(-)

...

> @@ -4544,7 +4542,7 @@ unsigned short css_id(struct cgroup_subsys_state *css)
>  	 * it's unchanged until freed.
>  	 */
>  	cssid = rcu_dereference_check(css->id,
> -			rcu_read_lock_held() || atomic_read(&css->refcnt));
> +			atomic_read(&css->refcnt));

Now the 2 lines can be made into one line and still fit into 80 chars.

>
>  	if (cssid)
>  		return cssid->id;
> @@ -4557,7 +4555,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css)
>  	struct css_id *cssid;
>
>  	cssid = rcu_dereference_check(css->id,
> -			rcu_read_lock_held() || atomic_read(&css->refcnt));
> +			atomic_read(&css->refcnt));

ditto

>
>  	if (cssid)
>  		return cssid->depth;
Re: [PATCH] cgroups: fix API thinko
(Just came back from vacation)

Michael S. Tsirkin wrote:
> cgroup_attach_task_current_cg API that have upstream is backwards: we
> really need an API to attach to the cgroups from another process A to
> the current one.
>
> In our case (vhost), a priveledged user wants to attach it's task to cgroups
> from a less priveledged one, the API makes us run it in the other
> task's context, and this fails.
>
> So let's make the API generic and just pass in 'from' and 'to' tasks.
> Add an inline wrapper for cgroup_attach_task_current_cg to avoid
> breaking bisect.
>
> Signed-off-by: Michael S. Tsirkin

Acked-by: Li Zefan

I also don't like the name, but I'm not good at English or naming. ;)

> ---
>
> Paul, Li, Sridhar, could you please review the following
> patch?
>
> I only compile-tested it due to travel, but looks
> straight-forward to me.
> Alex Williamson volunteered to test and report the results.
> Sending out now for review as I might be offline for a bit.
> Will only try to merge when done, obviously.
>
> If OK, I would like to merge this through -net tree,
> together with the patch fixing vhost-net.
> Let me know if that sounds ok.
>

That's OK.

...

> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 43b2072..b38ec60 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -525,7 +525,11 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
>  void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
>  int cgroup_scan_tasks(struct cgroup_scanner *scan);
>  int cgroup_attach_task(struct cgroup *, struct task_struct *);
> -int cgroup_attach_task_current_cg(struct task_struct *);
> +int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);

a nitpick: better add a blank line here.

> +static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
> +{
> +	return cgroup_attach_task_all(current, tsk);
> +}
Re: ftrace/perf_event leak
Avi Kivity wrote:
> On 09/01/2010 12:04 PM, Peter Zijlstra wrote:
>>
>> Does something like the below cure that?
>>
>
> Unfortunately not.
>

Then try this:

The bug should be caused by commit 1c024eca51fdc965290acf342ae16a476c2189d0.

---
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 8a2b73f..9d1d1f2 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -161,6 +161,7 @@ void perf_trace_destroy(struct perf_event *p_event)
 		}
 	}
 out:
+	module_put(tp_event->mod);
 	mutex_unlock(&event_mutex);
 }
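The leak fixed here is a classic get/put imbalance: every successful `try_module_get()` must be matched by exactly one `module_put()`, both on the teardown path and on the init-failure path. A userspace sketch of the invariant, with stubbed `struct module`, `event_init`, and `event_destroy` standing in for the real perf/trace code (all names here are illustrative):

```c
#include <assert.h>

/* Toy refcount model: try_module_get bumps the count, module_put drops
 * it.  If the count never returns to zero, the module can't unload. */
struct module { int refcnt; };

static int try_module_get(struct module *m) { m->refcnt++; return 1; }
static void module_put(struct module *m)    { m->refcnt--; }

/* init_ok selects whether the (stubbed) event init step succeeds. */
static int event_init(struct module *m, int init_ok)
{
	if (!try_module_get(m))
		return -1;
	if (!init_ok) {
		module_put(m);	/* the hunk the fix adds on the failure path */
		return -1;
	}
	return 0;
}

static void event_destroy(struct module *m)
{
	module_put(m);		/* the hunk the fix adds on the destroy path */
}
```

Without the `module_put()` in `event_destroy` (the bug in commit 1c024eca), the count would sit at 1 after every init/destroy cycle, which is the refcount leak Avi observed.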
Re: ftrace/perf_event leak
>> Subject: perf, trace: Fix module leak
>> From: Li Zefan
>> Date: Wed Sep 01 12:58:43 CEST 2010
>>
>> Commit 1c024eca (perf, trace: Optimize tracepoints by using
>> per-tracepoint-per-cpu hlist to track events) caused a module refcount
>> leak.
>>
>> Tested-by: Avi Kivity
>> Signed-off-by: Peter Zijlstra
>> LKML-Reference: <4c7e1f12.8030...@cn.fujitsu.com>
>> ---
>>  kernel/trace/trace_event_perf.c |  3 +++
>>  1 file changed, 3 insertions(+)
>>
>> Index: linux-2.6/kernel/trace/trace_event_perf.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/trace/trace_event_perf.c
>> +++ linux-2.6/kernel/trace/trace_event_perf.c
>> @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p
>>  		    tp_event->class && tp_event->class->reg &&
>>  		    try_module_get(tp_event->mod)) {
>>  			ret = perf_trace_event_init(tp_event, p_event);
>> +			if (ret)
>> +				module_put(tp_event->mod);
>>  			break;
>>  		}
>>  	}
>> @@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_even
>>  		}
>>  	}
>>  out:
>> +	module_put(tp_event->mod);
>>  	mutex_unlock(&event_mutex);
>>  }
>>
>
> Thanks for fixing this.
>
> However, can we split this in two patches to ease the backport?
>
> The lack of a module_put() after perf_trace_init() failure is there for
> a while (the backport needs to start in 2.6.32).

The failure should be a rare case, I don't think this has to be backported?

>
> But the lack of a module_put in the destroy path needs a .35 backport only.
>