Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 4:08 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
>> +#ifdef __KERNEL__
>> +# include
>> +#else /* #ifdef __KERNEL__ */
>
> Please drop these comments. They are distracting and not helpful at
> all. They are valuable for long #ifdeffed sections but then the normal form
> is:
>
> /* __KERNEL__ */
>
> /* !__KERNEL__ */
>

ok

>> +# include
>> +#endif /* #else #ifdef __KERNEL__ */
>> +
>> +#include
>> +
>> +#ifdef __LP64__
>> +# define RSEQ_FIELD_u32_u64(field) uint64_t field
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>
> Can you please make this decision separate and propagate the result ?

Something like this ?

#ifdef __BYTE_ORDER
# if (__BYTE_ORDER == __BIG_ENDIAN)
#  define RSEQ_BYTE_ORDER_BIG_ENDIAN
# else
#  define RSEQ_BYTE_ORDER_LITTLE_ENDIAN
# endif
#else
# ifdef __BIG_ENDIAN
#  define RSEQ_BYTE_ORDER_BIG_ENDIAN
# else
#  define RSEQ_BYTE_ORDER_LITTLE_ENDIAN
# endif
#endif

#ifdef __LP64__
# define RSEQ_FIELD_u32_u64(field) uint64_t field
# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
#else
# ifdef RSEQ_BYTE_ORDER_BIG_ENDIAN
#  define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field
#  define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
        field ## _padding = 0, field = (intptr_t)v
# else
#  define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding
#  define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
        field = (intptr_t)v, field ## _padding = 0
# endif
#endif

>
>> +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
>> +field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
>> +field = (intptr_t)v, field ## _padding = 0
>> +#endif
>> +
>> +enum rseq_flags {
>> +RSEQ_FLAG_UNREGISTER = (1 << 0),
>> +};
>> +
>> +enum rseq_cs_flags {
>> +RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = (1U << 0),
>> +RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = (1U << 1),
>> +RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << 2),
>> +};
>> +
>> +/*
>> + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>> + * contained within a single cache-line. It is usually declared as
>> + * link-time constant data.
>> + */
>> +struct rseq_cs {
>> +uint32_t version; /* Version of this structure. */
>> +uint32_t flags; /* enum rseq_cs_flags */
>> +RSEQ_FIELD_u32_u64(start_ip);
>> +RSEQ_FIELD_u32_u64(post_commit_offset); /* From start_ip */

I'll move the tail comments to their own line.

>> +RSEQ_FIELD_u32_u64(abort_ip);
>> +} __attribute__((aligned(4 * sizeof(uint64_t))));
>> +
>> +/*
>> + * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>> + * contained within a single cache-line.
>> + *
>> + * A single struct rseq per thread is allowed.
>> + */
>> +struct rseq {
>> +/*
>> + * Restartable sequences cpu_id_start field. Updated by the
>> + * kernel, and read by user-space with single-copy atomicity
>> + * semantics. Aligned on 32-bit. Always contain a value in the
>
> contains

ok

>
>> + * range of possible CPUs, although the value may not be the
>> + * actual current CPU (e.g. if rseq is not initialized). This
>> + * CPU number value should always be confirmed against the value
>> + * of the cpu_id field.
>
> Who is supposed to confirm that? I think I know what the purpose of the
> field is, but from that comment it's not obvious at all.

Proposed update:

/*
 * Restartable sequences cpu_id_start field. Updated by the
 * kernel, and read by user-space with single-copy atomicity
 * semantics. Aligned on 32-bit. Always contains a value in the
 * range of possible CPUs, although the value may not be the
 * actual current CPU (e.g. if rseq is not initialized). This
 * CPU number value should always be compared against the value
 * of the cpu_id field before performing a rseq commit or
 * returning a value read from a data structure indexed using
 * the cpu_id_start value.
 */
uint32_t cpu_id_start;

>
>> + */
>> +uint32_t cpu_id_start;
>> +/*
>> + * Restartable sequences cpu_id field. Updated by the kernel,
>> + * and read by user-space with single-copy atomicity semantics.
>
> Again. What's the purpose of reading it.
>

Update:

/*
 * Restartable sequences cpu_id field. Updated by the kernel,
 * and read by user-space with single-copy atomicity semantics.
 * Aligned on 32-bit. Values RSEQ_CPU_ID_UNINITIALIZED and
 * RSEQ_CPU_ID_REGIS
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Tue, 14 Nov 2017, Mathieu Desnoyers wrote: > +#ifdef __KERNEL__ > +# include > +#else/* #ifdef __KERNEL__ */ Please drop these comments. They are distracting and not helpful at all. They are valuable for long #ideffed sections but then the normal form is: /* __KERNEL__ */ /* !__KERNEL__ */ > +# include > +#endif /* #else #ifdef __KERNEL__ */ > + > +#include > + > +#ifdef __LP64__ > +# define RSEQ_FIELD_u32_u64(field) uint64_t field > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v > +#elif defined(__BYTE_ORDER) ? \ > + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) Can you please make this decision separate and propagate the result ? > +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field ## _padding = 0, field = (intptr_t)v > +#else > +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field = (intptr_t)v, field ## _padding = 0 > +#endif > + > +enum rseq_flags { > + RSEQ_FLAG_UNREGISTER = (1 << 0), > +}; > + > +enum rseq_cs_flags { > + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = (1U << 0), > + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = (1U << 1), > + RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << 2), > +}; > + > +/* > + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always > + * contained within a single cache-line. It is usually declared as > + * link-time constant data. > + */ > +struct rseq_cs { > + uint32_t version; /* Version of this structure. */ > + uint32_t flags; /* enum rseq_cs_flags */ > + RSEQ_FIELD_u32_u64(start_ip); > + RSEQ_FIELD_u32_u64(post_commit_offset); /* From start_ip */ > + RSEQ_FIELD_u32_u64(abort_ip); > +} __attribute__((aligned(4 * sizeof(uint64_t; > + > +/* > + * struct rseq is aligned on 4 * 8 bytes to ensure it is always > + * contained within a single cache-line. > + * > + * A single struct rseq per thread is allowed. > + */ > +struct rseq { > + /* > + * Restartable sequences cpu_id_start field. Updated by the > + * kernel, and read by user-space with single-copy atomicity > + * semantics. Aligned on 32-bit. Always contain a value in the contains > + * range of possible CPUs, although the value may not be the > + * actual current CPU (e.g. if rseq is not initialized). This > + * CPU number value should always be confirmed against the value > + * of the cpu_id field. Who is supposed to confirm that? I think I know what the purpose of the field is, but from that comment it's not obvious at all. > + */ > + uint32_t cpu_id_start; > + /* > + * Restartable sequences cpu_id field. Updated by the kernel, > + * and read by user-space with single-copy atomicity semantics. Again. What's the purpose of reading it. > + * Aligned on 32-bit. Values -1U and -2U have a special > + * semantic: -1U means "rseq uninitialized", and -2U means "rseq > + * initialization failed". > + */ > + uint32_t cpu_id; > + /* > + * Restartable sequences rseq_cs field. > + * > + * Contains NULL when no critical section is active for the current > + * thread, or holds a pointer to the currently active struct rseq_cs. > + * > + * Updated by user-space at the beginning of assembly instruction > + * sequence block, and by the kernel when it restarts an assembly > + * instruction sequence block, and when the kernel detects that it > + * is preempting or delivering a signal outside of the range > + * targeted by the rseq_cs. 
Also needs to be cleared by user-space > + * before reclaiming memory that contains the targeted struct > + * rseq_cs. This paragraph is pretty convoluted and it's not really clear what the actual purpose is and how it is supposed to be used. It's NULL when no critical section is active. It holds a pointer to a struct rseq_cs when a critical section is active. Fine Now the update rules: - By user space at the start of the critical section, i.e. user space sets the pointer to rseq_cs - By the kernel when it restarts a sequence block etc What happens to this field? Is the pointer updated or cleared or what? How is the kernel supposed to fiddle with the pointer? > + * > + * Read and set by the kernel with single-copy atomicity semantics. This looks like it's purely kernel owned, but above you say it's written by user space. There are no rules for user space? > + * Aligned on 64-bit. > + */ > + RSEQ_FIELD_u32_u64(rseq_cs); > + /* > + * - RSEQ_DISABLE flag: > + * > + * Fallback fast-track flag for single-stepping. > + * Set by user-space if lack of progress is detected. > + * Cleared by user-space after rseq finish. > + * Read
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Thu, Nov 16, 2017 at 08:37:58PM +, Mathieu Desnoyers wrote:
> I usually never space-align with open parenthesis "(". Is it a coding
> style requirement of the kernel for multi-line if () conditions ?

Not sure, but it is the predominant pattern in most of the code.

> Would the following replacement code be ok ?
>
>        if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
>                if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>                                | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
>                                (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>                                | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>                        return -EINVAL;

I really prefer the operator at the end,

  git grep "&&$" | wc -l
  40708

  git grep "^[[:space:]]*&&" | wc -l
  3901
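For illustration only, here is that check rewritten as a small self-contained
user-space snippet with the binary operators at the ends of the lines, as
preferred above. The check_cs_flags() helper and the plain C context are made
up for the example; the actual code lives in kernel/rseq.c and uses
unlikely():

#include <errno.h>
#include <stdint.h>

enum rseq_cs_flags {
        RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = (1U << 0),
        RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL  = (1U << 1),
        RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << 2),
};

/* Reject NO_RESTART_ON_SIGNAL unless the other two inhibit flags are set too. */
int check_cs_flags(uint32_t flags)
{
        if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
                if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE |
                              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
                    (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE |
                     RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
                        return -EINVAL;
        }
        return 0;
}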
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 2:14 PM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: >> +static bool rseq_update_cpu_id(struct task_struct *t) >> +{ >> +uint32_t cpu_id = raw_smp_processor_id(); >> + >> +if (__put_user(cpu_id, &t->rseq->cpu_id_start)) >> +return false; >> +if (__put_user(cpu_id, &t->rseq->cpu_id)) >> +return false; > > For LP64 this _could_ be a single 64bit store, right? It would save some > stac/clac noise on x86_64. Yes it could, but last time I checked a __put_user of a u64 did not guarantee single-copy atomicity of each of the two 32-bit words on 32-bit architectures, so I figured that it would be better to postpone that optimization to a point where architectures would provide a u64 __put_user that guarantee single-copy atomicity of each 32-bit word on 32-bit architectures. > >> +trace_rseq_update(t); >> +return true; >> +} > >> +static bool rseq_get_rseq_cs(struct task_struct *t, > > bool return value, but is used as a C int error value later (it works, > but is inconsistent). I can do the following on the caller side instead: if (!rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip, &cs_flags)) return -EFAULT; > >> +void __user **start_ip, >> +unsigned long *post_commit_offset, >> +void __user **abort_ip, >> +uint32_t *cs_flags) > > That's a fair amount of arguments, and I suppose that isn't a problem > because there's only the one callsite and it all gets inlined anyway. Yep. > >> +{ >> +unsigned long ptr; >> +struct rseq_cs __user *urseq_cs; >> +struct rseq_cs rseq_cs; >> +u32 __user *usig; >> +u32 sig; >> + >> +if (__get_user(ptr, &t->rseq->rseq_cs)) >> +return false; >> +if (!ptr) >> +return true; >> +urseq_cs = (struct rseq_cs __user *)ptr; >> +if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs))) >> +return false; >> +/* >> + * We need to clear rseq_cs upon entry into a signal handler >> + * nested on top of a rseq assembly block, so the signal handler >> + * will not be fixed up if itself interrupted by a nested signal >> + * handler or preempted. We also need to clear rseq_cs if we >> + * preempt or deliver a signal on top of code outside of the >> + * rseq assembly block, to ensure that a following preemption or >> + * signal delivery will not try to perform a fixup needlessly. >> + */ >> +if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs))) >> +return false; >> +if (rseq_cs.version > 0) >> +return false; > >> +*cs_flags = rseq_cs.flags; >> +*start_ip = (void __user *)rseq_cs.start_ip; >> +*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset; >> +*abort_ip = (void __user *)rseq_cs.abort_ip; > >> +usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32)); >> +if (get_user(sig, usig)) >> +return false; > ok for adding newlines. >> +if (current->rseq_sig != sig) { >> +printk_ratelimited(KERN_WARNING >> +"Possible attack attempt. Unexpected rseq signature >> 0x%x, expecting 0x%x >> (pid=%d, addr=%p).\n", >> +sig, current->rseq_sig, current->pid, usig); >> +return false; >> +} >> +return true; >> +} >> + >> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags) >> +{ >> +bool need_restart = false; >> +uint32_t flags; >> + >> +/* Get thread flags. */ >> +if (__get_user(flags, &t->rseq->flags)) >> +return -EFAULT; >> + >> +/* Take into account critical section flags. */ >> +flags |= cs_flags; >> + >> +/* >> + * Restart on signal can only be inhibited when restart on >> + * preempt and restart on migrate are inhibited too. 
Otherwise, >> + * a preempted signal handler could fail to restart the prior >> + * execution context on sigreturn. >> + */ >> +if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) { >> +if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) >> +return -EINVAL; >> +if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) >> +return -EINVAL; >> +} >> +if (t->rseq_migrate >> +&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) > > That's a horrible code form, please put the && at the end of the > previous line and begin the next line aligned with the (, like: > > if (t->rseq_migrate && > !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) > > Luckily you've already killed this code, but try and remember for a next > time ;-) I usually never space-align with open parenthesis "(". Is it a coding style requirement of the kernel for multi-line if () conditions ? Would the following replacement code be ok
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 2:06 PM, Thomas Gleixner t...@linutronix.de wrote: > On Thu, 16 Nov 2017, Mathieu Desnoyers wrote: >> - On Nov 16, 2017, at 1:43 PM, Peter Zijlstra pet...@infradead.org wrote: >> >> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: >> >> +/* >> >> + * If parent process has a registered restartable sequences area, the >> >> + * child inherits. Only applies when forking a process, not a thread. In >> >> + * case a parent fork() in the middle of a restartable sequence, set the >> >> + * resume notifier to force the child to retry. >> >> + */ >> >> +static inline void rseq_fork(struct task_struct *t, unsigned long >> >> clone_flags) >> >> +{ >> >> + if (clone_flags & CLONE_THREAD) { >> >> + t->rseq = NULL; >> >> + t->rseq_len = 0; >> >> + t->rseq_sig = 0; >> >> + } else { >> >> + t->rseq = current->rseq; >> >> + t->rseq_len = current->rseq_len; >> >> + t->rseq_sig = current->rseq_sig; >> >> + rseq_set_notify_resume(t); >> >> + } >> >> +} >> > >> > This hurts my brain... what happens if you fork a multi-threaded >> > process? >> > >> > Do we fully inherit the TLS state of the calling thread? >> >> Yes, exactly. The user-space TLS should be inherited from that of >> the calling thread. >> >> At kernel-level, the only thing that's not inherited here is the >> task struct rseq_event_mask, which tracks whether a restart is >> needed. But this would only be relevant if fork() can be invoked >> from a signal handler, or if fork() could be invoked from a >> rseq critical section (which really makes little sense). > > Whether it makes sense or not does not matter much, especially in context > of user space. You cannot make assumptions like that. When something can be > done, then it's bound to happen sooner than later because somebody thinks > he is extra clever. > > The first priority is robustness in any aspect which has to do with user > space. > >> Should I copy the current->rseq_event_mask on process fork just to >> be on the safe side though ? > > I think so, unless you let fork() fail when invoked from a rseq critical > section. Allright, I'll set the rseq_event_mask to 0 explicitly on exec() and thread-fork, and copy it from the parent on process-fork. Thanks, Mathieu > > Thanks, > > tglx -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
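To make the agreed-upon plan concrete, here is a rough sketch of how
rseq_fork() could look once the three booleans are replaced by a single
rseq_event_mask, combining the hunk quoted above with the copy-on-process-fork
behaviour discussed here. This is an illustration, not the actual follow-up
patch; the exec() path would clear rseq_event_mask separately:

static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
{
        if (clone_flags & CLONE_THREAD) {
                /* A new thread starts unregistered, with no pending events. */
                t->rseq = NULL;
                t->rseq_len = 0;
                t->rseq_sig = 0;
                t->rseq_event_mask = 0;
        } else {
                /* fork() of a process: inherit the calling thread's state. */
                t->rseq = current->rseq;
                t->rseq_len = current->rseq_len;
                t->rseq_sig = current->rseq_sig;
                t->rseq_event_mask = current->rseq_event_mask;
                rseq_set_notify_resume(t);
        }
}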
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: > +static bool rseq_update_cpu_id(struct task_struct *t) > +{ > + uint32_t cpu_id = raw_smp_processor_id(); > + > + if (__put_user(cpu_id, &t->rseq->cpu_id_start)) > + return false; > + if (__put_user(cpu_id, &t->rseq->cpu_id)) > + return false; For LP64 this _could_ be a single 64bit store, right? It would save some stac/clac noise on x86_64. > + trace_rseq_update(t); > + return true; > +} > +static bool rseq_get_rseq_cs(struct task_struct *t, bool return value, but is used as a C int error value later (it works, but is inconsistent). > + void __user **start_ip, > + unsigned long *post_commit_offset, > + void __user **abort_ip, > + uint32_t *cs_flags) That's a fair amount of arguments, and I suppose that isn't a problem because there's only the one callsite and it all gets inlined anyway. > +{ > + unsigned long ptr; > + struct rseq_cs __user *urseq_cs; > + struct rseq_cs rseq_cs; > + u32 __user *usig; > + u32 sig; > + > + if (__get_user(ptr, &t->rseq->rseq_cs)) > + return false; > + if (!ptr) > + return true; > + urseq_cs = (struct rseq_cs __user *)ptr; > + if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs))) > + return false; > + /* > + * We need to clear rseq_cs upon entry into a signal handler > + * nested on top of a rseq assembly block, so the signal handler > + * will not be fixed up if itself interrupted by a nested signal > + * handler or preempted. We also need to clear rseq_cs if we > + * preempt or deliver a signal on top of code outside of the > + * rseq assembly block, to ensure that a following preemption or > + * signal delivery will not try to perform a fixup needlessly. > + */ > + if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs))) > + return false; > + if (rseq_cs.version > 0) > + return false; > + *cs_flags = rseq_cs.flags; > + *start_ip = (void __user *)rseq_cs.start_ip; > + *post_commit_offset = (unsigned long)rseq_cs.post_commit_offset; > + *abort_ip = (void __user *)rseq_cs.abort_ip; > + usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32)); > + if (get_user(sig, usig)) > + return false; > + if (current->rseq_sig != sig) { > + printk_ratelimited(KERN_WARNING > + "Possible attack attempt. Unexpected rseq signature > 0x%x, expecting 0x%x (pid=%d, addr=%p).\n", > + sig, current->rseq_sig, current->pid, usig); > + return false; > + } > + return true; > +} > + > +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags) > +{ > + bool need_restart = false; > + uint32_t flags; > + > + /* Get thread flags. */ > + if (__get_user(flags, &t->rseq->flags)) > + return -EFAULT; > + > + /* Take into account critical section flags. */ > + flags |= cs_flags; > + > + /* > + * Restart on signal can only be inhibited when restart on > + * preempt and restart on migrate are inhibited too. Otherwise, > + * a preempted signal handler could fail to restart the prior > + * execution context on sigreturn. 
> + */ > + if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) { > + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) > + return -EINVAL; > + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) > + return -EINVAL; > + } > + if (t->rseq_migrate > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) That's a horrible code form, please put the && at the end of the previous line and begin the next line aligned with the (, like: if (t->rseq_migrate && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) Luckily you've already killed this code, but try and remember for a next time ;-) > + need_restart = true; > + else if (t->rseq_preempt > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) > + need_restart = true; > + else if (t->rseq_signal > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) > + need_restart = true; > + > + t->rseq_preempt = false; > + t->rseq_signal = false; > + t->rseq_migrate = false; > + if (need_restart) > + return 1; > + return 0; > +} > + > +static int rseq_ip_fixup(struct pt_regs *regs) > +{ > + struct task_struct *t = current; > + void __user *start_ip = NULL; > + unsigned long post_commit_offset = 0; > + void __user *abort_ip = NULL; > + uint32_t cs_flags = 0; > + int ret; unsigned long ip = instruction_pointer(regs); > + > + ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip, > +
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Thu, 16 Nov 2017, Mathieu Desnoyers wrote: > - On Nov 16, 2017, at 1:43 PM, Peter Zijlstra pet...@infradead.org wrote: > > > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: > >> +/* > >> + * If parent process has a registered restartable sequences area, the > >> + * child inherits. Only applies when forking a process, not a thread. In > >> + * case a parent fork() in the middle of a restartable sequence, set the > >> + * resume notifier to force the child to retry. > >> + */ > >> +static inline void rseq_fork(struct task_struct *t, unsigned long > >> clone_flags) > >> +{ > >> + if (clone_flags & CLONE_THREAD) { > >> + t->rseq = NULL; > >> + t->rseq_len = 0; > >> + t->rseq_sig = 0; > >> + } else { > >> + t->rseq = current->rseq; > >> + t->rseq_len = current->rseq_len; > >> + t->rseq_sig = current->rseq_sig; > >> + rseq_set_notify_resume(t); > >> + } > >> +} > > > > This hurts my brain... what happens if you fork a multi-threaded > > process? > > > > Do we fully inherit the TLS state of the calling thread? > > Yes, exactly. The user-space TLS should be inherited from that of > the calling thread. > > At kernel-level, the only thing that's not inherited here is the > task struct rseq_event_mask, which tracks whether a restart is > needed. But this would only be relevant if fork() can be invoked > from a signal handler, or if fork() could be invoked from a > rseq critical section (which really makes little sense). Whether it makes sense or not does not matter much, especially in context of user space. You cannot make assumptions like that. When something can be done, then it's bound to happen sooner than later because somebody thinks he is extra clever. The first priority is robustness in any aspect which has to do with user space. > Should I copy the current->rseq_event_mask on process fork just to > be on the safe side though ? I think so, unless you let fork() fail when invoked from a rseq critical section. Thanks, tglx
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 1:43 PM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: >> +/* >> + * If parent process has a registered restartable sequences area, the >> + * child inherits. Only applies when forking a process, not a thread. In >> + * case a parent fork() in the middle of a restartable sequence, set the >> + * resume notifier to force the child to retry. >> + */ >> +static inline void rseq_fork(struct task_struct *t, unsigned long >> clone_flags) >> +{ >> +if (clone_flags & CLONE_THREAD) { >> +t->rseq = NULL; >> +t->rseq_len = 0; >> +t->rseq_sig = 0; >> +} else { >> +t->rseq = current->rseq; >> +t->rseq_len = current->rseq_len; >> +t->rseq_sig = current->rseq_sig; >> +rseq_set_notify_resume(t); >> +} >> +} > > This hurts my brain... what happens if you fork a multi-threaded > process? > > Do we fully inherit the TLS state of the calling thread? Yes, exactly. The user-space TLS should be inherited from that of the calling thread. At kernel-level, the only thing that's not inherited here is the task struct rseq_event_mask, which tracks whether a restart is needed. But this would only be relevant if fork() can be invoked from a signal handler, or if fork() could be invoked from a rseq critical section (which really makes little sense). Should I copy the current->rseq_event_mask on process fork just to be on the safe side though ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: > +/* > + * If parent process has a registered restartable sequences area, the > + * child inherits. Only applies when forking a process, not a thread. In > + * case a parent fork() in the middle of a restartable sequence, set the > + * resume notifier to force the child to retry. > + */ > +static inline void rseq_fork(struct task_struct *t, unsigned long > clone_flags) > +{ > + if (clone_flags & CLONE_THREAD) { > + t->rseq = NULL; > + t->rseq_len = 0; > + t->rseq_sig = 0; > + } else { > + t->rseq = current->rseq; > + t->rseq_len = current->rseq_len; > + t->rseq_sig = current->rseq_sig; > + rseq_set_notify_resume(t); > + } > +} This hurts my brain... what happens if you fork a multi-threaded process? Do we fully inherit the TLS state of the calling thread?
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 11:32 AM, Peter Zijlstra pet...@infradead.org wrote: > On Thu, Nov 16, 2017 at 04:27:07PM +, Mathieu Desnoyers wrote: >> - On Nov 16, 2017, at 11:18 AM, Peter Zijlstra pet...@infradead.org >> wrote: >> >> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: >> >> @@ -977,6 +978,13 @@ struct task_struct { >> >> unsigned long numa_pages_migrated; >> >> #endif /* CONFIG_NUMA_BALANCING */ >> >> >> >> +#ifdef CONFIG_RSEQ >> >> + struct rseq __user *rseq; >> >> + u32 rseq_len; >> >> + u32 rseq_sig; >> >> + bool rseq_preempt, rseq_signal, rseq_migrate; >> > >> > No bool please. Use something that has a defined size in ILP32/LP64. >> > _Bool makes it absolutely impossible to speculate on structure layout >> > across architectures. >> >> I should as well make all those a bitmask within a "u32 rseq_event_mask" >> then, >> sounds fair ? > > Sure, whatever works and isn't _Bool ;-) So something along those lines should do the trick (including the mask request from Ben Maurer): diff --git a/include/linux/sched.h b/include/linux/sched.h index b995a3b..44aef30 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -982,7 +982,7 @@ struct task_struct { struct rseq __user *rseq; u32 rseq_len; u32 rseq_sig; - bool rseq_preempt, rseq_signal, rseq_migrate; + u32 rseq_event_mask; #endif struct tlbflush_unmap_batch tlb_ubc; @@ -1676,6 +1676,16 @@ static inline void set_task_cpu(struct task_struct *p, un #endif #ifdef CONFIG_RSEQ +/* + * Map the event mask on the user-space ABI enum rseq_cs_flags + * for direct mask checks. + */ +enum rseq_event_mask { + RSEQ_EVENT_PREEMPT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT, + RSEQ_EVENT_SIGNAL = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL, + RSEQ_EVENT_MIGRATE = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE, +}; + static inline void rseq_set_notify_resume(struct task_struct *t) { if (t->rseq) @@ -1718,16 +1728,16 @@ static inline void rseq_sched_out(struct task_struct *t) } static inline void rseq_signal_deliver(struct pt_regs *regs) { - current->rseq_signal = true; + current->rseq_event_mask |= RSEQ_EVENT_SIGNAL; rseq_handle_notify_resume(regs); } static inline void rseq_preempt(struct task_struct *t) { - t->rseq_preempt = true; + t->rseq_event_mask |= RSEQ_EVENT_PREEMPT; } static inline void rseq_migrate(struct task_struct *t) { - t->rseq_migrate = true; + t->rseq_event_mask |= RSEQ_EVENT_MIGRATE; } #else static inline void rseq_set_notify_resume(struct task_struct *t) diff --git a/kernel/rseq.c b/kernel/rseq.c index 6f0d48c..d773003 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -159,7 +159,7 @@ static bool rseq_get_rseq_cs(struct task_struct *t, static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags) { bool need_restart = false; - uint32_t flags; + uint32_t flags, event_mask; /* Get thread flags. */ if (__get_user(flags, &t->rseq->flags)) @@ -174,26 +174,17 @@ static int rseq_need_restart(struct task_struct *t, uint32 * a preempted signal handler could fail to restart the prior * execution context on sigreturn. 
*/ - if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) { - if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) - return -EINVAL; - if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) + if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) { + if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE + | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) + != (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE +| RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) return -EINVAL; } - if (t->rseq_migrate - && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) - need_restart = true; - else if (t->rseq_preempt - && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) - need_restart = true; - else if (t->rseq_signal - && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) - need_restart = true; - - t->rseq_preempt = false; - t->rseq_signal = false; - t->rseq_migrate = false; - if (need_restart) + event_mask = t->rseq_event_mask; + t->rseq_event_mask = 0; + event_mask &= ~flags; + if (event_mask) return 1; return 0; } -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
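As an aside on the hunk above, the "event_mask &= ~flags" line is where the
simplification comes from: because the RSEQ_EVENT_* values deliberately reuse
the bit positions of the RSEQ_CS_FLAG_NO_RESTART_ON_* flags, one mask
operation clears every recorded event whose restart is inhibited, and a
restart is needed exactly when any bit survives. A small stand-alone
user-space illustration of that check (constants copied from the patch; the
need_restart() helper is made up for the example):

#include <assert.h>
#include <stdint.h>

enum rseq_cs_flags {
        RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = (1U << 0),
        RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL  = (1U << 1),
        RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << 2),
};

/* Event bits share the positions of the "no restart" flags above. */
enum rseq_event_mask {
        RSEQ_EVENT_PREEMPT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT,
        RSEQ_EVENT_SIGNAL  = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL,
        RSEQ_EVENT_MIGRATE = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE,
};

static int need_restart(uint32_t event_mask, uint32_t flags)
{
        return (event_mask & ~flags) != 0;
}

int main(void)
{
        /* Preempted, restart on preempt inhibited: no restart. */
        assert(!need_restart(RSEQ_EVENT_PREEMPT, RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT));
        /* Preempted, nothing inhibited: restart. */
        assert(need_restart(RSEQ_EVENT_PREEMPT, 0));
        /* Migrated, but only restart on preempt inhibited: restart. */
        assert(need_restart(RSEQ_EVENT_MIGRATE, RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT));
        return 0;
}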
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Thu, Nov 16, 2017 at 04:27:07PM +, Mathieu Desnoyers wrote: > - On Nov 16, 2017, at 11:18 AM, Peter Zijlstra pet...@infradead.org wrote: > > > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: > >> @@ -977,6 +978,13 @@ struct task_struct { > >>unsigned long numa_pages_migrated; > >> #endif /* CONFIG_NUMA_BALANCING */ > >> > >> +#ifdef CONFIG_RSEQ > >> + struct rseq __user *rseq; > >> + u32 rseq_len; > >> + u32 rseq_sig; > >> + bool rseq_preempt, rseq_signal, rseq_migrate; > > > > No bool please. Use something that has a defined size in ILP32/LP64. > > _Bool makes it absolutely impossible to speculate on structure layout > > across architectures. > > I should as well make all those a bitmask within a "u32 rseq_event_mask" then, > sounds fair ? Sure, whatever works and isn't _Bool ;-)
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 16, 2017, at 11:18 AM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: >> @@ -977,6 +978,13 @@ struct task_struct { >> unsigned long numa_pages_migrated; >> #endif /* CONFIG_NUMA_BALANCING */ >> >> +#ifdef CONFIG_RSEQ >> +struct rseq __user *rseq; >> +u32 rseq_len; >> +u32 rseq_sig; >> +bool rseq_preempt, rseq_signal, rseq_migrate; > > No bool please. Use something that has a defined size in ILP32/LP64. > _Bool makes it absolutely impossible to speculate on structure layout > across architectures. I should as well make all those a bitmask within a "u32 rseq_event_mask" then, sounds fair ? Thanks, Mathieu > >> +#endif >> + >> struct tlbflush_unmap_batch tlb_ubc; >> > > struct rcu_head rcu; -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote: > @@ -977,6 +978,13 @@ struct task_struct { > unsigned long numa_pages_migrated; > #endif /* CONFIG_NUMA_BALANCING */ > > +#ifdef CONFIG_RSEQ > + struct rseq __user *rseq; > + u32 rseq_len; > + u32 rseq_sig; > + bool rseq_preempt, rseq_signal, rseq_migrate; No bool please. Use something that has a defined size in ILP32/LP64. _Bool makes it absolutely impossible to speculate on structure layout across architectures. > +#endif > + > struct tlbflush_unmap_batch tlb_ubc; > > struct rcu_head rcu;
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
>>> int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t
>>> sig);
>>
>> Really dumb question -- and one I'm sorry to bring up at the last minute.
>> Should we consider making the syscall name something more generic
>> "register_tls_abi"?

> I proposed that approach back in 2016 ("tls abi" system call), and the
> feedback I received back then is that it was preferred to have a dedicated
> "rseq" system call than an "open ended" and generic "tls abi" system call.

Ultimately I'm fine either way. I do think that in the past few months of
review it has become clear that creating this tls abi requires a fair bit of
work. It'd be a shame to see a future attempt to use such an ABI made
difficult by forcing the author to figure out the registration process yet
again. I assume the maintainers of glibc would also like to avoid the need to
register multiple ABIs.

-b
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 14, 2017, at 3:49 PM, Ben Maurer bmau...@fb.com wrote: > (apologies for the duplicate email, the previous one bounced as it was > accidentally using HTML formatting) > > If I understand correctly this is run on every context switch so we probably > want to make it really fast Yes, more precisely, it runs on return to user-space, after every context switch going back to a registered rseq thread. > >> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags) >> +{ >> + bool need_restart = false; >> + uint32_t flags; >> + >> + /* Get thread flags. */ >> + if (__get_user(flags, &t->rseq->flags)) >> + return -EFAULT; >> + >> + /* Take into account critical section flags. */ >> + flags |= cs_flags; >> + >> + /* >> +* Restart on signal can only be inhibited when restart on >> +* preempt and restart on migrate are inhibited too. Otherwise, >> +* a preempted signal handler could fail to restart the prior >> +* execution context on sigreturn. >> +*/ >> + if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) { >> + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) >> + return -EINVAL; >> + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) >> + return -EINVAL; >> + } > > How does this error even get to userspace? Is it worth doing this switch on > every execution? If we detect this situation, the rseq_need_restart caller will end up sending a SIGSEGV signal to user-space. Note that the two nested if() checks are only executing in the unlikely case where the NO_RESTART_ON_SIGNAL flag is set. > > >> + if (t->rseq_migrate >> + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) >> + need_restart = true; >> + else if (t->rseq_preempt >> + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) >> + need_restart = true; >> + else if (t->rseq_signal >> + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) >> + need_restart = true; > > This could potentially be sped up by having the rseq_* fields in t use a > single > bitmask with the same bit offsets as RSEQ_CS_FLAG_NO_* then using bit > operations to check the appropriate overlap. Given that those are not requests impacting the ABI presented to user-space, I'm tempted to keep these optimizations for the following 4.16 merge window. Is that ok with you ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
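For readers wondering how that error actually reaches user-space: the -EINVAL
propagates up through the notify-resume handler (rseq_handle_notify_resume()
in this series), which then kills the offending task. A rough sketch of that
path, based on the behaviour described in this thread rather than the exact
patch text (the "_sketch" function name is made up; the helpers are the ones
quoted elsewhere in the thread):

static void rseq_handle_notify_resume_sketch(struct pt_regs *regs)
{
        struct task_struct *t = current;
        int ret;

        ret = rseq_ip_fixup(regs);      /* ends up calling rseq_need_restart() */
        if (unlikely(ret < 0))
                goto error;
        if (unlikely(!rseq_update_cpu_id(t)))
                goto error;
        return;

error:
        force_sig(SIGSEGV, t);          /* the -EINVAL surfaces as a SIGSEGV */
}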
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
(apologies for the duplicate email, the previous one bounced as it was accidentally using HTML formatting) If I understand correctly this is run on every context switch so we probably want to make it really fast > +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags) > +{ > + bool need_restart = false; > + uint32_t flags; > + > + /* Get thread flags. */ > + if (__get_user(flags, &t->rseq->flags)) > + return -EFAULT; > + > + /* Take into account critical section flags. */ > + flags |= cs_flags; > + > + /* > +* Restart on signal can only be inhibited when restart on > +* preempt and restart on migrate are inhibited too. Otherwise, > +* a preempted signal handler could fail to restart the prior > +* execution context on sigreturn. > +*/ > + if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) { > + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) > + return -EINVAL; > + if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) > + return -EINVAL; > + } How does this error even get to userspace? Is it worth doing this switch on every execution? > + if (t->rseq_migrate > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) > + need_restart = true; > + else if (t->rseq_preempt > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) > + need_restart = true; > + else if (t->rseq_signal > + && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) > + need_restart = true; This could potentially be sped up by having the rseq_* fields in t use a single bitmask with the same bit offsets as RSEQ_CS_FLAG_NO_* then using bit operations to check the appropriate overlap.
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
- On Nov 14, 2017, at 3:39 PM, Ben Maurer bmau...@fb.com wrote: >> int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t >>sig); > > Really dumb question -- and one I'm sorry to bring up at the last minute. > Should > we consider making the syscall name something more generic "register_tls_abi"? > I'm assuming that if we ever want to use a per-thread userspace/kernel ABI > we'll want to use this field given the difficulty of getting adoption of > registration, the need to involve glibc, etc. It seems like there could be > future use cases of this TLS area that have nothing to do with rseq. I proposed that approach back in 2016 ("tls abi" system call), and the feedback I received back then is that it was preferred to have a dedicated "rseq" system call than an "open ended" and generic "tls abi" system call. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
> int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t
> sig);

Really dumb question -- and one I'm sorry to bring up at the last minute.
Should we consider making the syscall name something more generic
"register_tls_abi"? I'm assuming that if we ever want to use a per-thread
userspace/kernel ABI we'll want to use this field given the difficulty of
getting adoption of registration, the need to involve glibc, etc. It seems
like there could be future use cases of this TLS area that have nothing to do
with rseq.
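For readers outside the thread, here is a minimal user-space sketch of what
registration with the prototype quoted above looks like, and of the quick CPU
number read it enables. It assumes the uapi struct rseq from this series is
available as <linux/rseq.h> and that the syscall number is exported as
__NR_rseq; the signature value is an arbitrary example, and the raw syscall(2)
interface is used since there is no glibc wrapper:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>                 /* struct rseq from this patch series */

#define RSEQ_SIG        0x53053053      /* example signature, must match the abort handshake */

static __thread struct rseq rseq_abi;   /* one registration per thread */

int main(void)
{
        uint32_t cpu;

        memset(&rseq_abi, 0, sizeof(rseq_abi));

        /* Register this thread's rseq area: flags == 0 means "register". */
        if (syscall(__NR_rseq, &rseq_abi, sizeof(rseq_abi), 0, RSEQ_SIG)) {
                perror("rseq registration");
                return 1;
        }

        /* The kernel keeps cpu_id up to date; read it with a single-copy (volatile) load. */
        cpu = *(volatile uint32_t *)&rseq_abi.cpu_id;
        printf("running on cpu %u\n", cpu);
        return 0;
}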
[RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started by
Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical
sections. [1] [2] The re-implementation proposed here brings a few
simplifications to the ABI which facilitates porting to other architectures
and speeds up the user-space fast path. A second system call, cpu_opv(), is
proposed as fallback to deal with debugger single-stepping. cpu_opv()
executes a sequence of operations on behalf of user-space with preemption
disabled.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

            getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:             344.0                 31.4          11.0
x86-64:             15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
  per-cpu buffer

            getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:            2502.0               2250.0           1.1
x86-64:            117.4                 98.0           1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

            getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:             751.0                128.5           5.8
x86-64:             53.4                 28.6           1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq
2016 implementation): The production workload response-time has 1-2% gain
avg. latency, and the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done by
making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current
thread. Upon return to user-space, a notify-resume handler updates the
current CPU value within the registered user-space memory area. User-space
can then read the current CPU number directly from memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read the
current CPU number, which has the following benefits over alternative
approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso executing
  a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared between
  kernel and user-space is portable (e.g. ARM), which is not the case for
  the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is not
portable, and it is incompatible with existing applications already using
the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                        8.4 ns
- Read CPU from rseq cpu_id:                   16.7 ns
- Read CPU from rseq cpu_id (lazy register):   19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:               301.8 ns
- getcpu system call:                         234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                        0.8 ns
- Read CPU from rseq cpu_id:                    0.8 ns
- Read CPU from rseq cpu_id (lazy register):    0.8 ns
- Read using gs segment selector:               0.8 ns
- "lsl" inline assembly:                       13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                  16.6 ns
- getcpu system call:                          53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 10 seems to indicate, contrary to
expectations, that enabling CONFIG_R