Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> - On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:
>>
>> The use-case for the 4k memcpy operation is a per-cpu ring buffer where
>> the rseq fast-path does the following:
>>
>> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>>   given structure (limited to 4k in size) into a ring buffer, followed
>>   by the final commit instruction which increments the current position
>>   offset by the number of bytes pushed.
>>
>> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of a
>>   given structure (up to 4k) from the ring buffer, at "position" offset.
>>   The final commit instruction decrements the current position offset
>>   by the number of bytes popped.
>>
>> Having cpu_opv do a 4k memcpy allows it to handle scenarios where rseq
>> fails to progress.
>
> I'm still confused. Before you talked about the sequence:
>
> 1) Reserve
>
> 2) Commit
>
> and both use rseq, but obviously you cannot do two "atomic" operations
> in one section.
>
> So now you talk about push. Is that what you described earlier as
> commit?
>
> Assuming that it is, I still have no idea why the memcpy needs to happen
> in that preempt-disabled region.
>
> If you have space reserved, then the copy does not have any
> dependencies, neither on the cpu it runs on nor on anything else. So the
> only interesting operation is the final commit instruction which tells
> the consumer that it's ready.
>
> So what is the part I am missing here, aside of having difficulties to
> map the constantly changing names of this stuff?

Let's clear up some confusion: those are two different use-cases. The ring
buffer with reserve+commit is a FIFO ring buffer, and the ring buffer with
memcpy+position update is a LIFO stack. Let me explain the various
use-cases here, so we know what we're talking about.
rseq and cpu_opv use-cases

1) per-cpu spinlock

A per-cpu spinlock can be implemented as an rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed by
an acquire barrier after the control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does not
require rseq. The cpu_opv fallback requires a single-word comparison
(== 0) and a single-word store (1).

2) per-cpu statistics counters

Per-cpu statistics counters can be implemented as an rseq consisting of a
final "add" instruction on a word as commit. The cpu_opv fallback can be
implemented as an "add" operation. Besides statistics tracking, these
counters can be used to implement user-space RCU per-cpu grace period
tracking, for both single and multi-process user-space RCU.

3) per-cpu LIFO linked-list (unlimited-size stack)

A per-cpu LIFO linked-list has "push" and "pop" operations, which
respectively add an item to the list and remove an item from the list.

The "push" operation can be implemented as an rseq consisting of a word
comparison instruction against head followed by a word store (commit) to
head. Its cpu_opv fallback can be implemented as a word-compare followed
by a word-store as well.

The "pop" operation can be implemented as an rseq consisting of loading
head, comparing it against NULL, loading the next pointer at the right
offset within the head item, and storing the next pointer as the new
head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm: because
cpu_opv needs to know all pointers at system call entry so it can pin all
pages, it cannot simply load head and then load the head->next address
within the preempt-off critical section. User-space needs to pass the
head and head->next addresses to the kernel, and the kernel needs to
check that the head address is unchanged since it was loaded by
user-space.
However, when accessing head->next in an ABA situation, it's possible
that head is unchanged, but loading head->next results in a page fault
due to a concurrently freed head object. This is why the "expect_fault"
operation field is introduced: if a fault is triggered by this access,
-EAGAIN will be returned by cpu_opv rather than -EFAULT, thus indicating
that the operation vector should be attempted again.

The "pop" operation can thus be implemented as a word comparison of head
against the head loaded by user-space, followed by a load of the
head->next pointer (which may fault), and a store of that pointer as the
new head.

4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects by passing
pointers through a per-cpu fixed-sized stack. The "push" side can be
implemented with a check of the current offset against the maximum buffer
length, followed by an rseq consisting of a comparison of the previously
loaded offset against the current offset, a word "try store" operation
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 20, 2017, at 1:49 PM, Andi Kleen a...@firstfloor.org wrote:

>> Having cpu_opv do a 4k memcpy allows it to handle scenarios where rseq
>> fails to progress.
>
> If anybody ever gets that right. It will be really hard to just test
> such a path.
>
> It also seems fairly theoretical to me. Do you even have a test case
> where the normal path stops making forward progress?

We expect the following loop to progress, typically after a single
iteration:

do {
	cpu = rseq_cpu_start();
	ret = rseq_addv(&v, 1, cpu);
	attempts++;
} while (ret);

Now run this in gdb, break on "main", run, and single-step execution with
"next": the program is stuck in an infinite loop. What solution do you
have in mind to handle this kind of scenario without breaking
pre-existing debuggers?

Looking at vDSO examples of vgetcpu and vclock_gettime under gdb 7.7.1
(Debian) with glibc 2.19:

- sched_getcpu under source-line single-stepping with "step" seems to
  only see the ../sysdeps/unix/sysv/linux/x86_64/sched_getcpu.S source
  lines, which makes it skip single-stepping of the vDSO.

- sched_getcpu under "stepi": it does go through the vDSO instruction
  addresses. It does progress, given that there is no loop there.

- clock_gettime under "step": it only sees source lines of
  ../sysdeps/unix/clock_gettime.c.

- clock_gettime under "stepi": it's stuck in an infinite loop.

So instruction-level stepping from gdb turns the clock_gettime vDSO into
a never-ending loop, which is already bad. But with rseq, the situation
is even worse, because it turns source-line-level single-stepping into
infinite loops.

My understanding from https://sourceware.org/bugzilla/show_bug.cgi?id=14466
is that GDB currently simply removes the vDSO from its list of library
mappings, which is probably why it skips over the vDSO in the source-line
single-stepping case. We cannot do that with rseq, because we _want_ the
rseq critical section to be inlined into the application or library.
A function call costs more than most rseq critical sections. I plan to
have the rseq user-space code provide a "__rseq_table" section so
debuggers can eventually figure out that they need to skip over rseq
critical sections. However, that does not help pre-existing debuggers,
whose single-stepping will start turning perfectly working programs into
never-ending loops simply by having glibc use rseq for memory allocation.

Using the cpu_opv system call on rseq failure solves this problem
entirely. I would even go further and recommend taking a similar approach
when lack of progress is detected in a vDSO, and invoking the equivalent
system call. The current implementation of the clock_gettime() vDSO turns
instruction-level single-stepping into never-ending loops, which is far
from elegant.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:

> - On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:
>
> The use-case for the 4k memcpy operation is a per-cpu ring buffer where
> the rseq fast-path does the following:
>
> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>   given structure (limited to 4k in size) into a ring buffer, followed
>   by the final commit instruction which increments the current position
>   offset by the number of bytes pushed.
>
> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of a
>   given structure (up to 4k) from the ring buffer, at "position" offset.
>   The final commit instruction decrements the current position offset by
>   the number of bytes popped.
>
> Having cpu_opv do a 4k memcpy allows it to handle scenarios where rseq
> fails to progress.

I'm still confused. Before you talked about the sequence:

1) Reserve

2) Commit

and both use rseq, but obviously you cannot do two "atomic" operations in
one section.

So now you talk about push. Is that what you described earlier as commit?

Assuming that it is, I still have no idea why the memcpy needs to happen
in that preempt-disabled region.

If you have space reserved, then the copy does not have any dependencies,
neither on the cpu it runs on nor on anything else. So the only
interesting operation is the final commit instruction which tells the
consumer that it's ready.

So what is the part I am missing here, aside of having difficulties to
map the constantly changing names of this stuff?

Thanks,

	tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
> Having cpu_opv do a 4k memcpy allows it to handle scenarios where rseq
> fails to progress.

If anybody ever gets that right. It will be really hard to just test such
a path.

It also seems fairly theoretical to me. Do you even have a test case
where the normal path stops making forward progress?

-Andi
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 20, 2017, at 1:03 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Thomas Gleixner wrote:
>> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>>>>> + * The reason why we require all pointer offsets to be calculated by
>>>>> + * user-space beforehand is because we need to use get_user_pages_fast()
>>>>> + * to first pin all pages touched by each operation. This takes care of
>>>>
>>>> That doesn't explain it either.
>>>
>>> What kind of explanation are you looking for here? Perhaps being too
>>> close to the implementation prevents me from understanding what is
>>> unclear from your perspective.
>>
>> What the heck are pointer offsets?
>>
>> The ops have one or two pointer(s) to a lump of memory. So if a pointer
>> points to the wrong lump of memory then you're screwed, but that's true
>> for all pointers handed to the kernel.
>
> So I think you mix here the 'user space programmer guide - how to set up
> the magic ops' into the kernel side documentation. The kernel side does
> not care about the pointers, as long as they are valid to access.
>
> Again. Try to explain things at the conceptual level.

Yes, I took the sentence you suggested for that reason: it removes usage
details that are meant for the user-space implementer, which do not
belong in those comments.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> - On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote:
>>>> +#define NR_PINNED_PAGES_ON_STACK 8
>>>
>>> 8 pinned pages on stack? Which stack?
>>
>> The common cases need to touch few pages, and we can keep the pointers
>> in an array on the kernel stack within the cpu_opv system call.
>>
>> Updating to:
>>
>> /*
>>  * Typical invocations of cpu_opv need few pages. Keep struct page
>>  * pointers in an array on the stack of the cpu_opv system call up to
>>  * this limit, beyond which the array is dynamically allocated.
>>  */
>> #define NR_PIN_PAGES_ON_STACK 8
>
> That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately
> obvious.

Fixed.

>>>> + * The operations available are: comparison, memcpy, add, or, and,
>>>> + * xor, left shift, and right shift. The system call receives a CPU
>>>> + * number from user-space as argument, which is the CPU on which
>>>> + * those operations need to be performed. All preparation steps such
>>>> + * as loading pointers, and applying offsets to arrays, need to be
>>>> + * performed by user-space before invoking the system call. The
>>>
>>> loading pointers and applying offsets? That makes no sense.
>>
>> Updating to:
>>
>> * All preparation steps such as loading base pointers, and adding
>> * offsets derived from the current CPU number, need to be performed by
>> * user-space before invoking the system call.
>
> This still does not explain anything, really.
>
> Which base pointer is loaded? I nowhere see a reference to a base
> pointer.
>
> And what are the offsets about?
>
> Derived from the current cpu number? What is the current CPU number? The
> one on which the task executes now, or the one which it should execute
> on?
> I assume what you want to say is:
>
>   All pointers in the ops must have been set up to point to the per CPU
>   memory of the CPU on which the operations should be executed.
>
> At least that's what I oracle into that.

Exactly that. Will update to use this description instead.

>>>> + * "comparison" operation can be used to check that the data used in
>>>> + * the preparation step did not change between preparation of system
>>>> + * call inputs and operation execution within the preempt-off
>>>> + * critical section.
>>>> + *
>>>> + * The reason why we require all pointer offsets to be calculated by
>>>> + * user-space beforehand is because we need to use get_user_pages_fast()
>>>> + * to first pin all pages touched by each operation. This takes care of
>>>
>>> That doesn't explain it either.
>>
>> What kind of explanation are you looking for here? Perhaps being too
>> close to the implementation prevents me from understanding what is
>> unclear from your perspective.
>
> What the heck are pointer offsets?
>
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true
> for all pointers handed to the kernel.

I think the sentence you suggested above is clear enough. I'll simply use
it.

>> Sorry, that paragraph was unclear. Updated:
>>
>> * An overall maximum of 4216 bytes is enforced on the sum of operation
>> * lengths within an operation vector, so user-space cannot generate a
>> * too long preempt-off critical section (cache-cold critical section
>> * duration measured as 4.7µs on x86-64). Each operation is also limited
>> * to a length of PAGE_SIZE bytes,
>
> Again, PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you
> want is a hard limit of 4K. And because there is no alignment
> requirement, the rest of the sentence is stating the obvious.

I can make that a 4K limit if you prefer.
This presumes that no architecture has pages smaller than 4K, which is
true on Linux.

>> * meaning that an operation can touch a maximum of 4 pages (memcpy: 2
>> * pages for source, 2 pages for destination if addresses are not
>> * aligned on page boundaries).
>
> I still have to understand why the 4K copy is necessary in the first
> place.
>
>>> What's the critical section duration for operations which go to the
>>> limits of this on an average x86-64 machine?
>>
>> When cache-cold, I measure 4.7 µs per critical section doing a 4k
>> memcpy and 15 * 8 bytes memcpy on a E5-2630 v3 @2.4GHz. Is it an
>> acceptable preempt-off latency for RT?
>
> Depends on the use case, as always.

The use-case for the 4k memcpy operation is a per-cpu ring buffer where
the rseq fast-path does the following:

- ring buffer push: in the rseq asm instruction sequence, a memcpy of a
  given structure (limited to 4k in size) into a ring buffer, followed by
  the final commit instruction which increments the current position
  offset by the number of bytes pushed.

- ring buffer pop: in the rseq asm instruction sequence, a memcpy of a
  given structure (up to 4k)
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Mon, 20 Nov 2017, Thomas Gleixner wrote: > On Mon, 20 Nov 2017, Mathieu Desnoyers wrote: > > >> + * The reason why we require all pointer offsets to be calculated by > > >> + * user-space beforehand is because we need to use get_user_pages_fast() > > >> + * to first pin all pages touched by each operation. This takes care of > > > > > > That doesnt explain it either. > > > > What kind of explication are you looking for here ? Perhaps being too close > > to the implementation prevents me from understanding what is unclear from > > your perspective. > > What the heck are pointer offsets? > > The ops have one or two pointer(s) to a lump of memory. So if a pointer > points to the wrong lump of memory then you're screwed, but that's true for > all pointers handed to the kernel. So I think you mix here the 'user space programmer guide - how to set up the magic ops - into the kernel side documentation. The kernel side does not care about the pointers, as long as they are valid to access. Again. Try to explain things at the conceptual level. Thanks, tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Mon, 20 Nov 2017, Mathieu Desnoyers wrote: > - On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote: > >> +#define NR_PINNED_PAGES_ON_STACK 8 > > > > 8 pinned pages on stack? Which stack? > > The common cases need to touch few pages, and we can keep the > pointers in an array on the kernel stack within the cpu_opv system > call. > > Updating to: > > /* > * Typical invocation of cpu_opv needs few pages. Keep struct page > * pointers in an array on the stack of the cpu_opv system call up to > * this limit, beyond which the array is dynamically allocated. > */ > #define NR_PIN_PAGES_ON_STACK 8 That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious. > >> + * The operations available are: comparison, memcpy, add, or, and, xor, > >> + * left shift, and right shift. The system call receives a CPU number > >> + * from user-space as argument, which is the CPU on which those > >> + * operations need to be performed. All preparation steps such as > >> + * loading pointers, and applying offsets to arrays, need to be > >> + * performed by user-space before invoking the system call. The > > > > loading pointers and applying offsets? That makes no sense. > > Updating to: > > * All preparation steps such as > * loading base pointers, and adding offsets derived from the current > * CPU number, need to be performed by user-space before invoking the > * system call. This still does not explain anything, really. Which base pointer is loaded? I nowhere see a reference to a base pointer. And what are the offsets about? derived from current cpu number? What is current CPU number? The one on which the task executes now or the one which it should execute on? I assume what you want to say is: All pointers in the ops must have been set up to point to the per CPU memory of the CPU on which the operations should be executed. At least that's what I oracle in to that.
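A minimal sketch of that preparation step, assuming a flat per-CPU array (the struct name, layout and stride below are invented for illustration): user-space resolves the final per-CPU pointer before the syscall, so the kernel only ever sees pointers it can pin and dereference directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cpu data layout: one padded slot per CPU. The
 * "preparation step" is simply base + cpu; the resulting pointer is
 * what would be stored into the op vector handed to cpu_opv. */
struct percpu_slot {
	uint32_t word;		/* e.g. the per-cpu spinlock word */
	char pad[60];		/* keep slots on separate cache lines */
};

static uint32_t *prepare_percpu_ptr(struct percpu_slot *base, int cpu)
{
	return &base[cpu].word;	/* base + cpu * sizeof(struct percpu_slot) */
}
```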
> >> + * "comparison" operation can be used to check that the data used in the > >> + * preparation step did not change between preparation of system call > >> + * inputs and operation execution within the preempt-off critical > >> + * section. > >> + * > >> + * The reason why we require all pointer offsets to be calculated by > >> + * user-space beforehand is because we need to use get_user_pages_fast() > >> + * to first pin all pages touched by each operation. This takes care of > > > > That doesnt explain it either. > > What kind of explication are you looking for here ? Perhaps being too close > to the implementation prevents me from understanding what is unclear from > your perspective. What the heck are pointer offsets? The ops have one or two pointer(s) to a lump of memory. So if a pointer points to the wrong lump of memory then you're screwed, but that's true for all pointers handed to the kernel. > Sorry, that paragraph was unclear. Updated: > > * An overall maximum of 4216 bytes is enforced on the sum of operation > * length within an operation vector, so user-space cannot generate a > * too long preempt-off critical section (cache cold critical section > * duration measured as 4.7µs on x86-64). Each operation is also limited > * to a length of PAGE_SIZE bytes, Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want is a hard limit of 4K. And because there is no alignment requirement the rest of the sentence is stating the obvious. > * meaning that an operation can touch a > * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for > * destination if addresses are not aligned on page boundaries). I still have to understand why the 4K copy is necessary in the first place. > > What's the critical section duration for operations which go to the limits > > of this on a average x86 64 machine? > > When cache-cold, I measure 4.7 µs per critical section doing a > 4k memcpy and 15 * 8 bytes memcpy on a E5-2630 v3 @2.4GHz.
Is it an > acceptable preempt-off latency for RT ? Depends on the use case as always Thanks, tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 17, 2017, at 3:22 PM, Thomas Gleixner t...@linutronix.de wrote: > On Fri, 17 Nov 2017, Mathieu Desnoyers wrote: >> - On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote: >> 7) Allow libraries with multi-part algorithms to work on same per-cpu >> data without affecting the allowed cpu mask >> >> I stumbled on an interesting use-case within the lttng-ust tracer >> per-cpu buffers: the algorithm needs to update a "reserve" counter, >> serialize data into the buffer, and then update a "commit" counter >> _on the same per-cpu buffer_. My goal is to use rseq for both reserve >> and commit. >> >> Clearly, if rseq reserve fails, the algorithm can retry on a different >> per-cpu buffer. However, it's not that easy for the commit. It needs to >> be performed on the same per-cpu buffer as the reserve. >> >> The cpu_opv system call solves that problem by receiving the cpu number >> on which the operation needs to be performed as argument. It can push >> the task to the right CPU if needed, and perform the operations there >> with preemption disabled. > > If your transaction cannot be done in one go, then abusing that byte code > interpreter for concluding it is just hilarious. That whole exercise is a > gazillion times slower than the atomic operations which are necessary to do > it without all that. > > I'm even more convinced now that this is overengineered beyond repair. The fast-path (typical case) will execute on the right CPU, and rseq will do both the reserve and the commit, and that is faster than atomic ops. However, we need to handle migration between reserve and commit. Unfortunately, concurrent rseq and atomic ops don't mix well on the same per-cpu data, so we cannot fall back on atomic ops, except in very specific cases where we can use a split-counter strategy. So cpu_opv handles migration for this use-case by ensuring the slow-path is performed with preemption-off after migrating the current task to the right CPU.
Thanks, Mathieu > > Thanks, > > tglx -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
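A compact model of the reserve/commit scheme described above (field names are invented; the real lttng-ust ring buffer is considerably more involved): reserve picks the write position on the reserving CPU's buffer, and commit must later hit that same buffer, which is what cpu_opv's cpu-number argument guarantees when the task has migrated in between.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two-counter per-cpu ring buffer model: data between "commit" and
 * "reserve" is reserved but not yet published. Both counters belong to
 * ONE buffer, so the commit cannot be retried on another CPU's buffer. */
struct ring {
	uint64_t reserve;	/* bytes reserved so far */
	uint64_t commit;	/* bytes committed so far */
	char data[4096];
};

static int ring_reserve(struct ring *r, uint64_t len, uint64_t *slot)
{
	if (r->reserve + len > sizeof(r->data))
		return -1;		/* may retry on another cpu's buffer */
	*slot = r->reserve;
	r->reserve += len;		/* rseq fast path commits this */
	return 0;
}

static void ring_commit(struct ring *r, uint64_t len)
{
	/* must target the buffer used at reserve time, hence cpu_opv's
	 * cpu argument when the task migrated after the reserve */
	r->commit += len;
}
```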
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote: > On Tue, 14 Nov 2017, Mathieu Desnoyers wrote: >> +#ifdef __KERNEL__ >> +# include >> +#else /* #ifdef __KERNEL__ */ > > Sigh. fixed. > >> +# include >> +#endif /* #else #ifdef __KERNEL__ */ >> + >> +#include >> + >> +#ifdef __LP64__ >> +# define CPU_OP_FIELD_u32_u64(field) uint64_t field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = >> (intptr_t)v >> +#elif defined(__BYTE_ORDER) ? \ >> +__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, >> field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> +field ## _padding = 0, field = (intptr_t)v >> +#else >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## >> _padding >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> +field = (intptr_t)v, field ## _padding = 0 >> +#endif > > So in the rseq patch you have: > > +#ifdef __LP64__ > +# define RSEQ_FIELD_u32_u64(field) uint64_t field > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v > +#elif defined(__BYTE_ORDER) ? \ > + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) > +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field ## _padding = 0, field = (intptr_t)v > +#else > +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding > +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field = (intptr_t)v, field ## _padding = 0 > +#endif > > IOW the same macro maze. Please use a separate header file which provides > these macros once and share them between the two facilities. ok. I'll introduce uapi/linux/types_32_64.h, and prefix defines with "LINUX_": LINUX_FIELD_u32_u64() unless other names are preferred. It will be in a separate patch. > >> +#define CPU_OP_VEC_LEN_MAX 16 >> +#define CPU_OP_ARG_LEN_MAX 24 >> +/* Max. data len per operation. 
*/ >> +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE > > That's something between 4K and 256K depending on the architecture. > > You really want to allow up to 256K data copy with preemption disabled? > Shudder. This is the max per operation. Following peterz' comments at KS, I added a limit of 4096 + 15 * 8 on the sum of len for all operations in a vector. This is defined below as CPU_OP_VEC_DATA_LEN_MAX. So each of the 16 op cannot have a len larger than PAGE_SIZE, so we pin at most 4 pages per op (e.g. memcpy 2 pages for src, 2 pages for dst), *and* the sum of all ops len needs to be <= 4216. So the max limit you are interested in here is the 4216 bytes limit. > >> +/* >> + * Max. data len for overall vector. We to restrict the amount of > > We to ? fixed > >> + * user-space data touched by the kernel in non-preemptible context so >> + * we do not introduce long scheduler latencies. >> + * This allows one copy of up to 4096 bytes, and 15 operations touching >> + * 8 bytes each. >> + * This limit is applied to the sum of length specified for all >> + * operations in a vector. >> + */ >> +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8) > > Magic numbers. Please use proper defines for heavens sake. ok, it becomes: #define CPU_OP_MEMCPY_EXPECT_LEN 4096 #define CPU_OP_EXPECT_LEN 8 #define CPU_OP_VEC_DATA_LEN_MAX \ (CPU_OP_MEMCPY_EXPECT_LEN + \ (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN) > >> +#define CPU_OP_MAX_PAGES 4 /* Max. pages per op. */ Actually, I'll move the CPU_OP_MAX_PAGES define to cpu_opv.c. It's not needed in the uapi header. >> + >> +enum cpu_op_type { >> +CPU_COMPARE_EQ_OP, /* compare */ >> +CPU_COMPARE_NE_OP, /* compare */ >> +CPU_MEMCPY_OP, /* memcpy */ >> +CPU_ADD_OP, /* arithmetic */ >> +CPU_OR_OP, /* bitwise */ >> +CPU_AND_OP, /* bitwise */ >> +CPU_XOR_OP, /* bitwise */ >> +CPU_LSHIFT_OP, /* shift */ >> +CPU_RSHIFT_OP, /* shift */ >> +CPU_MB_OP, /* memory barrier */ >> +}; >> + >> +/* Vector of operations to perform. Limited to 16. 
*/ >> +struct cpu_op { >> +int32_t op; /* enum cpu_op_type. */ >> +uint32_t len; /* data length, in bytes. */ > > Please get rid of these tail comments ok > >> +union { >> +#define TMP_BUFLEN 64 >> +#define NR_PINNED_PAGES_ON_STACK 8 > > 8 pinned pages on stack? Which stack? The common cases need to touch few pages, and we can keep the pointers in an array on the kernel stack within the cpu_opv system call. Updating to: /* * Typical invocation of cpu_opv needs few pages. Keep struct page * pointers in an array on the stack of the cpu_opv system call up to * this limit, beyond which the array is dynamically allocated. */ #define NR_PIN_PAGES_ON_STACK 8
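To make the semantics of such an op vector concrete, here is a tiny user-space model of the interpreter. It is not the kernel implementation and not the real uapi field layout (the struct fields below are invented): operations execute in order, and a failed comparison aborts the rest of the vector, which is what lets user-space detect that data from its preparation step changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of a cpu_opv operation vector. */
enum op_type { OP_COMPARE_EQ, OP_MEMCPY, OP_ADD };

struct op {
	enum op_type type;
	void *dst;
	const void *src;	/* for OP_COMPARE_EQ: expected value */
	size_t len;
	int64_t count;		/* for OP_ADD */
};

/* Returns 0 on success, 1 if a comparison failed (vector aborted). */
static int opv_exec(const struct op *ops, int nr_ops)
{
	for (int i = 0; i < nr_ops; i++) {
		const struct op *op = &ops[i];
		switch (op->type) {
		case OP_COMPARE_EQ:
			if (memcmp(op->dst, op->src, op->len))
				return 1;	/* data changed: abort */
			break;
		case OP_MEMCPY:
			memcpy(op->dst, op->src, op->len);
			break;
		case OP_ADD:
			*(int64_t *)op->dst += op->count;
			break;
		}
	}
	return 0;
}
```

A per-cpu counter increment, for instance, is a two-op vector: compare the expected value, then add; if another thread changed the counter since the preparation step, the compare fails and user-space retries.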
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Fri, Nov 17, 2017 at 12:07 PM, Thomas Gleixner wrote: > On Fri, 17 Nov 2017, Andi Kleen wrote: >> > The most straight forward is to have a mechanism which forces everything >> > into the slow path in case of debugging, lack of progress, etc. The slow >> >> That's the abort address, right? > > Yes. > >> For the generic case the fall back path would require disabling preemption >> unfortunately, for which we don't have a mechanism in user space. >> >> I think that is what Mathieu tried to implement here with this call. > > Yes. preempt disabled execution of byte code to make sure that the > transaction succeeds. > > But, why is disabling preemption mandatory? If stuff fails due to hitting a > breakpoint or because it retried a gazillion times without progress, then > the abort code can detect that and act accordingly. Pseudo code: > > abort: > if (!slowpath_required() && > !breakpoint_caused_abort() && > !stall_detected()) { > do_the_normal_abort_postprocessing(); > goto retry; > } > > lock(slowpath_lock[cpu]); > > if (!slowpath_required()) { > unlock(slowpath_lock[cpu]); > goto retry; > } > > if (rseq_supported) > set_slow_path(); > > /* Same code as inside the actual rseq */ > do_transaction(); > > if (rseq_supported) > unset_slow_path(); > > unlock(slowpath_lock[cpu]); My objection to this approach is that people will get it wrong and not notice until it's too late. TSX has two things going for it: 1. It's part of the ISA, so debuggers have very well-defined semantics to deal with and debuggers will know about it. rseq is a made-up Linux thing and debuggers may not know what to do with it. 2. TSX is slow and crappy, so it may not be that widely used. glibc, OTOH, will probably start using rseq on all machines if the patches are merged.
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Fri, 17 Nov 2017, Mathieu Desnoyers wrote: > - On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote: > 7) Allow libraries with multi-part algorithms to work on same per-cpu > data without affecting the allowed cpu mask > > I stumbled on an interesting use-case within the lttng-ust tracer > per-cpu buffers: the algorithm needs to update a "reserve" counter, > serialize data into the buffer, and then update a "commit" counter > _on the same per-cpu buffer_. My goal is to use rseq for both reserve > and commit. > > Clearly, if rseq reserve fails, the algorithm can retry on a different > per-cpu buffer. However, it's not that easy for the commit. It needs to > be performed on the same per-cpu buffer as the reserve. > > The cpu_opv system call solves that problem by receiving the cpu number > on which the operation needs to be performed as argument. It can push > the task to the right CPU if needed, and perform the operations there > with preemption disabled. If your transaction cannot be done in one go, then abusing that byte code interpreter for concluding it is just hilarious. That whole exercise is a gazillion times slower than the atomic operations which are necessary to do it without all that. I'm even more convinced now that this is overengineered beyond repair. Thanks, tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Fri, 17 Nov 2017, Andi Kleen wrote:
> > The most straight forward is to have a mechanism which forces everything
> > into the slow path in case of debugging, lack of progress, etc. The slow
>
> That's the abort address, right?

Yes.

> For the generic case the fall back path would require disabling preemption
> unfortunately, for which we don't have a mechanism in user space.
>
> I think that is what Mathieu tried to implement here with this call.

Yes, preempt-disabled execution of byte code to make sure that the
transaction succeeds.

But why is disabling preemption mandatory? If stuff fails due to hitting a
breakpoint, or because it retried a gazillion times without progress, then
the abort code can detect that and act accordingly. Pseudo code:

abort:
	if (!slowpath_required() &&
	    !breakpoint_caused_abort() &&
	    !stall_detected()) {
		do_the_normal_abort_postprocessing();
		goto retry;
	}

	lock(slowpath_lock[cpu]);

	if (!slowpath_required()) {
		unlock(slowpath_lock[cpu]);
		goto retry;
	}

	if (rseq_supported)
		set_slow_path();

	/* Same code as inside the actual rseq */
	do_transaction();

	if (rseq_supported)
		unset_slow_path();

	unlock(slowpath_lock[cpu]);

The only interesting question is how to make sure that all threads on that
CPU see that the slow path is required before they execute the commit, so
they are forced into the slow path. The simplest thing would be atomics,
but that's what rseq wants to avoid. I think this can be solved cleanly
with the help of the membarrier syscall, or some variant of it, without all
that 'yet another byte code interpreter' mess.

The other question is whether do_transaction() is required to run on that
specific CPU. I don't think so, because that magic interpreter operates
even when the required target CPU is offline, and with locking in place
there is no reason why running on the target CPU would be required.

Sure, that's going to affect performance, but only for two cases:

 1) Debugging. That's completely uninteresting.

 2) No progress at all. Performance is down the drain anyway, so it does
    not matter at all whether you spend a few more cycles or not to
    resolve that.

I might be missing something as usual :)

Thanks,

	tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
> The most straight forward is to have a mechanism which forces everything
> into the slow path in case of debugging, lack of progress, etc. The slow

That's the abort address, right?

For the generic case the fall back path would require disabling preemption,
unfortunately, for which we don't have a mechanism in user space.

I think that is what Mathieu tried to implement here with this call.

There may be some special cases where it's possible without preemption
control, e.g. a malloc could just not use the per-cpu cache. But I doubt
that is possible in all cases that Mathieu envisions. It would again be a
different code path, and I question the need for it when we can just let
the operator of the debugger deal with it.

-Andi
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Fri, 17 Nov 2017, Andi Kleen wrote:
> > 1) Handling single-stepping from tools
> >
> > Tools like debuggers, and simulators like record-replay ("rr") use
> > single-stepping to run through existing programs. If core libraries start
>
> No, rr doesn't use single stepping. It uses branch stepping based on the
> PMU, and those should only happen on external events or syscalls, which
> would abort the rseq anyway.
>
> Eventually it will succeed, because not every retry will hit a new
> signal. If you put a syscall into your rseq you will just get what you
> deserve.
>
> If it was single stepping it couldn't execute the vDSO (and it would be
> incredibly slow).
>
> Yes, debuggers have to skip instead of step, but that's easily done (and
> needed today already for every gettimeofday, which tends to be the most
> common syscall...)

The same problem exists with TSX. Hitting a break point inside a
transactional section triggers an abort with the cause bit 'breakpoint'
set. rseq can be looked at as a variant of software transactional memory
with a limited feature set, but the underlying problems are exactly the
same: breakpoints are not working either.

> > to use restartable sequences for e.g. memory allocation, this means
> > pre-existing programs cannot be single-stepped, simply because the
> > underlying glibc or jemalloc has changed.
>
> But how likely is it that the fall back path even works?
>
> It would never be exercised in normal operation, so it would be a prime
> target for bit rot, or for not ever being tested and being broken in the
> first place.

What's worse is that the fall back path cannot be debugged at all. It's a
magic byte code which is interpreted inside the kernel.

I really do not understand why this does not utilize existing design
patterns well known from transactional memory.

The most straightforward is to have a mechanism which forces everything
into the slow path in case of debugging, lack of progress, etc. The slow
path uses traditional locking to resolve the situation. That's well known
to work, and if done correctly the only difference between slow path and
fast path is locking/transaction control, i.e. it's a single implementation
of the actual memory operations which can be single stepped, debug traced,
and whatever.

It solves _ALL_ of the problems you describe, including support for systems
which do not support rseq at all.

This syscall is horribly overengineered and creates more problems than it
solves.

Thanks,

	tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
Thanks for the detailed write up. That should have been in the
changelog... Some comments below. Overall I think the case for the syscall
is still very weak.

> Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
> changelog and documentation to include that information.
>
> 1) Handling single-stepping from tools
>
> Tools like debuggers, and simulators like record-replay ("rr") use
> single-stepping to run through existing programs. If core libraries start

No, rr doesn't use single stepping. It uses branch stepping based on the
PMU, and those should only happen on external events or syscalls, which
would abort the rseq anyway.

Eventually it will succeed, because not every retry will hit a new
signal. If you put a syscall into your rseq you will just get what you
deserve.

If it was single stepping it couldn't execute the vDSO (and it would be
incredibly slow).

Yes, debuggers have to skip instead of step, but that's easily done (and
needed today already for every gettimeofday, which tends to be the most
common syscall...)

> to use restartable sequences for e.g. memory allocation, this means
> pre-existing programs cannot be single-stepped, simply because the
> underlying glibc or jemalloc has changed.

But how likely is it that the fall back path even works?

It would never be exercised in normal operation, so it would be a prime
target for bit rot, or for not ever being tested and being broken in the
first place.

> Having a performance-related library improvement break tooling is likely to
> cause a big push-back against wide adoption of rseq. *I* would not even
> be using rseq in liburcu and lttng-ust until gdb gets updated in every
> distributions that my users depend on. This will likely happen... never.

I suspect your scheme already has a <50% likelihood of working, due to the
above, so it's equivalent.

> 2) Forward-progress guarantee
>
> Having a piece of user-space code that stops progressing due to
> external conditions is pretty bad. We are used to think of fast-path and
> slow-path (e.g. for locking), where the contended vs uncontended cases
> have different performance characteristics, but each need to provide some
> level of progress guarantees.

We already have that in the vDSO. Has never been a problem.

> 3) Handling page faults
>
> If we get creative enough, it's pretty easy to come up with corner-case
> scenarios where rseq does not progress without the help from cpu_opv. For
> instance, a system with swap enabled which is under high memory pressure
> could trigger page faults at pretty much every rseq attempt. I recognize
> that this scenario is extremely unlikely, but I'm not comfortable making
> rseq the weak link of the chain here.

Seems very unlikely. But if this happens the program is dead anyway, so it
doesn't make much difference.

> The main difference between LL/SC and rseq is that debuggers had
> to support single-stepping through LL/SC critical sections from the
> get go in order to support a given architecture. For rseq, we're
> adding critical sections into pre-existing applications/libraries,
> so the user expectation is that tools don't break due to a library
> optimization.

I would argue that debugging some path other than the one normally executed
is wrong by definition. How would you find the bug if it is in the original
path only, but not in the fallback? The whole point of debugging is to
punch through abstractions, but you're adding another layer of obfuscation
here. And worse, you have no guarantee that the new layer is actually
functionally equivalent.

Having less magic, and just assuming the user can do the right thing, seems
like a far more practical scheme.

> In order to facilitate integration of rseq into user-space, cpu_opv
> can provide a (relatively slower) architecture-agnostic implementation
> of rseq. This means that user-space code can be ported to all
> architectures through use of cpu_opv initially, and have the fast-path
> use rseq whenever the asm code is implemented.

While that's in theory correct, in practice it will be so slow that it is
useless. Nobody will want a system call in their malloc fast path.

-Andi
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote:

> On Thu, 16 Nov 2017, Andi Kleen wrote:
>> My preference would be just to drop this new super ugly system call.
>>
>> It's also not just the ugliness, but the very large attack surface
>> that worries me here.
>>
>> As far as I know it is only needed to support single stepping, correct?
>
> I can't figure that out because the changelog describes only WHAT the patch
> does and not WHY. Useful, isn't it?
>
>> Then this whole mess would disappear.
>
> Agreed. That would be much appreciated.

Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
changelog and documentation to include that information.

1) Handling single-stepping from tools

Tools like debuggers, and simulators like record-replay ("rr"), use
single-stepping to run through existing programs. If core libraries start
to use restartable sequences for e.g. memory allocation, this means
pre-existing programs cannot be single-stepped, simply because the
underlying glibc or jemalloc has changed.

The rseq user-space code does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they want.
However, this requires upgrading tools, and it still breaks single-stepping
in cases where glibc or jemalloc is updated but the tooling is not.

Having a performance-related library improvement break tooling is likely to
cause a big push-back against wide adoption of rseq. *I* would not even be
using rseq in liburcu and lttng-ust until gdb gets updated in every
distribution that my users depend on. This will likely happen... never.

2) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to external
conditions is pretty bad. We are used to thinking in terms of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases have
different performance characteristics, but each needs to provide some level
of progress guarantee.

I'm very concerned about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for trouble
when real life happens: page faults, uprobes, and other unforeseen
conditions that would seldom cause a rseq fast-path to never progress.

3) Handling page faults

If we get creative enough, it's pretty easy to come up with corner-case
scenarios where rseq does not progress without the help of cpu_opv. For
instance, a system with swap enabled which is under high memory pressure
could trigger page faults at pretty much every rseq attempt. I recognize
that this scenario is extremely unlikely, but I'm not comfortable making
rseq the weak link of the chain here.

4) Comparison with LL/SC

Anyone versed in the load-link/store-conditional instructions of RISC
architectures will notice the similarity between rseq and LL/SC critical
sections. The comparison can even be pushed further: since debuggers can
handle those LL/SC critical sections, they should be able to handle rseq
critical sections in the same way.

First, the way gdb recognises LL/SC critical-section patterns is very
fragile: it's limited to specific common patterns, and will miss the
pattern in all other cases. But fear not, having the rseq critical sections
expose a __rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had to support
single-stepping through LL/SC critical sections from the get-go in order to
support a given architecture. For rseq, we're adding critical sections into
pre-existing applications/libraries, so the user expectation is that tools
don't break due to a library optimization.

5) Perform maintenance operations on per-cpu data

rseq critical sections are quite limited feature-wise: they need to end
with a *single* commit instruction that updates a memory location. On the
other hand, the cpu_opv system call can combine a sequence of operations
that need to be executed with preemption disabled. While slower than rseq,
this allows for more complex maintenance operations to be performed on
per-cpu data concurrently with rseq fast-paths, in cases where it's not
possible to map those sequences of ops to a rseq.

6) Use cpu_opv as a generic implementation for architectures not
implementing rseq assembly code

rseq critical sections require architecture-specific user-space code to be
crafted in order to port an algorithm to a given architecture. In addition,
they require that the kernel architecture implementation adds hooks into
signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv can
provide a (relatively slower) architecture-agnostic implementation of
rseq. This means that user-space code can be ported to all architectures
through use of cpu_opv initially, and have the fast-path use rseq whenever
the asm code is implemented.

7) Allow libraries with multi-part algorithms to work on the same per-cpu
data without affecting the allowed cpu mask

I stumbled on
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Thu, 16 Nov 2017, Andi Kleen wrote:
> My preference would be just to drop this new super ugly system call.
>
> It's also not just the ugliness, but the very large attack surface
> that worries me here.
>
> As far as I know it is only needed to support single stepping, correct?

I can't figure that out because the changelog describes only WHAT the patch
does and not WHY. Useful, isn't it?

> Then this whole mess would disappear.

Agreed. That would be much appreciated.

Thanks,

	tglx
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
My preference would be just to drop this new super ugly system call.

It's also not just the ugliness, but the very large attack surface that
worries me here.

As far as I know it is only needed to support single stepping, correct?

We already have other code that cannot be single stepped, most prominently
the ring-3 vDSO time functions that rely on seq locks. The code that
actually cannot be single stepped is very small, only a few lines. As far
as I know this was never a significant problem for anybody, and there is
always a simple workaround (set a temporary break point after it and
continue).

The same should apply to the new rseq regions. They should all be tiny, and
we should just accept that they cannot be single-stepped, only skipped.

Then this whole mess would disappear.

-Andi
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
On Tue, 14 Nov 2017, Mathieu Desnoyers wrote: > +#ifdef __KERNEL__ > +# include > +#else/* #ifdef __KERNEL__ */ Sigh. > +# include > +#endif /* #else #ifdef __KERNEL__ */ > + > +#include > + > +#ifdef __LP64__ > +# define CPU_OP_FIELD_u32_u64(field) uint64_t field > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v > +#elif defined(__BYTE_ORDER) ? \ > + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) > +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, field > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field ## _padding = 0, field = (intptr_t)v > +#else > +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## _padding > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field = (intptr_t)v, field ## _padding = 0 > +#endif So in the rseq patch you have: +#ifdef __LP64__ +# define RSEQ_FIELD_u32_u64(field) uint64_t field +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v +#elif defined(__BYTE_ORDER) ? \ + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ + field ## _padding = 0, field = (intptr_t)v +#else +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \ + field = (intptr_t)v, field ## _padding = 0 +#endif IOW the same macro maze. Please use a separate header file which provides these macros once and share them between the two facilities. > +#define CPU_OP_VEC_LEN_MAX 16 > +#define CPU_OP_ARG_LEN_MAX 24 > +/* Max. data len per operation. */ > +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE That's something between 4K and 256K depending on the architecture. You really want to allow up to 256K data copy with preemption disabled? Shudder. > +/* > + * Max. data len for overall vector. We to restrict the amount of We to ? 
> + * user-space data touched by the kernel in non-preemptible context so > + * we do not introduce long scheduler latencies. > + * This allows one copy of up to 4096 bytes, and 15 operations touching > + * 8 bytes each. > + * This limit is applied to the sum of length specified for all > + * operations in a vector. > + */ > +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8) Magic numbers. Please use proper defines for heavens sake. > +#define CPU_OP_MAX_PAGES 4 /* Max. pages per op. */ > + > +enum cpu_op_type { > + CPU_COMPARE_EQ_OP, /* compare */ > + CPU_COMPARE_NE_OP, /* compare */ > + CPU_MEMCPY_OP, /* memcpy */ > + CPU_ADD_OP, /* arithmetic */ > + CPU_OR_OP, /* bitwise */ > + CPU_AND_OP, /* bitwise */ > + CPU_XOR_OP, /* bitwise */ > + CPU_LSHIFT_OP, /* shift */ > + CPU_RSHIFT_OP, /* shift */ > + CPU_MB_OP, /* memory barrier */ > +}; > + > +/* Vector of operations to perform. Limited to 16. */ > +struct cpu_op { > + int32_t op; /* enum cpu_op_type. */ > + uint32_t len; /* data length, in bytes. */ Please get rid of these tail comments > + union { > +#define TMP_BUFLEN 64 > +#define NR_PINNED_PAGES_ON_STACK 8 8 pinned pages on stack? Which stack? > +/* > + * The cpu_opv system call executes a vector of operations on behalf of > + * user-space on a specific CPU with preemption disabled. It is inspired > + * from readv() and writev() system calls which take a "struct iovec" s/from/by/ > + * array as argument. > + * > + * The operations available are: comparison, memcpy, add, or, and, xor, > + * left shift, and right shift. The system call receives a CPU number > + * from user-space as argument, which is the CPU on which those > + * operations need to be performed. All preparation steps such as > + * loading pointers, and applying offsets to arrays, need to be > + * performed by user-space before invoking the system call. The loading pointers and applying offsets? That makes no sense. 
> + * "comparison" operation can be used to check that the data used in the > + * preparation step did not change between preparation of system call > + * inputs and operation execution within the preempt-off critical > + * section. > + * > + * The reason why we require all pointer offsets to be calculated by > + * user-space beforehand is because we need to use get_user_pages_fast() > + * to first pin all pages touched by each operation. This takes care of That doesn't explain it either. > + * faulting-in the pages. Then, preemption is disabled, and the > + * operations are performed atomically with respect to other thread > + * execution on that CPU, without generating any page fault. > + * > + * A maximum limit of 16 operations per cpu_opv syscall
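Thomas's complaint about the duplicated macro maze suggests hoisting the endianness logic into one shared header. A minimal sketch of what that could look like — the header name and the `LINUX_FIELD_u32_u64` prefix are hypothetical, not from the patches; the endianness selection mirrors the duplicated `CPU_OP_`/`RSEQ_` macros quoted above:

```c
/* Hypothetical shared header (e.g. a common uapi types header) providing
 * the 32/64-bit field macros once, instead of duplicating them in the
 * cpu_opv and rseq uapi headers.  Macro names here are illustrative. */
#include <stdint.h>

#ifdef __LP64__
# define LINUX_FIELD_u32_u64(field)  uint64_t field
#elif defined(__BYTE_ORDER) ? \
        __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
# define LINUX_FIELD_u32_u64(field)  uint32_t field ## _padding, field
#else
# define LINUX_FIELD_u32_u64(field)  uint32_t field, field ## _padding
#endif

/* Each facility then defines its macro as a thin alias: */
#define CPU_OP_FIELD_u32_u64(field)  LINUX_FIELD_u32_u64(field)
#define RSEQ_FIELD_u32_u64(field)    LINUX_FIELD_u32_u64(field)

/* A field declared this way always occupies 64 bits, on both 32-bit and
 * 64-bit ABIs, so the kernel sees one layout regardless of user bitness. */
struct demo {
    CPU_OP_FIELD_u32_u64(ptr);
};
```

The point of the padding word on 32-bit is that the kernel can always read the field as a `u64` without compat translation.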
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 15, 2017, at 2:44 AM, Michael Kerrisk mtk.manpa...@gmail.com wrote: > Hi Matthieu > > On 14 November 2017 at 21:03, Mathieu Desnoyers >wrote: >> This new cpu_opv system call executes a vector of operations on behalf >> of user-space on a specific CPU with preemption disabled. It is inspired >> from readv() and writev() system calls which take a "struct iovec" array >> as argument. > > Do you have a man page for this syscall already? Hi Michael, It's the next thing on my roadmap when the syscall reaches mainline. That and membarrier commands man pages updates. Thanks, Mathieu > > Thanks, > > Michael > > >> The operations available are: comparison, memcpy, add, or, and, xor, >> left shift, right shift, and mb. The system call receives a CPU number >> from user-space as argument, which is the CPU on which those operations >> need to be performed. All preparation steps such as loading pointers, >> and applying offsets to arrays, need to be performed by user-space >> before invoking the system call. The "comparison" operation can be used >> to check that the data used in the preparation step did not change >> between preparation of system call inputs and operation execution within >> the preempt-off critical section. >> >> The reason why we require all pointer offsets to be calculated by >> user-space beforehand is because we need to use get_user_pages_fast() to >> first pin all pages touched by each operation. This takes care of >> faulting-in the pages. Then, preemption is disabled, and the operations >> are performed atomically with respect to other thread execution on that >> CPU, without generating any page fault. >> >> A maximum limit of 16 operations per cpu_opv syscall invocation is >> enforced, so user-space cannot generate a too long preempt-off critical >> section. 
Each operation is also limited to a length of PAGE_SIZE bytes, >> meaning that an operation can touch a maximum of 4 pages (memcpy: 2 >> pages for source, 2 pages for destination if addresses are not aligned >> on page boundaries). Moreover, a total limit of 4216 bytes is applied >> to operation lengths. >> >> If the thread is not running on the requested CPU, a new >> push_task_to_cpu() is invoked to migrate the task to the requested CPU. >> If the requested CPU is not part of the cpus allowed mask of the thread, >> the system call fails with EINVAL. After the migration has been >> performed, preemption is disabled, and the current CPU number is checked >> again and compared to the requested CPU number. If it still differs, it >> means the scheduler migrated us away from that CPU. Return EAGAIN to >> user-space in that case, and let user-space retry (either requesting the >> same CPU number, or a different one, depending on the user-space >> algorithm constraints). >> >> Signed-off-by: Mathieu Desnoyers >> CC: "Paul E. McKenney" >> CC: Peter Zijlstra >> CC: Paul Turner >> CC: Thomas Gleixner >> CC: Andrew Hunter >> CC: Andy Lutomirski >> CC: Andi Kleen >> CC: Dave Watson >> CC: Chris Lameter >> CC: Ingo Molnar >> CC: "H. Peter Anvin" >> CC: Ben Maurer >> CC: Steven Rostedt >> CC: Josh Triplett >> CC: Linus Torvalds >> CC: Andrew Morton >> CC: Russell King >> CC: Catalin Marinas >> CC: Will Deacon >> CC: Michael Kerrisk >> CC: Boqun Feng >> CC: linux-...@vger.kernel.org >> --- >> >> Changes since v1: >> - handle CPU hotplug, >> - cleanup implementation using function pointers: We can use function >> pointers to implement the operations rather than duplicating all the >> user-access code. >> - refuse device pages: Performing cpu_opv operations on io map'd pages >> with preemption disabled could generate long preempt-off critical >> sections, which leads to unwanted scheduler latency. 
Return EFAULT if >> a device page is received as parameter >> - restrict op vector to 4216 bytes length sum: Restrict the operation >> vector to length sum of: >> - 4096 bytes (typical page size on most architectures, should be >> enough for a string, or structures) >> - 15 * 8 bytes (typical operations on integers or pointers). >> The goal here is to keep the duration of preempt off critical section >> short, so we don't add significant scheduler latency. >> - Add INIT_ONSTACK macro: Introduce the >> CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users >> correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their >> stack to 0 on 32-bit architectures. >> -
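The changelog's description of the operation vector can be made concrete with a small sketch. This builds the two-op vector for the per-cpu spinlock fallback discussed in the thread (compare the lock word against 0, then store 1 via memcpy). The flattened `struct cpu_op` layout here is a simplified stand-in — the real patch used a union and `CPU_OP_FIELD_u32_u64()` pointer fields, and cpu_opv never reached mainline, so this is not a merged ABI:

```c
/* Simplified stand-in for the cpu_opv uapi; layout is hypothetical. */
#include <stdint.h>

enum cpu_op_type {          /* op codes as listed in the patch */
    CPU_COMPARE_EQ_OP,
    CPU_COMPARE_NE_OP,
    CPU_MEMCPY_OP,
    CPU_ADD_OP,
    CPU_OR_OP,
    CPU_AND_OP,
    CPU_XOR_OP,
    CPU_LSHIFT_OP,
    CPU_RSHIFT_OP,
    CPU_MB_OP,
};

struct cpu_op {             /* flattened illustrative layout */
    int32_t  op;            /* enum cpu_op_type */
    uint32_t len;           /* data length in bytes */
    uint64_t a;             /* compare: address / memcpy: destination */
    uint64_t b;             /* compare: expected-value address / memcpy: source */
};

/* Build the vector for "if (*lock == *expect_zero) *lock = *one;". */
static int make_trylock_vec(struct cpu_op *vec, intptr_t *lock,
                            const intptr_t *expect_zero,
                            const intptr_t *one)
{
    /* op 0: abort the whole vector unless the lock word is still 0 */
    vec[0] = (struct cpu_op){ .op  = CPU_COMPARE_EQ_OP,
                              .len = sizeof(*lock),
                              .a   = (uint64_t)(uintptr_t)lock,
                              .b   = (uint64_t)(uintptr_t)expect_zero };
    /* op 1: commit -- copy the value 1 over the lock word */
    vec[1] = (struct cpu_op){ .op  = CPU_MEMCPY_OP,
                              .len = sizeof(*lock),
                              .a   = (uint64_t)(uintptr_t)lock,
                              .b   = (uint64_t)(uintptr_t)one };
    return 2;               /* vector length, well under the 16-op max */
}
```

Note how all addresses are resolved by user-space before the call, matching the changelog's requirement that pointer preparation happen before page pinning.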
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
Hi Mathieu On 14 November 2017 at 21:03, Mathieu Desnoyers wrote: > This new cpu_opv system call executes a vector of operations on behalf > of user-space on a specific CPU with preemption disabled. It is inspired > from readv() and writev() system calls which take a "struct iovec" array > as argument. Do you have a man page for this syscall already? Thanks, Michael > The operations available are: comparison, memcpy, add, or, and, xor, > left shift, right shift, and mb. The system call receives a CPU number > from user-space as argument, which is the CPU on which those operations > need to be performed. All preparation steps such as loading pointers, > and applying offsets to arrays, need to be performed by user-space > before invoking the system call. The "comparison" operation can be used > to check that the data used in the preparation step did not change > between preparation of system call inputs and operation execution within > the preempt-off critical section. > > The reason why we require all pointer offsets to be calculated by > user-space beforehand is because we need to use get_user_pages_fast() to > first pin all pages touched by each operation. This takes care of > faulting-in the pages. Then, preemption is disabled, and the operations > are performed atomically with respect to other thread execution on that > CPU, without generating any page fault. > > A maximum limit of 16 operations per cpu_opv syscall invocation is > enforced, so user-space cannot generate a too long preempt-off critical > section. Each operation is also limited to a length of PAGE_SIZE bytes, > meaning that an operation can touch a maximum of 4 pages (memcpy: 2 > pages for source, 2 pages for destination if addresses are not aligned > on page boundaries). Moreover, a total limit of 4216 bytes is applied > to operation lengths. > > If the thread is not running on the requested CPU, a new > push_task_to_cpu() is invoked to migrate the task to the requested CPU. 
> If the requested CPU is not part of the cpus allowed mask of the thread, > the system call fails with EINVAL. After the migration has been > performed, preemption is disabled, and the current CPU number is checked > again and compared to the requested CPU number. If it still differs, it > means the scheduler migrated us away from that CPU. Return EAGAIN to > user-space in that case, and let user-space retry (either requesting the > same CPU number, or a different one, depending on the user-space > algorithm constraints). > > Signed-off-by: Mathieu Desnoyers > CC: "Paul E. McKenney" > CC: Peter Zijlstra > CC: Paul Turner > CC: Thomas Gleixner > CC: Andrew Hunter > CC: Andy Lutomirski > CC: Andi Kleen > CC: Dave Watson > CC: Chris Lameter > CC: Ingo Molnar > CC: "H. Peter Anvin" > CC: Ben Maurer > CC: Steven Rostedt > CC: Josh Triplett > CC: Linus Torvalds > CC: Andrew Morton > CC: Russell King > CC: Catalin Marinas > CC: Will Deacon > CC: Michael Kerrisk > CC: Boqun Feng > CC: linux-...@vger.kernel.org > --- > > Changes since v1: > - handle CPU hotplug, > - cleanup implementation using function pointers: We can use function > pointers to implement the operations rather than duplicating all the > user-access code. > - refuse device pages: Performing cpu_opv operations on io map'd pages > with preemption disabled could generate long preempt-off critical > sections, which leads to unwanted scheduler latency. Return EFAULT if > a device page is received as parameter > - restrict op vector to 4216 bytes length sum: Restrict the operation > vector to length sum of: > - 4096 bytes (typical page size on most architectures, should be > enough for a string, or structures) > - 15 * 8 bytes (typical operations on integers or pointers). > The goal here is to keep the duration of preempt off critical section > short, so we don't add significant scheduler latency. 
> - Add INIT_ONSTACK macro: Introduce the > CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users > correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their > stack to 0 on 32-bit architectures. > - Add CPU_MB_OP operation: > Use-cases with: > - two consecutive stores, > - a mempcy followed by a store, > require a memory barrier before the final store operation. A typical > use-case is a store-release on the final store. Given that this is a > slow path, just providing an explicit full barrier instruction should > be sufficient. > - Add expect fault field: > The use-case of list_pop brings interesting challenges. With rseq, we > can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer, > compare it against NULL, add an offset, and load the target "next" > pointer from the object, all within a single req critical section. > > Life is not so easy for cpu_opv in this use-case, mainly because we > need to pin all pages we are going to touch in the preempt-off > critical section beforehand. So we need to
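The EAGAIN behavior described in the changelog (the scheduler may migrate the task off the requested CPU between the push and the preempt-off check) implies a simple user-space retry loop. A sketch of that pattern — since cpu_opv was never merged, the syscall is modeled as a callback here, and `fake_opv` is a test double standing in for a real `syscall(__NR_cpu_opv, ...)` wrapper:

```c
/* Retry pattern for cpu_opv's EAGAIN, as described in the changelog.
 * The syscall itself is abstracted behind a function pointer because
 * cpu_opv never reached mainline; a real caller would pass a wrapper
 * around syscall(__NR_cpu_opv, ...). */
#include <errno.h>

typedef int (*cpu_opv_fn)(void *cpuopv, int cpuopcnt, int cpu, int flags);

static int cpu_opv_retry(cpu_opv_fn opv, void *vec, int cnt, int cpu)
{
    int ret;

    do {
        ret = opv(vec, cnt, cpu, 0);
        /* EAGAIN: the scheduler migrated us away after the push;
         * simply retry on the same CPU (or pick a new one). */
    } while (ret < 0 && errno == EAGAIN);
    return ret;
}

/* Test double: fail with EAGAIN twice, then succeed, to exercise the loop. */
static int fake_calls;
static int fake_opv(void *vec, int cnt, int cpu, int flags)
{
    (void)vec; (void)cnt; (void)cpu; (void)flags;
    if (++fake_calls < 3) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

EINVAL (CPU not in the allowed mask), by contrast, is permanent and must not be retried, which is why the loop checks errno specifically.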
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
- On Nov 14, 2017, at 3:03 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: [...] > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 3b448ba82225..cab256c1720a 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, > unsigned int cpu) > #endif > } > > +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu); Testing on CONFIG_SMP=n showed that I needed to add an empty static inline (returning 0) for the !SMP case. Mathieu > + > /* > * Tunables that become constants when CONFIG_SCHED_DEBUG is off: > */ -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
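The !SMP stub Mathieu mentions would follow the usual kernel pattern: on a uniprocessor kernel there is only one CPU, so "pushing" a task is a no-op that succeeds. A sketch — `task_struct` is forward-declared here so the snippet stands alone; in the kernel this would live next to the declaration in kernel/sched/sched.h:

```c
/* Sketch of the CONFIG_SMP split for push_task_to_cpu().
 * struct task_struct is opaque here; the real type is in the kernel. */
struct task_struct;

#ifdef CONFIG_SMP
int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
#else
static inline int push_task_to_cpu(struct task_struct *p,
                                   unsigned int dest_cpu)
{
    (void)p;
    (void)dest_cpu;
    return 0;   /* single CPU: the task is already "there" */
}
#endif
```

This keeps callers like the cpu_opv syscall free of `#ifdef CONFIG_SMP` at the call site.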