Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-21 Thread Mathieu Desnoyers
- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> - On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:
>> The use-case for 4k memcpy operation is a per-cpu ring buffer where
>> the rseq fast-path does the following:
>> 
>> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>>   given structure (limited to 4k in size) into a ring buffer,
>>   followed by the final commit instruction which increments the current
>>   position offset by the number of bytes pushed.
>> 
>> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>>   a given structure (up to 4k) from the ring buffer, at "position" offset.
>>   The final commit instruction decrements the current position offset by
>>   the number of bytes pop'd.
>> 
>> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
>> rseq fails to progress.
> 
> I'm still confused. Before you talked about the sequence:
> 
>1) Reserve
> 
>2) Commit
> 
> and both use rseq, but obviously you cannot do two "atomic" operation in
> one section.
> 
> So now you talk about push. Is that what you described earlier as commit?
> 
> Assumed that it is, I still have no idea why the memcpy needs to happen in
> that preempt disabled region.
> 
> If you have space reserved, then the copy does not have any dependencies
> neither on the cpu it runs on nor on anything else. So the only
> interresting operation is the final commit instruction which tells the
> consumer that its ready.
> 
> So what is the part I am missing here aside of having difficulties to map
> the constantly changing names of this stuff?

Let's clear up some confusion: those are two different use-cases. The
ring buffer with reserve+commit is a FIFO ring buffer, and the ring buffer
with memcpy+position update is a LIFO stack.

Let me explain the various use-cases here, so we know what we're talking
about.

rseq and cpu_opv use-cases

1) per-cpu spinlock

A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.

The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).
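
For illustration, here is a minimal user-space sketch of that lock/unlock
pair. The helpers rseq_cmpeqv_storev() and cpu_op_cmpeqv_storev() are
assumed wrapper names (prototypes declared below only to keep the sketch
self-contained); of these, only rseq_cpu_start() appears elsewhere in this
thread:

#include <stdint.h>

/* Assumed wrappers: each returns 0 when the compare matched and the
 * store was performed on @cpu, nonzero otherwise. */
int rseq_cpu_start(void);
int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu);
int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu);

#define NR_TEST_CPUS    64              /* arbitrary, for the sketch only */

struct percpu_lock {
        /* 0 = unlocked, 1 = locked (real code would pad each slot
         * to a cache line). */
        intptr_t word[NR_TEST_CPUS];
};

static int percpu_lock_acquire(struct percpu_lock *lock)
{
        int cpu;

        for (;;) {
                cpu = rseq_cpu_start();
                /* rseq fast path: compare word == 0, store 1, on @cpu. */
                if (!rseq_cmpeqv_storev(&lock->word[cpu], 0, 1, cpu))
                        break;
                /* Fallback: same compare + store done by cpu_opv on @cpu. */
                if (!cpu_op_cmpeqv_storev(&lock->word[cpu], 0, 1, cpu))
                        break;
                /* Lock busy or rseq aborted: retry. */
        }
        /* Acquire barrier after the control dependency. */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        return cpu;     /* caller passes this cpu back to the release path */
}

static void percpu_lock_release(struct percpu_lock *lock, int cpu)
{
        /* Unlock path: simple store-release, no rseq needed. */
        __atomic_store_n(&lock->word[cpu], 0, __ATOMIC_RELEASE);
}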


2) per-cpu statistics counters

Per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as the commit.

The cpu_opv fallback can be implemented as an "add" operation.

Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.


3) per-cpu LIFO linked-list (unlimited size stack)

A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.

The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.

The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm:
considering that cpu_opv requires to know all pointers at system
call entry so it can pin all pages, so cpu_opv cannot simply load
head and then load the head->next address within the preempt-off
critical section. User-space needs to pass the head and head->next
addresses to the kernel, and the kernel needs to check that the
head address is unchanged since it has been loaded by user-space.
However, when accessing head->next in a ABA situation, it's
possible that head is unchanged, but loading head->next can
result in a page fault due to a concurrently freed head object.
This is why the "expect_fault" operation field is introduced: if a
fault is triggered by this access, "-EAGAIN" will be returned by
cpu_opv rather than -EFAULT, thus indicating the the operation
vector should be attempted again. The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.
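
To make that retry-on-fault protocol concrete, here is a sketch of the
user-space side of the "pop" fallback. cpu_op_cmpeqv_deref_storev() is an
assumed wrapper around the cpu_opv vector just described (compare *head
against the expected head, dereference its next pointer kernel-side with
expect_fault set, store the result to *head); it is not an API shown in
this thread:

#include <stddef.h>
#include <stdint.h>

struct node {
        struct node *next;
        /* payload ... */
};

/* Assumed wrapper: returns 0 on success, nonzero when the comparison
 * failed or the faultable head->next access returned -EAGAIN. */
int cpu_op_cmpeqv_deref_storev(intptr_t *head, intptr_t expect,
                               size_t next_offset, int cpu);

static struct node *percpu_list_pop_fallback(struct node **head, int cpu)
{
        for (;;) {
                struct node *expect;

                /* Load head in user-space; re-validated by the kernel. */
                expect = (struct node *)__atomic_load_n((intptr_t *)head,
                                                        __ATOMIC_RELAXED);
                if (!expect)
                        return NULL;            /* empty list */
                if (!cpu_op_cmpeqv_deref_storev((intptr_t *)head,
                                                (intptr_t)expect,
                                                offsetof(struct node, next),
                                                cpu))
                        return expect;          /* popped */
                /* Head changed, or freed-object fault: reload and retry. */
        }
}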


4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects
by passing their pointers through a per-cpu fixed-sized stack.

The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 20, 2017, at 1:49 PM, Andi Kleen a...@firstfloor.org wrote:

>> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
>> rseq fails to progress.
> 
> If anybody ever gets that right. It will be really hard to just
> test such a path.
> 
> It also seems fairly theoretical to me. Do you even have a
> test case where the normal path stops making forward progress?

We expect the following loop to progress, typically after a single
iteration:

do {
	cpu = rseq_cpu_start();
	ret = rseq_addv(&v, 1, cpu);
	attempts++;
} while (ret);

Now runnig this in gdb, break on "main", run, and single-step
execution with "next", the program is stuck in an infinite loop.

What solution do you have in mind to handle this kind of
scenario without breaking pre-existing debuggers ?

Looking at vDSO examples of vgetcpu and vclock_gettime under
gdb 7.7.1 (debian) with glibc 2.19:

sched_getcpu under source-line single-stepping with "step": gdb only
sees the ../sysdeps/unix/sysv/linux/x86_64/sched_getcpu.S
source lines, which makes it skip single-stepping of the vDSO.

sched_getcpu under "stepi": it does go through the vDSO instruction
addresses. It does progress, given that there is no loop there.

clock_gettime under "step": it only sees source lines of
../sysdeps/unix/clock_gettime.c.

clock_gettime under "stepi": it's stuck in an infinite loop.

So instruction-level stepping from gdb turns the clock_gettime vDSO
into a never-ending loop, which is already bad. But with rseq, the
situation is even worse, because it turns source-line-level
single-stepping into infinite loops.

My understanding from https://sourceware.org/bugzilla/show_bug.cgi?id=14466
is that GDB currently simply removes the vDSO from its list of library
mappings, which is probably why it skips over the vDSO for the
source-line single-stepping case. We cannot do that with rseq, because we
_want_ the rseq critical section to be inlined into the application
or library. A function call costs more than most rseq critical sections.

I plan to have the rseq user-space code provide a "__rseq_table" section
so debuggers can eventually figure out that they need to skip over the
rseq critical sections. However, it won't help the fact that pre-existing
debugger single-stepping will start turning perfectly working programs
into never-ending loops simply by having glibc use rseq for memory
allocation.

Using the cpu_opv system call on rseq failure solves this problem
entirely.
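
For example, the retry loop above can then conclude deterministically.
cpu_op_addv() is an assumed wrapper issuing a single CPU_ADD_OP vector
through cpu_opv, and the prototypes below are assumptions, not code quoted
in this thread:

#include <stdint.h>

/* Assumed wrappers, declared only to keep the sketch self-contained. */
int rseq_cpu_start(void);
int rseq_addv(intptr_t *v, intptr_t count, int cpu);    /* rseq fast path */
int cpu_op_addv(intptr_t *v, intptr_t count, int cpu);  /* cpu_opv fallback */

static int percpu_counter_inc(intptr_t *counters)       /* one slot per CPU */
{
        int cpu = rseq_cpu_start();

        if (!rseq_addv(&counters[cpu], 1, cpu))
                return 0;       /* fast path succeeded */
        /*
         * rseq aborted (e.g. the program is being single-stepped):
         * conclude with cpu_opv, which performs the add with preemption
         * disabled on @cpu and therefore always makes progress.
         */
        return cpu_op_addv(&counters[cpu], 1, cpu);
}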

I would even go further and recommend taking a similar approach when
lack of progress is detected in a vDSO: invoke the equivalent
system call. The current clock_gettime() vDSO implementation
turns instruction-level single-stepping into never-ending loops,
which is far from elegant.

Thanks,

Mathieu 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Thomas Gleixner
On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> - On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:
> The use-case for 4k memcpy operation is a per-cpu ring buffer where
> the rseq fast-path does the following:
> 
> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>   given structure (limited to 4k in size) into a ring buffer,
>   followed by the final commit instruction which increments the current
>   position offset by the number of bytes pushed.
> 
> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>   a given structure (up to 4k) from the ring buffer, at "position" offset.
>   The final commit instruction decrements the current position offset by
>   the number of bytes pop'd.
> 
> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
> rseq fails to progress.

I'm still confused. Before you talked about the sequence:

1) Reserve

2) Commit

and both use rseq, but obviously you cannot do two "atomic" operation in
one section.

So now you talk about push. Is that what you described earlier as commit?

Assuming that it is, I still have no idea why the memcpy needs to happen in
that preempt-disabled region.

If you have space reserved, then the copy does not have any dependencies,
neither on the CPU it runs on nor on anything else. So the only
interesting operation is the final commit instruction which tells the
consumer that it's ready.

So what is the part I am missing here, aside from having difficulty mapping
the constantly changing names of this stuff?

Thanks,

tglx




Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Andi Kleen
> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
> rseq fails to progress.

If anybody ever gets that right. It will be really hard to even
test such a path.

It also seems fairly theoretical to me. Do you even have a 
test case where the normal path stops making forward progress?

-Andi


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 20, 2017, at 1:03 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Thomas Gleixner wrote:
>> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> > >> + * The reason why we require all pointer offsets to be calculated by
>> > >> + * user-space beforehand is because we need to use 
>> > >> get_user_pages_fast()
>> > >> + * to first pin all pages touched by each operation. This takes care of
>> > > 
>> > > That doesnt explain it either.
>> > 
>> > What kind of explication are you looking for here ? Perhaps being too close
>> > to the implementation prevents me from understanding what is unclear from
>> > your perspective.
>> 
>> What the heck are pointer offsets?
>> 
>> The ops have one or two pointer(s) to a lump of memory. So if a pointer
>> points to the wrong lump of memory then you're screwed, but that's true for
>> all pointers handed to the kernel.
> 
> So I think you mix here the 'user space programmer guide - how to set up
> the magic ops - into the kernel side documentation. The kernel side does
> not care about the pointers, as long as they are valid to access.
> 
> Again. Try to explain things at the conceptual level.

Yes, I took the sentence you suggested for that reason: it removes
usage details that are meant for the user-space implementer and do not
belong in those comments.

Thanks,

Mathieu

> 
> Thanks,
> 
>   tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> - On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote:
>> >> +#define NR_PINNED_PAGES_ON_STACK 8
>> > 
>> > 8 pinned pages on stack? Which stack?
>> 
>> The common cases need to touch few pages, and we can keep the
>> pointers in an array on the kernel stack within the cpu_opv system
>> call.
>> 
>> Updating to:
>> 
>> /*
>>  * Typical invocation of cpu_opv need few pages. Keep struct page
>>  * pointers in an array on the stack of the cpu_opv system call up to
>>  * this limit, beyond which the array is dynamically allocated.
>>  */
>> #define NR_PIN_PAGES_ON_STACK8
> 
> That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.

fixed.

> 
>> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
>> >> + * left shift, and right shift. The system call receives a CPU number
>> >> + * from user-space as argument, which is the CPU on which those
>> >> + * operations need to be performed. All preparation steps such as
>> >> + * loading pointers, and applying offsets to arrays, need to be
>> >> + * performed by user-space before invoking the system call. The
>> > 
>> > loading pointers and applying offsets? That makes no sense.
>> 
>> Updating to:
>> 
>>  * All preparation steps such as
>>  * loading base pointers, and adding offsets derived from the current
>>  * CPU number, need to be performed by user-space before invoking the
>>  * system call.
> 
> This still does not explain anything, really.
> 
> Which base pointer is loaded?  I nowhere see a reference to a base
> pointer.
> 
> And what are the offsets about?
> 
> derived from current cpu number? What is current CPU number? The one on
> which the task executes now or the one which it should execute on?
> 
> I assume what you want to say is:
> 
>  All pointers in the ops must have been set up to point to the per CPU
>  memory of the CPU on which the operations should be executed.
> 
> At least that's what I oracle in to that.

Exactly that. Will update to use this description instead.

> 
>> >> + * "comparison" operation can be used to check that the data used in the
>> >> + * preparation step did not change between preparation of system call
>> >> + * inputs and operation execution within the preempt-off critical
>> >> + * section.
>> >> + *
>> >> + * The reason why we require all pointer offsets to be calculated by
>> >> + * user-space beforehand is because we need to use get_user_pages_fast()
>> >> + * to first pin all pages touched by each operation. This takes care of
>> > 
>> > That doesnt explain it either.
>> 
>> What kind of explication are you looking for here ? Perhaps being too close
>> to the implementation prevents me from understanding what is unclear from
>> your perspective.
> 
> What the heck are pointer offsets?
> 
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true for
> all pointers handed to the kernel.

I think the sentence you suggested above is clear enough. I'll simply use
it.

> 
>> Sorry, that paragraph was unclear. Updated:
>> 
>>  * An overall maximum of 4216 bytes in enforced on the sum of operation
>>  * length within an operation vector, so user-space cannot generate a
>>  * too long preempt-off critical section (cache cold critical section
>>  * duration measured as 4.7µs on x86-64). Each operation is also limited
>>  * a length of PAGE_SIZE bytes,
> 
> Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
> is a hard limit of 4K. And because there is no alignment requiremnt the
> rest of the sentence is stating the obvious.

I can make that a 4K limit if you prefer. This presumes that no architecture
has pages smaller than 4K, which is true on Linux.

> 
>>  * meaning that an operation can touch a
>>  * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
>>  * destination if addresses are not aligned on page boundaries).
> 
> I still have to understand why the 4K copy is necessary in the first place.
> 
>> > What's the critical section duration for operations which go to the limits
>> > of this on a average x86 64 machine?
>> 
>> When cache-cold, I measure 4.7 µs per critical section doing a
>> 4k memcpy and 15 * 8 bytes memcpy on a E5-2630 v3 @2.4GHz. Is it an
>> acceptable preempt-off latency for RT ?
> 
> Depends on the use case as always 

The use-case for the 4k memcpy operation is a per-cpu ring buffer where
the rseq fast path does the following:

- ring buffer push: in the rseq asm instruction sequence, a memcpy of a
  given structure (limited to 4k in size) into a ring buffer,
  followed by the final commit instruction which increments the current
  position offset by the number of bytes pushed.

- ring buffer pop: in the rseq asm instruction sequence, a memcpy of
  a given structure (up to 4k) 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Thomas Gleixner
On Mon, 20 Nov 2017, Thomas Gleixner wrote:
> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> > >> + * The reason why we require all pointer offsets to be calculated by
> > >> + * user-space beforehand is because we need to use get_user_pages_fast()
> > >> + * to first pin all pages touched by each operation. This takes care of
> > > 
> > > That doesnt explain it either.
> > 
> > What kind of explication are you looking for here ? Perhaps being too close
> > to the implementation prevents me from understanding what is unclear from
> > your perspective.
> 
> What the heck are pointer offsets?
> 
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true for
> all pointers handed to the kernel.

So I think you are mixing the 'user-space programmer guide' (how to set up
the magic ops) into the kernel-side documentation. The kernel side does
not care about the pointers, as long as they are valid to access.

Again. Try to explain things at the conceptual level.

Thanks,

tglx


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Thomas Gleixner
On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> - On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote:
> >> +#define NR_PINNED_PAGES_ON_STACK  8
> > 
> > 8 pinned pages on stack? Which stack?
> 
> The common cases need to touch few pages, and we can keep the
> pointers in an array on the kernel stack within the cpu_opv system
> call.
> 
> Updating to:
> 
> /*
>  * Typical invocation of cpu_opv need few pages. Keep struct page
>  * pointers in an array on the stack of the cpu_opv system call up to
>  * this limit, beyond which the array is dynamically allocated.
>  */
> #define NR_PIN_PAGES_ON_STACK8

That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.

> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
> >> + * left shift, and right shift. The system call receives a CPU number
> >> + * from user-space as argument, which is the CPU on which those
> >> + * operations need to be performed. All preparation steps such as
> >> + * loading pointers, and applying offsets to arrays, need to be
> >> + * performed by user-space before invoking the system call. The
> > 
> > loading pointers and applying offsets? That makes no sense.
> 
> Updating to:
> 
>  * All preparation steps such as
>  * loading base pointers, and adding offsets derived from the current
>  * CPU number, need to be performed by user-space before invoking the
>  * system call.

This still does not explain anything, really.

Which base pointer is loaded? I see no reference to a base pointer
anywhere.

And what are the offsets about?

Derived from the current CPU number? What is the current CPU number? The one
on which the task executes now, or the one on which it should execute?

I assume what you want to say is:

  All pointers in the ops must have been set up to point to the per CPU
  memory of the CPU on which the operations should be executed.

At least that's what I oracle into that.

> >> + * "comparison" operation can be used to check that the data used in the
> >> + * preparation step did not change between preparation of system call
> >> + * inputs and operation execution within the preempt-off critical
> >> + * section.
> >> + *
> >> + * The reason why we require all pointer offsets to be calculated by
> >> + * user-space beforehand is because we need to use get_user_pages_fast()
> >> + * to first pin all pages touched by each operation. This takes care of
> > 
> > That doesnt explain it either.
> 
> What kind of explication are you looking for here ? Perhaps being too close
> to the implementation prevents me from understanding what is unclear from
> your perspective.

What the heck are pointer offsets?

The ops have one or two pointer(s) to a lump of memory. So if a pointer
points to the wrong lump of memory then you're screwed, but that's true for
all pointers handed to the kernel.

> Sorry, that paragraph was unclear. Updated:
> 
>  * An overall maximum of 4216 bytes in enforced on the sum of operation
>  * length within an operation vector, so user-space cannot generate a
>  * too long preempt-off critical section (cache cold critical section
>  * duration measured as 4.7µs on x86-64). Each operation is also limited
>  * a length of PAGE_SIZE bytes,

Again, PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
is a hard limit of 4K. And because there is no alignment requirement, the
rest of the sentence is stating the obvious.

>  * meaning that an operation can touch a
>  * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
>  * destination if addresses are not aligned on page boundaries).

I still have to understand why the 4K copy is necessary in the first place.

> > What's the critical section duration for operations which go to the limits
> > of this on a average x86 64 machine?
> 
> When cache-cold, I measure 4.7 µs per critical section doing a
> 4k memcpy and 15 * 8 bytes memcpy on a E5-2630 v3 @2.4GHz. Is it an
> acceptable preempt-off latency for RT ?

Depends on the use case, as always.

Thanks,

tglx

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 17, 2017, at 3:22 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Fri, 17 Nov 2017, Mathieu Desnoyers wrote:
>> - On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote:
>> 7) Allow libraries with multi-part algorithms to work on same per-cpu
>>data without affecting the allowed cpu mask
>> 
>> I stumbled on an interesting use-case within the lttng-ust tracer
>> per-cpu buffers: the algorithm needs to update a "reserve" counter,
>> serialize data into the buffer, and then update a "commit" counter
>> _on the same per-cpu buffer_. My goal is to use rseq for both reserve
>> and commit.
>> 
>> Clearly, if rseq reserve fails, the algorithm can retry on a different
>> per-cpu buffer. However, it's not that easy for the commit. It needs to
>> be performed on the same per-cpu buffer as the reserve.
>> 
>> The cpu_opv system call solves that problem by receiving the cpu number
>> on which the operation needs to be performed as argument. It can push
>> the task to the right CPU if needed, and perform the operations there
>> with preemption disabled.
> 
> If your transaction cannot be done in one go, then abusing that byte code
> interpreter for concluding it is just hillarious. That whole exercise is a
> gazillion times slower than the atomic operations which are neccesary to do
> it without all that.
> 
> I'm even more convinced now that this is overengineered beyond repair.

The fast path (typical case) will execute on the right CPU, where rseq
does both the reserve and the commit, and that is faster than atomic ops.

However, we need to handle migration between reserve and commit.
Unfortunately, concurrent rseq and atomic ops don't mix well on the
same per-cpu data, so we cannot fall back on atomic ops, except in
very specific cases where we can use a split-counter strategy.

So cpu_opv handles migration for this use-case by ensuring that the
slow path is performed with preemption off, after migrating the
current task to the right CPU.
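
A sketch of that flow, with placeholder helper names standing in for the
lttng-ust internals (none of these helpers is an API from this series):

#include <stddef.h>
#include <string.h>

struct rb {
        char *data;
        /* ... reserve and commit counters ... */
};

/* Placeholders: reserve/commit via rseq, commit fallback via cpu_opv. */
int rseq_cpu_start(void);
int rb_rseq_reserve(struct rb *rb, size_t len, size_t *off, int cpu);
int rb_rseq_commit(struct rb *rb, size_t off, size_t len, int cpu);
int rb_cpu_opv_commit(struct rb *rb, size_t off, size_t len, int cpu);

static void rb_write(struct rb *per_cpu_rb, const void *rec, size_t len)
{
        for (;;) {
                int cpu = rseq_cpu_start();
                struct rb *rb = &per_cpu_rb[cpu];
                size_t off;

                /* Part 1: reserve with rseq; on abort, retry (possibly on
                 * another CPU's buffer). */
                if (rb_rseq_reserve(rb, len, &off, cpu))
                        continue;

                /* Serialize the payload; no critical section needed here. */
                memcpy(rb->data + off, rec, len);

                /* Part 2: the commit must hit the *same* per-cpu buffer.
                 * If we migrated, the rseq commit aborts and cpu_opv
                 * concludes it after moving the task back to @cpu with
                 * preemption off. */
                if (rb_rseq_commit(rb, off, len, cpu))
                        rb_cpu_opv_commit(rb, off, len, cpu);
                return;
        }
}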

Thanks,

Mathieu


> 
> Thanks,
> 
>   tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
>> +#ifdef __KERNEL__
>> +# include 
>> +#else   /* #ifdef __KERNEL__ */
> 
>  Sigh.

fixed.

> 
>> +# include 
>> +#endif  /* #else #ifdef __KERNEL__ */
>> +
>> +#include 
>> +
>> +#ifdef __LP64__
>> +# define CPU_OP_FIELD_u32_u64(field)uint64_t field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)field = 
>> (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>> +# define CPU_OP_FIELD_u32_u64(field)uint32_t field ## _padding, 
>> field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)\
>> +field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define CPU_OP_FIELD_u32_u64(field)uint32_t field, field ## 
>> _padding
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)\
>> +field = (intptr_t)v, field ## _padding = 0
>> +#endif
> 
> So in the rseq patch you have:
> 
> +#ifdef __LP64__
> +# define RSEQ_FIELD_u32_u64(field) uint64_t field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +   __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> +   field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> +   field = (intptr_t)v, field ## _padding = 0
> +#endif
> 
> IOW the same macro maze. Please use a separate header file which provides
> these macros once and share them between the two facilities.

ok. I'll introduce uapi/linux/types_32_64.h, and prefix defines with
"LINUX_":

LINUX_FIELD_u32_u64()

unless other names are preferred. It will be in a separate patch.

> 
>> +#define CPU_OP_VEC_LEN_MAX  16
>> +#define CPU_OP_ARG_LEN_MAX  24
>> +/* Max. data len per operation. */
>> +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE
> 
> That's something between 4K and 256K depending on the architecture.
> 
> You really want to allow up to 256K data copy with preemption disabled?
> Shudder.

This is the max per operation. Following peterz' comments at KS, I added a
limit of 4096 + 15 * 8 on the sum of len for all operations in a vector. This
is defined below as CPU_OP_VEC_DATA_LEN_MAX.

So each of the 16 ops cannot have a len larger than PAGE_SIZE, so we
pin at most 4 pages per op (e.g. memcpy: 2 pages for src, 2 pages for dst),
*and* the sum of all op lengths needs to be <= 4216 bytes. So the max limit
you are interested in here is the 4216-byte limit.

> 
>> +/*
>> + * Max. data len for overall vector. We to restrict the amount of
> 
> We to ?

fixed

> 
>> + * user-space data touched by the kernel in non-preemptible context so
>> + * we do not introduce long scheduler latencies.
>> + * This allows one copy of up to 4096 bytes, and 15 operations touching
>> + * 8 bytes each.
>> + * This limit is applied to the sum of length specified for all
>> + * operations in a vector.
>> + */
>> +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8)
> 
> Magic numbers. Please use proper defines for heavens sake.

ok, it becomes:

#define CPU_OP_MEMCPY_EXPECT_LEN   4096
#define CPU_OP_EXPECT_LEN   8
#define CPU_OP_VEC_DATA_LEN_MAX \
(CPU_OP_MEMCPY_EXPECT_LEN + \
 (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)

> 
>> +#define CPU_OP_MAX_PAGES4   /* Max. pages per op. */

Actually, I'll move the CPU_OP_MAX_PAGES define to cpu_opv.c. It's not
needed in the uapi header.


>> +
>> +enum cpu_op_type {
>> +CPU_COMPARE_EQ_OP,  /* compare */
>> +CPU_COMPARE_NE_OP,  /* compare */
>> +CPU_MEMCPY_OP,  /* memcpy */
>> +CPU_ADD_OP, /* arithmetic */
>> +CPU_OR_OP,  /* bitwise */
>> +CPU_AND_OP, /* bitwise */
>> +CPU_XOR_OP, /* bitwise */
>> +CPU_LSHIFT_OP,  /* shift */
>> +CPU_RSHIFT_OP,  /* shift */
>> +CPU_MB_OP,  /* memory barrier */
>> +};
>> +
>> +/* Vector of operations to perform. Limited to 16. */
>> +struct cpu_op {
>> +int32_t op; /* enum cpu_op_type. */
>> +uint32_t len;   /* data length, in bytes. */
> 
> Please get rid of these tail comments

ok

> 
>> +union {
>> +#define TMP_BUFLEN  64
>> +#define NR_PINNED_PAGES_ON_STACK8
> 
> 8 pinned pages on stack? Which stack?

The common cases need to touch few pages, and we can keep the
pointers in an array on the kernel stack within the cpu_opv system
call.

Updating to:

/*
 * Typical invocation of cpu_opv need few pages. Keep struct page
 * pointers in an array on the stack of the cpu_opv 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-20 Thread Mathieu Desnoyers
- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
>> +#ifdef __KERNEL__
>> +# include 
>> +#else   /* #ifdef __KERNEL__ */
> 
>  Sigh.

fixed.

> 
>> +# include 
>> +#endif  /* #else #ifdef __KERNEL__ */
>> +
>> +#include 
>> +
>> +#ifdef __LP64__
>> +# define CPU_OP_FIELD_u32_u64(field)uint64_t field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)field = 
>> (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>> +# define CPU_OP_FIELD_u32_u64(field)uint32_t field ## _padding, 
>> field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)\
>> +field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define CPU_OP_FIELD_u32_u64(field)uint32_t field, field ## 
>> _padding
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)\
>> +field = (intptr_t)v, field ## _padding = 0
>> +#endif
> 
> So in the rseq patch you have:
> 
> +#ifdef __LP64__
> +# define RSEQ_FIELD_u32_u64(field) uint64_t field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +   __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> +   field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> +   field = (intptr_t)v, field ## _padding = 0
> +#endif
> 
> IOW the same macro maze. Please use a separate header file which provides
> these macros once and share them between the two facilities.

ok. I'll introduce uapi/linux/types_32_64.h, and prefix defines with
"LINUX_":

LINUX_FIELD_u32_u64()

unless other names are preferred. It will be in a separate patch.

> 
>> +#define CPU_OP_VEC_LEN_MAX  16
>> +#define CPU_OP_ARG_LEN_MAX  24
>> +/* Max. data len per operation. */
>> +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE
> 
> That's something between 4K and 256K depending on the architecture.
> 
> You really want to allow up to 256K data copy with preemption disabled?
> Shudder.

This is the max per operation. Following peterz' comments at KS, I added a
limit of 4096 + 15 * 8 on the sum of len for all operations in a vector. This
is defined below as CPU_OP_VEC_DATA_LEN_MAX.

So each of the 16 op cannot have a len larger than PAGE_SIZE, so we
pin at most 4 pages per op (e.g. memcpy 2 pages for src, 2 pages for dst),
*and* the sum of all ops len needs to be <= 4216. So the max limit you are
interested in here is the 4216 bytes limit.

> 
>> +/*
>> + * Max. data len for overall vector. We to restrict the amount of
> 
> We to ?

fixed

> 
>> + * user-space data touched by the kernel in non-preemptible context so
>> + * we do not introduce long scheduler latencies.
>> + * This allows one copy of up to 4096 bytes, and 15 operations touching
>> + * 8 bytes each.
>> + * This limit is applied to the sum of length specified for all
>> + * operations in a vector.
>> + */
>> +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8)
> 
> Magic numbers. Please use proper defines for heavens sake.

ok, it becomes:

#define CPU_OP_MEMCPY_EXPECT_LEN4096
#define CPU_OP_EXPECT_LEN   8
#define CPU_OP_VEC_DATA_LEN_MAX \
(CPU_OP_MEMCPY_EXPECT_LEN + \
 (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)

> 
>> +#define CPU_OP_MAX_PAGES4   /* Max. pages per op. */

Actually, I'll move the CPU_OP_MAX_PAGES define to cpu_opv.c. It's not
needed in the uapi header.


>> +
>> +enum cpu_op_type {
>> +CPU_COMPARE_EQ_OP,  /* compare */
>> +CPU_COMPARE_NE_OP,  /* compare */
>> +CPU_MEMCPY_OP,  /* memcpy */
>> +CPU_ADD_OP, /* arithmetic */
>> +CPU_OR_OP,  /* bitwise */
>> +CPU_AND_OP, /* bitwise */
>> +CPU_XOR_OP, /* bitwise */
>> +CPU_LSHIFT_OP,  /* shift */
>> +CPU_RSHIFT_OP,  /* shift */
>> +CPU_MB_OP,  /* memory barrier */
>> +};
>> +
>> +/* Vector of operations to perform. Limited to 16. */
>> +struct cpu_op {
>> +int32_t op; /* enum cpu_op_type. */
>> +uint32_t len;   /* data length, in bytes. */
> 
> Please get rid of these tail comments

ok

> 
>> +union {
>> +#define TMP_BUFLEN  64
>> +#define NR_PINNED_PAGES_ON_STACK	8
> 
> 8 pinned pages on stack? Which stack?

The common case only touches a few pages, so we can keep the
pointers in an array on the kernel stack within the cpu_opv system
call.

Updating to:

/*
 * Typical invocations of cpu_opv need only a few pages. Keep the
 * struct page pointers in an array on the stack of the cpu_opv
 * system call.
 */
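The intended pattern is roughly the following (sketch only; the helper name
and signature are made up for illustration, this is not the patch code):

#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Typical vectors pin only a handful of pages, so the struct page
 * pointers fit in an on-stack array; larger vectors fall back to a
 * heap allocation which the caller must kfree().
 */
static struct page **cpu_opv_page_array(struct page **onstack,
					unsigned int nr_pages,
					bool *is_kmalloc)
{
	if (nr_pages <= NR_PINNED_PAGES_ON_STACK) {
		*is_kmalloc = false;
		return onstack;
	}
	*is_kmalloc = true;
	return kmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
}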

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-18 Thread Andy Lutomirski
On Fri, Nov 17, 2017 at 12:07 PM, Thomas Gleixner  wrote:
> On Fri, 17 Nov 2017, Andi Kleen wrote:
>> > The most straight forward is to have a mechanism which forces everything
>> > into the slow path in case of debugging, lack of progress, etc. The slow
>>
>> That's the abort address, right?
>
> Yes.
>
>> For the generic case the fall back path would require disabling preemption
>> unfortunately, for which we don't have a mechanism in user space.
>>
>> I think that is what Mathieu tried to implement here with this call.
>
> Yes. preempt disabled execution of byte code to make sure that the
> transaction succeeds.
>
> But, why is disabling preemption mandatory? If stuff fails due to hitting a
> breakpoint or because it retried a gazillion times without progress, then
> the abort code can detect that and act accordingly. Pseudo code:
>
> abort:
> if (!slowpath_required() &&
> !breakpoint_caused_abort() &&
> !stall_detected()) {
> do_the_normal_abort_postprocessing();
> goto retry;
> }
>
> lock(slowpath_lock[cpu]);
>
> if (!slowpath_required()) {
> unlock(slowpath_lock[cpu]);
> goto retry;
> }
>
> if (rseq_supported)
> set_slow_path();
>
> /* Same code as inside the actual rseq */
> do_transaction();
>
> if (rseq_supported)
> unset_slow_path();
>
> unlock(slowpath_lock[cpu]);

My objection to this approach is that people will get it wrong and not
notice until it's too late.  TSX has two things going for it:

1. It's part of the ISA, so debuggers have very well-defined semantics
to deal with and debuggers will know about it.  rseq is a made-up
Linux thing and debuggers may not know what to do with it.

2. TSX is slow and crappy, so it may not be that widely used.  glibc,
OTOH, will probably start using rseq on all machines if the patches
are merged.


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Mathieu Desnoyers wrote:
> - On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote:
> 7) Allow libraries with multi-part algorithms to work on same per-cpu
>data without affecting the allowed cpu mask
> 
> I stumbled on an interesting use-case within the lttng-ust tracer
> per-cpu buffers: the algorithm needs to update a "reserve" counter,
> serialize data into the buffer, and then update a "commit" counter
> _on the same per-cpu buffer_. My goal is to use rseq for both reserve
> and commit.
> 
> Clearly, if rseq reserve fails, the algorithm can retry on a different
> per-cpu buffer. However, it's not that easy for the commit. It needs to
> be performed on the same per-cpu buffer as the reserve.
> 
> The cpu_opv system call solves that problem by receiving the cpu number
> on which the operation needs to be performed as argument. It can push
> the task to the right CPU if needed, and perform the operations there
> with preemption disabled.

If your transaction cannot be done in one go, then abusing that byte code
interpreter for concluding it is just hilarious. That whole exercise is a
gazillion times slower than the atomic operations which are necessary to do
it without all that.

I'm even more convinced now that this is overengineered beyond repair.

Thanks,

tglx


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Andi Kleen wrote:
> > The most straight forward is to have a mechanism which forces everything
> > into the slow path in case of debugging, lack of progress, etc. The slow
> 
> That's the abort address, right?

Yes.

> For the generic case the fall back path would require disabling preemption
> unfortunately, for which we don't have a mechanism in user space.
> 
> I think that is what Mathieu tried to implement here with this call.

Yes. preempt disabled execution of byte code to make sure that the
transaction succeeds.

But, why is disabling preemption mandatory? If stuff fails due to hitting a
breakpoint or because it retried a gazillion times without progress, then
the abort code can detect that and act accordingly. Pseudo code:

abort:
if (!slowpath_required() &&
!breakpoint_caused_abort() &&
!stall_detected()) {
do_the_normal_abort_postprocessing();
goto retry;
}

lock(slowpath_lock[cpu]);

if (!slowpath_required()) {
unlock(slowpath_lock[cpu]);
goto retry;
}

if (rseq_supported)
set_slow_path();

/* Same code as inside the actual rseq */
do_transaction();

if (rseq_supported)
unset_slow_path();

unlock(slowpath_lock[cpu]);

The only interesting question is how to make sure that all threads on that
CPU see the slowpath required before they execute the commit so they are
forced into the slow path. The simplest thing would be atomics, but that's
what rseq wants to avoid.

I think that this can be solved cleanly with the help of the membarrier
syscall or some variant of that without all that 'yet another byte code
interpreter' mess.
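A minimal user-space sketch of that arbitration (all names are illustrative,
and the fast path is assumed to check the flag before it commits):

#include <pthread.h>
#include <stdatomic.h>

struct percpu_slowpath {
	pthread_mutex_t lock;
	atomic_bool required;	/* forces fast-path users into the slow path */
};

static void do_slowpath(struct percpu_slowpath *sp,
			void (*do_transaction)(void *), void *arg)
{
	pthread_mutex_lock(&sp->lock);
	atomic_store_explicit(&sp->required, true, memory_order_release);
	/*
	 * A membarrier()-style barrier (or plain atomics on the fast path)
	 * is needed here so concurrent rseq commits observe the flag
	 * before they commit.
	 */
	do_transaction(arg);	/* same code as inside the rseq c.s. */
	atomic_store_explicit(&sp->required, false, memory_order_release);
	pthread_mutex_unlock(&sp->lock);
}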

The other question is whether do_transaction() is required to run on that
specific CPU. I don't think so because that magic interpreter operates even
when the required target cpu is offline and with locking in place there is
no reason why running on the target CPU would be required.

Sure, that's going to affect performance, but only for two cases:

  1) Debugging. That's completely uninteresting

  2) No progress at all. Performance is down the drain anyway, so it does
 not matter at all whether you spend a few more cycles or not to
 resolve that.

I might be missing something as usual :)

Thanks

tglx


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Andi Kleen
> The most straight forward is to have a mechanism which forces everything
> into the slow path in case of debugging, lack of progress, etc. The slow

That's the abort address, right?

For the generic case the fall back path would require disabling preemption
unfortunately, for which we don't have a mechanism in user space.

I think that is what Mathieu tried to implement here with this call.

There may be some special cases where it's possible without preemption
control, e.g. a malloc could just not use the per cpu cache. But I doubt
that is possible in all cases that Mathieu envisions. 

But again would be a different code path, and I question the need for it when
we can just let the operator of the debugger deal with it.

-Andi


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Thomas Gleixner
On Fri, 17 Nov 2017, Andi Kleen wrote:
> > 1) Handling single-stepping from tools
> > 
> > Tools like debuggers, and simulators like record-replay ("rr") use
> > single-stepping to run through existing programs. If core libraries start
> 
> No rr doesn't use single stepping. It uses branch stepping based on the
> PMU, and those should only happen on external events or syscalls which would
> abort the rseq anyways.
> 
> Eventually it will succeed because not every retry there will be a new
> signal. If you put a syscall into your rseq you will just get
> what you deserve.
> 
> If it was single stepping it couldn't execute the vDSO (and it would
> be incredibly slow)
>
> Yes debuggers have to skip instead of step, but that's easily done
> (and needed today already for every gettimeofday, which tends to be
> the most common syscall...)

The same problem exists with TSX. Hitting a break point inside a
transaction section triggers abort with the cause bit 'breakpoint' set.

rseq can be looked at as a variant of software transactional memory with a
limited feature set, but the underlying problems are exactly the
same. Breakpoints are not working either.

> > to use restartable sequences for e.g. memory allocation, this means
> > pre-existing programs cannot be single-stepped, simply because the
> > underlying glibc or jemalloc has changed.
> 
> But how likely is it that the fall back path even works?
> 
> It would never be exercised in normal operation, so it would be a prime
> target for bit rot, or not ever being tested and be broken in the first place.

What's worse is that the fall back path cannot be debugged at all. It's a
magic byte code which is interpreted inside the kernel.


I really do not understand why this does not utilize existing design
patterns well known from transactional memory.

The most straight forward is to have a mechanism which forces everything
into the slow path in case of debugging, lack of progress, etc. The slow
path uses traditional locking to resolve the situation. That's well known
to work and if done correctly then the only difference between slow path
and fast path is locking/transaction control, i.e. it's a single
implementation of the actual memory operations which can be single stepped,
debug traced and whatever.

It solves _ALL_ of the problems you describe including support for systems
which do not support rseq at all.

This syscall is horribly overengineered and creates more problems than it
solves.

Thanks,

tglx



Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Andi Kleen
Thanks for the detailed write up. That should have been in the
changelog...

Some comments below. Overall I think the case for the syscall is still
very weak.

> Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
> changelog and documentation to include that information.
> 
> 1) Handling single-stepping from tools
> 
> Tools like debuggers, and simulators like record-replay ("rr") use
> single-stepping to run through existing programs. If core libraries start

No rr doesn't use single stepping. It uses branch stepping based on the
PMU, and those should only happen on external events or syscalls which would
abort the rseq anyways.

Eventually it will succeed because not every retry there will be a new
signal. If you put a syscall into your rseq you will just get
what you deserve.

If it was single stepping it couldn't execute the vDSO (and it would
be incredibly slow)

Yes debuggers have to skip instead of step, but that's easily done
(and needed today already for every gettimeofday, which tends to be
the most common syscall...) 

> to use restartable sequences for e.g. memory allocation, this means
> pre-existing programs cannot be single-stepped, simply because the
> underlying glibc or jemalloc has changed.

But how likely is it that the fall back path even works?

It would never be exercised in normal operation, so it would be a prime
target for bit rot, or not ever being tested and be broken in the first place.

> Having a performance-related library improvement break tooling is likely to
> cause a big push-back against wide adoption of rseq. *I* would not even
> be using rseq in liburcu and lttng-ust until gdb gets updated in every
> distributions that my users depend on. This will likely happen... never.

I suspect your scheme already has a <50% likelihood of working due to
the above that it's equivalent.

> 
> 
> 2) Forward-progress guarantee
> 
> Having a piece of user-space code that stops progressing due to
> external conditions is pretty bad. We are used to think of fast-path and
> slow-path (e.g. for locking), where the contended vs uncontended cases
> have different performance characteristics, but each need to provide some
> level of progress guarantees.

We already have that in the vDSO. Has never been a problem.

> 3) Handling page faults
> 
> If we get creative enough, it's pretty easy to come up with corner-case
> scenarios where rseq does not progress without the help from cpu_opv. For
> instance, a system with swap enabled which is under high memory pressure
> could trigger page faults at pretty much every rseq attempt. I recognize
> that this scenario is extremely unlikely, but I'm not comfortable making
> rseq the weak link of the chain here.

Seems very unlikely. But if this happens the program is dead anyways,
so doesn't make much difference.


> The main difference between LL/SC and rseq is that debuggers had
> to support single-stepping through LL/SC critical sections from the
> get go in order to support a given architecture. For rseq, we're
> adding critical sections into pre-existing applications/libraries,
> so the user expectation is that tools don't break due to a library
> optimization.

I would argue that debugging some path other than the one that is normally
executed is wrong by definition. How would you find the bug if it is
in the original path only, but not in the fallback?

The whole point of debugging is to punch through abstractions,
but you're adding another layer of obfuscation here. And worse
you have no guarantee that the new layer is actually functionally
equivalent.

Having less magic and just assuming the user can do the right thing
seems like a far more practical scheme.

> In order to facilitate integration of rseq into user-space, cpu_opv
> can provide a (relatively slower) architecture-agnostic implementation
> of rseq. This means that user-space code can be ported to all
> architectures through use of cpu_opv initially, and have the fast-path
> use rseq whenever the asm code is implemented.

While that's in theory correct, in practice it will be so slow
that it is useless. Nobody will want a system call in their malloc
fast path.


-Andi



Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner t...@linutronix.de wrote:

> On Thu, 16 Nov 2017, Andi Kleen wrote:
>> My preference would be just to drop this new super ugly system call.
>> 
>> It's also not just the ugliness, but the very large attack surface
>> that worries me here.
>> 
>> As far as I know it is only needed to support single stepping, correct?
> 
> I can't figure that out because the changelog describes only WHAT the patch
> does and not WHY. Useful, isn't it?
> 
>> Then this whole mess would disappear.
> 
> Agreed. That would be much appreciated.

Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
changelog and documentation to include that information.

1) Handling single-stepping from tools

Tools like debuggers, and simulators like record-replay ("rr") use
single-stepping to run through existing programs. If core libraries start
to use restartable sequences for e.g. memory allocation, this means
pre-existing programs cannot be single-stepped, simply because the
underlying glibc or jemalloc has changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they want.
However, this requires upgrading tools, and still breaks single-stepping
in case where glibc or jemalloc is updated, but not the tooling.

Having a performance-related library improvement break tooling is likely to
cause a big push-back against wide adoption of rseq. *I* would not even
be using rseq in liburcu and lttng-ust until gdb gets updated in every
distribution that my users depend on. This will likely happen... never.


2) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to
external conditions is pretty bad. We are used to thinking of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases
have different performance characteristics, but each needs to provide some
level of progress guarantees.

I'm very concerned about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for trouble
when real life happens: page faults, uprobes, and other unforeseen
conditions that can, however rarely, prevent a rseq fast-path from ever
progressing.


3) Handling page faults

If we get creative enough, it's pretty easy to come up with corner-case
scenarios where rseq does not progress without the help from cpu_opv. For
instance, a system with swap enabled which is under high memory pressure
could trigger page faults at pretty much every rseq attempt. I recognize
that this scenario is extremely unlikely, but I'm not comfortable making
rseq the weak link of the chain here.


4) Comparison with LL/SC

Anyone versed in the load-link/store-conditional instructions of
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had
to support single-stepping through LL/SC critical sections from the
get go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries,
so the user expectation is that tools don't break due to a library
optimization.


5) Perform maintenance operations on per-cpu data

rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the
other hand, the cpu_opv system call can combine a sequence of operations
that need to be executed with preemption disabled. While slower than
rseq, this allows for more complex maintenance operations to be
performed on per-cpu data concurrently with rseq fast-paths, in cases
where it's not possible to map those sequences of ops to a rseq.
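For instance, a vector that checks a word, copies a block of data, and then
adds to a per-cpu counter could be assembled along these lines (sketch only:
the union member names and the local mirror of the op codes are assumptions,
not the actual uapi layout):

#include <stdint.h>

/* Hypothetical mirror of the uapi definitions; the real layout may differ. */
enum { CPU_COMPARE_EQ_OP, CPU_COMPARE_NE_OP, CPU_MEMCPY_OP, CPU_ADD_OP };

struct cpu_op_sketch {
	int32_t op;
	uint32_t len;
	union {
		struct { uint64_t a, b; } compare;	/* addresses of both operands */
		struct { uint64_t dst, src; } copy;
		struct { uint64_t p; int64_t count; } add;
	} u;
};

static void build_vector(struct cpu_op_sketch v[3],
			 uint64_t *word, uint64_t *expected,
			 void *dst, const void *src, uint32_t len)
{
	/* 1) Abort the whole vector if *word changed since preparation. */
	v[0] = (struct cpu_op_sketch){ .op = CPU_COMPARE_EQ_OP,
		.len = sizeof(*word),
		.u.compare = { (uintptr_t)word, (uintptr_t)expected } };
	/* 2) Copy the payload. */
	v[1] = (struct cpu_op_sketch){ .op = CPU_MEMCPY_OP, .len = len,
		.u.copy = { (uintptr_t)dst, (uintptr_t)src } };
	/* 3) Publish by adding to the per-cpu word. */
	v[2] = (struct cpu_op_sketch){ .op = CPU_ADD_OP, .len = sizeof(*word),
		.u.add = { (uintptr_t)word, (int64_t)len } };
}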


6) Use cpu_opv as generic implementation for architectures not
   implementing rseq assembly code

rseq critical sections require architecture-specific user-space code
to be crafted in order to port an algorithm to a given architecture.
In addition, it requires that the kernel architecture implementation
adds hooks into signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv
can provide a (relatively slower) architecture-agnostic implementation
of rseq. This means that user-space code can be ported to all
architectures through use of cpu_opv initially, and have the fast-path
use rseq whenever the asm code is implemented.


7) Allow libraries with multi-part algorithms to work on same per-cpu
   data without affecting the allowed cpu mask

I stumbled on 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-17 Thread Thomas Gleixner
On Thu, 16 Nov 2017, Andi Kleen wrote:
> My preference would be just to drop this new super ugly system call. 
> 
> It's also not just the ugliness, but the very large attack surface
> that worries me here.
> 
> As far as I know it is only needed to support single stepping, correct?

I can't figure that out because the changelog describes only WHAT the patch
does and not WHY. Useful, isn't it?

> Then this whole mess would disappear.

Agreed. That would be much appreciated.

Thanks,

tglx


Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-16 Thread Andi Kleen

My preference would be just to drop this new super ugly system call. 

It's also not just the ugliness, but the very large attack surface
that worries me here.

As far as I know it is only needed to support single stepping, correct?

We already have other code that cannot be single stepped, most
prominently the ring 3 vdso time functions that rely on seq locks.

The code that actually cannot be single stepped is very small
and only a few lines.

As far as I know this wasn't ever a significant problem for anybody, and
there is always a simple workaround (set a temporary break point
after it and continue) 

Same should apply to the new rseq regions. They should be all
tiny, and we should just accept that they cannot be single-stepped,
but only skipped.

Then this whole mess would disappear.

-Andi




Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-16 Thread Thomas Gleixner
On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
> +#ifdef __KERNEL__
> +# include 
> +#else/* #ifdef __KERNEL__ */

   Sigh.

> +# include 
> +#endif   /* #else #ifdef __KERNEL__ */
> +
> +#include 
> +
> +#ifdef __LP64__
> +# define CPU_OP_FIELD_u32_u64(field) uint64_t field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> + field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## _padding
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \
> + field = (intptr_t)v, field ## _padding = 0
> +#endif

So in the rseq patch you have:

+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field) uint64_t field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+   __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field) uint32_t field ## _padding, field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
+   field ## _padding = 0, field = (intptr_t)v
+#else
+# define RSEQ_FIELD_u32_u64(field) uint32_t field, field ## _padding
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v) \
+   field = (intptr_t)v, field ## _padding = 0
+#endif

IOW the same macro maze. Please use a separate header file which provides
these macros once and share them between the two facilities.

> +#define CPU_OP_VEC_LEN_MAX   16
> +#define CPU_OP_ARG_LEN_MAX   24
> +/* Max. data len per operation. */
> +#define CPU_OP_DATA_LEN_MAX  PAGE_SIZE

That's something between 4K and 256K depending on the architecture. 

You really want to allow up to 256K data copy with preemption disabled?
Shudder.

> +/*
> + * Max. data len for overall vector. We to restrict the amount of

We to ?

> + * user-space data touched by the kernel in non-preemptible context so
> + * we do not introduce long scheduler latencies.
> + * This allows one copy of up to 4096 bytes, and 15 operations touching
> + * 8 bytes each.
> + * This limit is applied to the sum of length specified for all
> + * operations in a vector.
> + */
> +#define CPU_OP_VEC_DATA_LEN_MAX  (4096 + 15*8)

Magic numbers. Please use proper defines for heavens sake.

> +#define CPU_OP_MAX_PAGES 4   /* Max. pages per op. */
> +
> +enum cpu_op_type {
> + CPU_COMPARE_EQ_OP,  /* compare */
> + CPU_COMPARE_NE_OP,  /* compare */
> + CPU_MEMCPY_OP,  /* memcpy */
> + CPU_ADD_OP, /* arithmetic */
> + CPU_OR_OP,  /* bitwise */
> + CPU_AND_OP, /* bitwise */
> + CPU_XOR_OP, /* bitwise */
> + CPU_LSHIFT_OP,  /* shift */
> + CPU_RSHIFT_OP,  /* shift */
> + CPU_MB_OP,  /* memory barrier */
> +};
> +
> +/* Vector of operations to perform. Limited to 16. */
> +struct cpu_op {
> + int32_t op; /* enum cpu_op_type. */
> + uint32_t len;   /* data length, in bytes. */

Please get rid of these tail comments

> + union {
> +#define TMP_BUFLEN   64
> +#define NR_PINNED_PAGES_ON_STACK 8

8 pinned pages on stack? Which stack?

> +/*
> + * The cpu_opv system call executes a vector of operations on behalf of
> + * user-space on a specific CPU with preemption disabled. It is inspired
> + * from readv() and writev() system calls which take a "struct iovec"

s/from/by/

> + * array as argument.
> + *
> + * The operations available are: comparison, memcpy, add, or, and, xor,
> + * left shift, and right shift. The system call receives a CPU number
> + * from user-space as argument, which is the CPU on which those
> + * operations need to be performed. All preparation steps such as
> + * loading pointers, and applying offsets to arrays, need to be
> + * performed by user-space before invoking the system call. The

loading pointers and applying offsets? That makes no sense.

> + * "comparison" operation can be used to check that the data used in the
> + * preparation step did not change between preparation of system call
> + * inputs and operation execution within the preempt-off critical
> + * section.
> + *
> + * The reason why we require all pointer offsets to be calculated by
> + * user-space beforehand is because we need to use get_user_pages_fast()
> + * to first pin all pages touched by each operation. This takes care of

That doesnt explain it either.

> + * faulting-in the pages.  Then, preemption is disabled, and the
> + * operations are performed atomically with respect to other thread
> + * execution on that CPU, without generating any page fault.
> + *
> + * A maximum limit of 16 operations per cpu_opv syscall 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-15 Thread Mathieu Desnoyers
- On Nov 15, 2017, at 2:44 AM, Michael Kerrisk mtk.manpa...@gmail.com wrote:

> Hi Matthieu
> 
> On 14 November 2017 at 21:03, Mathieu Desnoyers
>  wrote:
>> This new cpu_opv system call executes a vector of operations on behalf
>> of user-space on a specific CPU with preemption disabled. It is inspired
>> from readv() and writev() system calls which take a "struct iovec" array
>> as argument.
> 
> Do you have a man page for this syscall already?

Hi Michael,

It's the next thing on my roadmap when the syscall reaches mainline.
That and membarrier commands man pages updates.

Thanks,

Mathieu

> 
> Thanks,
> 
> Michael
> 
> 
>> The operations available are: comparison, memcpy, add, or, and, xor,
>> left shift, right shift, and mb. The system call receives a CPU number
>> from user-space as argument, which is the CPU on which those operations
>> need to be performed. All preparation steps such as loading pointers,
>> and applying offsets to arrays, need to be performed by user-space
>> before invoking the system call. The "comparison" operation can be used
>> to check that the data used in the preparation step did not change
>> between preparation of system call inputs and operation execution within
>> the preempt-off critical section.
>>
>> The reason why we require all pointer offsets to be calculated by
>> user-space beforehand is because we need to use get_user_pages_fast() to
>> first pin all pages touched by each operation. This takes care of
>> faulting-in the pages. Then, preemption is disabled, and the operations
>> are performed atomically with respect to other thread execution on that
>> CPU, without generating any page fault.
>>
>> A maximum limit of 16 operations per cpu_opv syscall invocation is
>> enforced, so user-space cannot generate a too long preempt-off critical
>> section. Each operation is also limited to a length of PAGE_SIZE bytes,
>> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
>> pages for source, 2 pages for destination if addresses are not aligned
>> on page boundaries). Moreover, a total limit of 4216 bytes is applied
>> to operation lengths.
>>
>> If the thread is not running on the requested CPU, a new
>> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
>> If the requested CPU is not part of the cpus allowed mask of the thread,
>> the system call fails with EINVAL. After the migration has been
>> performed, preemption is disabled, and the current CPU number is checked
>> again and compared to the requested CPU number. If it still differs, it
>> means the scheduler migrated us away from that CPU. Return EAGAIN to
>> user-space in that case, and let user-space retry (either requesting the
>> same CPU number, or a different one, depending on the user-space
>> algorithm constraints).
>>
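
The EINVAL/EAGAIN semantics above translate into a small user-space retry
loop. The sketch below is not taken from the patch: the wrapper name, the
syscall-number placeholder, the argument order and the opaque struct cpu_op
declaration are assumptions for illustration; only the retry-on-EAGAIN /
fail-on-EINVAL behaviour comes from the description quoted above.

/* Hedged sketch of a user-space caller; not taken from the patch. */
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cpu_opv
#define __NR_cpu_opv 999        /* placeholder: take the real number from the patched kernel headers */
#endif

struct cpu_op;                  /* opaque here; defined by the RFC's uapi header */

/* Assumed argument order: operation vector, op count, target cpu, flags. */
static int cpu_opv(struct cpu_op *ops, int nr_ops, int cpu, int flags)
{
        return syscall(__NR_cpu_opv, ops, nr_ops, cpu, flags);
}

/*
 * Slow path: retry while the scheduler migrates us off the requested CPU
 * (EAGAIN); give up if the CPU is not in the thread's allowed mask (EINVAL)
 * or on any other error.
 */
static int cpu_opv_retry(struct cpu_op *ops, int nr_ops, int cpu)
{
        int ret;

        do {
                ret = cpu_opv(ops, nr_ops, cpu, 0);
        } while (ret < 0 && errno == EAGAIN);
        return ret;
}
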
>> Signed-off-by: Mathieu Desnoyers 
>> CC: "Paul E. McKenney" 
>> CC: Peter Zijlstra 
>> CC: Paul Turner 
>> CC: Thomas Gleixner 
>> CC: Andrew Hunter 
>> CC: Andy Lutomirski 
>> CC: Andi Kleen 
>> CC: Dave Watson 
>> CC: Chris Lameter 
>> CC: Ingo Molnar 
>> CC: "H. Peter Anvin" 
>> CC: Ben Maurer 
>> CC: Steven Rostedt 
>> CC: Josh Triplett 
>> CC: Linus Torvalds 
>> CC: Andrew Morton 
>> CC: Russell King 
>> CC: Catalin Marinas 
>> CC: Will Deacon 
>> CC: Michael Kerrisk 
>> CC: Boqun Feng 
>> CC: linux-...@vger.kernel.org
>> ---
>>
>> Changes since v1:
>> - handle CPU hotplug,
>> - cleanup implementation using function pointers: We can use function
>>   pointers to implement the operations rather than duplicating all the
>>   user-access code.
>> - refuse device pages: Performing cpu_opv operations on io map'd pages
>>   with preemption disabled could generate long preempt-off critical
>>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>>   a device page is received as a parameter.
>> - restrict op vector to 4216 bytes length sum: Restrict the operation
>>   vector to length sum of:
>>   - 4096 bytes (typical page size on most architectures, should be
>> enough for a string, or structures)
>>   - 15 * 8 bytes (typical operations on integers or pointers).
>>   The goal here is to keep the preempt-off critical section short, so
>>   we don't add significant scheduler latency.
>> - Add INIT_ONSTACK macro: Introduce the
>>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>>   stack to 0 on 32-bit architectures.
>> - Add CPU_MB_OP operation:
>>   Use-cases with:
>>   - two consecutive stores,
>>   - a memcpy followed by a store,
>>   require a memory barrier before the final store operation. A typical
>>   use-case is a store-release on the final store. Given that this is a
>>   slow path, just providing an explicit full barrier instruction should
>>   be sufficient (see the sketch after this changelog).
>> - Add expect fault field:
>>   The use-case of list_pop brings interesting challenges. With rseq, we
>>   can use rseq_cmpnev_storeoffp_load(), and 
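
To make the CPU_MB_OP point concrete: a memcpy followed by a commit store
needs a full barrier between the two so the payload is visible before the
commit is observed. The sketch below is illustration-only; the
operation-type names and the descriptor layout are invented for this
example and do not match the RFC's uapi.

#include <stddef.h>
#include <stdint.h>

/* Illustration-only op descriptor; the real struct cpu_op differs. */
enum opv_kind { OPV_MEMCPY, OPV_MB, OPV_STORE64 };

struct opv_op {
        enum opv_kind kind;
        void *dst;
        const void *src;        /* OPV_MEMCPY only */
        size_t len;             /* OPV_MEMCPY only */
        uint64_t val;           /* OPV_STORE64 only */
};

/*
 * A "push" expressed as three ops: copy the payload into the slot, issue a
 * full barrier, then perform the commit store that publishes the new offset.
 * The barrier guarantees the payload is visible before the commit store.
 */
static void build_push_ops(struct opv_op ops[3], void *slot,
                           const void *payload, size_t len,
                           uint64_t *offset_word, uint64_t new_offset)
{
        ops[0] = (struct opv_op){ .kind = OPV_MEMCPY, .dst = slot,
                                  .src = payload, .len = len };
        ops[1] = (struct opv_op){ .kind = OPV_MB };
        ops[2] = (struct opv_op){ .kind = OPV_STORE64, .dst = offset_word,
                                  .val = new_offset };
}
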

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-14 Thread Michael Kerrisk (man-pages)
Hi Mathieu,

On 14 November 2017 at 21:03, Mathieu Desnoyers
 wrote:
> This new cpu_opv system call executes a vector of operations on behalf
> of user-space on a specific CPU with preemption disabled. It is inspired
> by the readv() and writev() system calls, which take a "struct iovec"
> array as an argument.

Do you have a man page for this syscall already?

Thanks,

Michael


> The operations available are: comparison, memcpy, add, or, and, xor,
> left shift, right shift, and mb. The system call receives a CPU number
> from user-space as an argument, which is the CPU on which those operations
> need to be performed. All preparation steps such as loading pointers,
> and applying offsets to arrays, need to be performed by user-space
> before invoking the system call. The "comparison" operation can be used
> to check that the data used in the preparation step did not change
> between preparation of system call inputs and operation execution within
> the preempt-off critical section.
>
> The reason why we require all pointer offsets to be calculated by
> user-space beforehand is because we need to use get_user_pages_fast() to
> first pin all pages touched by each operation. This takes care of
> faulting-in the pages. Then, preemption is disabled, and the operations
> are performed atomically with respect to other thread execution on that
> CPU, without generating any page fault.
>
> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate a too long preempt-off critical
> section. Each operation is also limited to a length of PAGE_SIZE bytes,
> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
> pages for source, 2 pages for destination if addresses are not aligned
> on page boundaries). Moreover, a total limit of 4216 bytes is applied
> to operation lengths.
>
> If the thread is not running on the requested CPU, a new
> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
> If the requested CPU is not part of the cpus allowed mask of the thread,
> the system call fails with EINVAL. After the migration has been
> performed, preemption is disabled, and the current CPU number is checked
> again and compared to the requested CPU number. If it still differs, it
> means the scheduler migrated us away from that CPU. Return EAGAIN to
> user-space in that case, and let user-space retry (either requesting the
> same CPU number, or a different one, depending on the user-space
> algorithm constraints).
>
> Signed-off-by: Mathieu Desnoyers 
> CC: "Paul E. McKenney" 
> CC: Peter Zijlstra 
> CC: Paul Turner 
> CC: Thomas Gleixner 
> CC: Andrew Hunter 
> CC: Andy Lutomirski 
> CC: Andi Kleen 
> CC: Dave Watson 
> CC: Chris Lameter 
> CC: Ingo Molnar 
> CC: "H. Peter Anvin" 
> CC: Ben Maurer 
> CC: Steven Rostedt 
> CC: Josh Triplett 
> CC: Linus Torvalds 
> CC: Andrew Morton 
> CC: Russell King 
> CC: Catalin Marinas 
> CC: Will Deacon 
> CC: Michael Kerrisk 
> CC: Boqun Feng 
> CC: linux-...@vger.kernel.org
> ---
>
> Changes since v1:
> - handle CPU hotplug,
> - cleanup implementation using function pointers: We can use function
>   pointers to implement the operations rather than duplicating all the
>   user-access code.
> - refuse device pages: Performing cpu_opv operations on io map'd pages
>   with preemption disabled could generate long preempt-off critical
>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>   a device page is received as a parameter.
> - restrict op vector to 4216 bytes length sum: Restrict the operation
>   vector to length sum of:
>   - 4096 bytes (typical page size on most architectures, should be
> enough for a string, or structures)
>   - 15 * 8 bytes (typical operations on integers or pointers).
>   The goal here is to keep the preempt-off critical section short, so
>   we don't add significant scheduler latency.
> - Add INIT_ONSTACK macro: Introduce the
>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>   stack to 0 on 32-bit architectures.
> - Add CPU_MB_OP operation:
>   Use-cases with:
>   - two consecutive stores,
>   - a memcpy followed by a store,
>   require a memory barrier before the final store operation. A typical
>   use-case is a store-release on the final store. Given that this is a
>   slow path, just providing an explicit full barrier instruction should
>   be sufficient.
> - Add expect fault field:
>   The use-case of list_pop brings interesting challenges. With rseq, we
>   can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
>   compare it against NULL, add an offset, and load the target "next"
> pointer from the object, all within a single rseq critical section.
>
>   Life is not so easy for cpu_opv in this use-case, mainly because we
>   need to pin all pages we are going to touch in the preempt-off
>   critical section beforehand. So we need to 

Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

2017-11-14 Thread Mathieu Desnoyers
- On Nov 14, 2017, at 3:03 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:
[...]
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3b448ba82225..cab256c1720a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p,
> unsigned int cpu)
> #endif
> }
> 
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);

Testing with CONFIG_SMP=n showed that I needed to add an empty static
inline (returning 0) for the !SMP case.
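
A minimal sketch of that stub, mirroring the quoted declaration in
kernel/sched/sched.h (the !SMP body is assumed from the description above,
i.e. an empty static inline returning 0):

/* kernel/sched/sched.h (sketch) */
#ifdef CONFIG_SMP
int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
#else
/* On UP there is nowhere to migrate to, so the push trivially succeeds. */
static inline int push_task_to_cpu(struct task_struct *p,
                                   unsigned int dest_cpu)
{
        return 0;
}
#endif
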

Mathieu


> +
> /*
>  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>  */

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

