Re: Revisiting uvm_loan() for 'direct' write pipes

2018-06-10 Thread Jaromír Doleček
2018-05-25 23:19 GMT+02:00 Jason Thorpe :
> BTW, I was thinking about this, and I think you need to also handle the case
> where you try the ubc_uiomove_direct() case, but then *fall back* onto the
> non-direct case if some magic error is returned.
> ...
> These cache flushes could be potentially very expensive.

I don't think this would actually make sense performance-wise. The
decision to fall back would have to be made within the
pmap_direct_process() function, which runs only after all the
preparation steps have already been done. If we abort the operation
there and call into the non-direct code path, it could end up even
slower than the cache flush. For UBC the fallback would not be too
terrible to code and debug, but it would be very messy e.g. for the
uvm_loan() path in sys_pipe.c.
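
For illustration, the fallback would have to look roughly like the
sketch below - ubc_uiomove_direct() here stands for the direct variant
from my patch, and the EAGAIN convention for "direct map would need a
cache flush" is just an assumption for the example:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/uio.h>
#include <uvm/uvm_extern.h>

/*
 * Sketch only.  Try the direct-map path; if the MD layer reports that
 * using the direct mapping would force expensive cache flushes
 * (signalled here by a hypothetical EAGAIN), redo the whole transfer
 * via the classic pmap_enter()-based ubc_uiomove() path.  All the
 * preparation done by the direct attempt is thrown away, and a
 * partially-advanced uio would have to be rewound (glossed over here),
 * which is why I doubt this would be a win.
 */
static int
ubc_uiomove_with_fallback(struct uvm_object *uobj, struct uio *uio,
    vsize_t todo, int advice, int flags)
{
        int error;

        error = ubc_uiomove_direct(uobj, uio, todo, advice, flags);
        if (error == EAGAIN)
                error = ubc_uiomove(uobj, uio, todo, advice, flags);

        return error;
}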

I don't know what to do for such archs, but I guess it would need to
be evaluated whether the direct map can be used there in a way that
actually improves performance in the general case.

2018-05-26 16:52 GMT+02:00 Thor Lancelot Simon :
> In this case, could we do better by gathering the IPIs so the cost is
> amortized over many pages?

Yes, we could gather the IPIs for this particular TLB flush. It seems
that on my system it would be enough to gather at least 3-4 pages to
match the speed. There is a limit to how far this scales, though -
right now we have only 16 TLB IPI slots, which would be enough for
maybe one parallel pipe setup. I've noticed DragonFly BSD did some
work on asynchronous TLB IPIs; maybe it would be possible to look
there for inspiration.
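
Something like the shape below, purely as illustration - neither
helper exists today, both names are made up:

#include <sys/param.h>
#include <uvm/uvm.h>

/*
 * Batched variant of what uvm_loan() does per page now: downgrade all
 * pages of one loan request with the remote TLB invalidations
 * deferred, then pay for a single synchronous IPI round at the end
 * instead of one per page.
 */
void pmap_page_protect_defer(struct vm_page *, vm_prot_t);      /* invented */
void pmap_tlb_flush_deferred(void);                             /* invented */

static void
uvm_loan_wprotect_batch(struct vm_page **pgs, int npages)
{
        int i;

        for (i = 0; i < npages; i++)
                pmap_page_protect_defer(pgs[i], VM_PROT_READ);

        pmap_tlb_flush_deferred();
}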

I do plan some tweaks to the TLB flush IPIs. I want to change the x86
pmap to not send TLB flush IPIs to idle processors, as was discussed
earlier for kern/53124. This would reduce the performance impact of
the IPIs when there are many idle processors. ACPI wakeup also seems
to take time linear in the number of CPUs - I think that is something
we'd want to fix too.
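
The idle-CPU part could be as simple as an extra test when choosing
shootdown targets - the field names below are as on x86, and the race
with a CPU leaving the idle loop at the wrong moment (which is the
hard part discussed in kern/53124) is completely glossed over:

#include <sys/param.h>
#include <sys/cpu.h>

/*
 * Illustration only: don't send a shootdown IPI to a CPU that is
 * currently running its idle lwp, on the assumption that its TLB gets
 * invalidated anyway before it next touches the pmap.
 */
static bool
tlb_ipi_needed(struct cpu_info *ci)
{
        if (ci->ci_curlwp == ci->ci_data.cpu_idlelwp)
                return false;
        return true;
}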

That said, no matter which way we slice it, anything which triggers
IPIs or TLB flushes inherently does not scale on MP systems. If
uvm_loan() is to be used as a high-performance zero-copy optimization,
it simply must avoid them. Something which triggers an interrupt on
all active processors for every processed page simply can't perform.

Jaromir


Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-26 Thread Thor Lancelot Simon
On Fri, May 25, 2018 at 10:01:15PM +0200, Jaromír Doleček wrote:
> 2018-05-21 21:49 GMT+02:00 Jaromír Doleček :
> It turned out uvm_loan() incurs most of the overhead. I'm still trying
> to figure out exactly what makes it so much slower than uiomove().
> 
> I've now pinned the problem down to the pmap_page_protect(...,
> VM_PROT_READ), that code does page table manipulations and triggers
> synchronous IPIs. So basically the same problem as the UBC code in
> uvm_bio.c.

There's always going to be some critical size beneath which the cost of
the MMU manipulations (or, these days, the interprocessor communication
to cause other CPUs to do _their_ MMU manipulations) outweighs the benefit
of avoiding copies.  This problem's been known all the way back to
Mach on the VAX, where they discovered that for typical message sizes
to/from the microkernel, mapping instead of copying was definitely a lose.
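
In other words there's always a break-even size; the pipe code already
expresses this as a minimum-size gate on its direct path (PIPE_MINDIRECT,
if I remember right), something like the sketch below, where the constant
is illustrative rather than measured:

#include <sys/param.h>

/*
 * Below some size, the fixed cost of the mapping/IPI work exceeds the
 * cost of simply copying the data.  The real code also checks
 * alignment and buffer availability; the threshold here is a
 * placeholder, not a measured break-even point.
 */
#define DIRECT_MIN_BYTES        (64 * 1024)     /* made-up value */

static bool
use_zero_copy_path(size_t resid)
{
        return resid >= DIRECT_MIN_BYTES;
}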

In this case, could we do better by gathering the IPIs so the cost is
amortized over many pages?

-- 
  Thor Lancelot Simon   t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-26 Thread Jason Thorpe


> On May 25, 2018, at 1:01 PM, Jaromír Doleček  
> wrote:
> 
> So, I'm actually thinking of changing uvm_loan() to not enforce R/O
> mappings and to leave page attributes unchanged. It would require
> the caller to deal with possible COW or PG_RDONLY if they need to do
> writes. In other words, allow using the 'loan' mechanics for writes
> as well, and eventually use this for UBC writes too, to replace the
> global PG_BUSY lock there.


There are important reasons why the current API revokes write permission on the 
pages that it loans out.  uvm_loan() returns a page array, so technically the 
kernel could map them read-write if it wanted to.  But the idea is that 
everyone has a read-only-but-COW view, so that if the owner of the page 
modifies it, the owner gets a new copy of the page and the loanee keeps a 
stable view.  It was designed to mirror the semantics of e.g. write(2).
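
For reference, the "returns a page array" usage is roughly this - a
minimal sketch of the UVM_LOAN_TOPAGE case, assuming uva/len are
already page-aligned and skipping the error paths, much as the pipe
code does:

#include <sys/param.h>
#include <uvm/uvm.h>

/*
 * Loan a user buffer out as a page array.  With the current semantics
 * the owner's mappings are downgraded to read-only-but-COW, so the
 * kernel side sees a stable snapshot of the data.
 */
static int
loan_user_buffer(struct vm_map *map, vaddr_t uva, vsize_t len,
    struct vm_page **pgs)
{
        int error;

        error = uvm_loan(map, uva, len, pgs, UVM_LOAN_TOPAGE);
        if (error)
                return error;

        /* ... map or direct-process the pages here ... */

        uvm_unloanpage(pgs, len >> PAGE_SHIFT);
        return 0;
}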

In the case of the pipe code, it appears the sender blocks until the receiver 
has read the entire loaned region, so this is mitigated a bit (in theory, with 
your suggested change, it could be possible for some other thread in the 
sending process to scribble on the loaned buffer, but this wouldn’t really be 
any different from the non-direct-but-blocked case, because you never know how 
much has been uiomove()’d into the kernel at any given point, only that once 
the write() call returns, the kernel owns whatever data was written and 
expects it not to change).

I suppose I don’t really object to the change so long as there’s an explicit 
behavior choice in the loan API.

Huh, I just noticed that sosend() has its page loaning support disabled in 
the MULTIPROCESSOR case.  That’s a bummer.

-- thorpej



Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-26 Thread Jason Thorpe


> On May 21, 2018, at 12:49 PM, Jaromír Doleček  
> wrote:
> 
> Mostly because I want to
> have at least one other consumer of the interface before I consider it
> final, to make sure it covers the general use cases.


BTW, I was thinking about this, and I think you need to also handle the case 
where you try the ubc_uiomove_direct() case, but then *fall back* onto the 
non-direct case if some magic error is returned.  The idea here is that some 
CPUs will have to flush caches if the cache index of the direct-mapping of the 
page is incompatible with the cache index of any of the other mappings of the 
page (in the case of virtually-indexed caches).  These cache flushes could be 
potentially very expensive.

Either that, or have a global “opt-in to PMAP_DIRECT I/O” that affects all of 
the code paths you want to support, that can be called conditionally from 
machine-dependent code.
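
Something along these lines is what I have in mind - nothing like it
exists today, the names are invented, and the MD example is just the
x86 case where the direct map is physically indexed:

#include <sys/types.h>

/* MI side: generic code checks this before trying a direct path. */
bool uvm_direct_io_ok = false;

/* MD side: set at bootstrap if the direct map is alias-free/cheap. */
void
pmap_direct_io_init(void)
{
        uvm_direct_io_ok = true;        /* x86: no VIVT-cache aliasing */
}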

-- thorpej



Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-25 Thread Jaromír Doleček
2018-05-21 21:49 GMT+02:00 Jaromír Doleček :
> It turned out uvm_loan() incurs most of the overhead. I'm still trying
> to figure out exactly what makes it so much slower than uiomove().

I've now pinned the problem down to the pmap_page_protect(...,
VM_PROT_READ), that code does page table manipulations and triggers
synchronous IPIs. So basically the same problem as the UBC code in
uvm_bio.c.

If I comment out the pmap_page_protect() in uvm_loan.c and hence do
not change the vm_page attributes, the uvm_loan() + direct map pipe
variant manages about 13 GB/s, compared to about 12 GB/s for the
regular pipe. An 8% speedup is not much, but as a bonus it removes all
the KVA limits. Since it should scale well, it should be possible to
reduce the direct threshold, and to shrink the fixed in-kernel pipe
buffer to save kernel memory.

So, I'm actually thinking of changing uvm_loan() to not enforce R/O
mappings and to leave page attributes unchanged. It would require
the caller to deal with possible COW or PG_RDONLY if they need to do
writes. In other words, allow using the 'loan' mechanics for writes
as well, and eventually use this for UBC writes too, to replace the
global PG_BUSY lock there.
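
Concretely, this could be a new loan flag rather than a change of the
default - the name, value and semantics below are just a proposal:

/* In uvm_loan.h, next to UVM_LOAN_TOANON / UVM_LOAN_TOPAGE: */
#define UVM_LOAN_NOWPROT        0x04    /* proposed: don't
                                           pmap_page_protect() the loaned
                                           pages; the caller deals with
                                           COW / PG_RDONLY itself */

The pipe direct-write path would then pass UVM_LOAN_TOPAGE |
UVM_LOAN_NOWPROT, while all existing callers keep today's semantics.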

WYT?

Jaromir


Revisiting uvm_loan() for 'direct' write pipes

2018-05-21 Thread Jaromír Doleček
Hello,

I've been playing a little with revisiting kern/sys_pipe.c to take
advantage of the direct map in order to avoid the pmap_enter() et al.
via the new uvm_direct_process() interface. Mostly because I want to
have at least one other consumer of the interface before I consider it
final, to make sure it covers the general use cases.
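
To give an idea of the shape (this is not the actual patch, and the
(pages, offset, length, callback, arg) form of uvm_direct_process()
below is from memory, so it may not match the final interface exactly):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/uio.h>

/*
 * Reader-side callback: copy from a direct-mapped window over the
 * loaned source page straight into the reading process.
 */
static int
pipe_direct_copyout_cb(void *kva, size_t len, void *arg)
{
        struct uio *uio = arg;

        return uiomove(kva, len, uio);
}

/*
 * With pgs[] filled in by uvm_loan(..., UVM_LOAN_TOPAGE), the read
 * side then does, roughly:
 *
 *      error = uvm_direct_process(pgs, npages, offset, xfer,
 *          pipe_direct_copyout_cb, uio);
 */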

I've managed to get it working. However, the loan mechanism is much
slower than just copying the data through the intermediate 16 KB pipe
buffer - I get about 12 GB/s on a regular pipe and about 4 GB/s on the
'direct' pipe, even when using huge input buffers like 1 MB at a time.

For the regular pipe, CPU usage is about the same for producer and
consumer, around 50%; for the 'direct' pipe the producer (the one who
does the loan) is at about 80% and the consumer at 5%.

It turned out uvm_loan() incurs most of the overhead. I'm still trying
to figure out exactly what makes it so much slower than uiomove().

Any hints on which parts I should concentrate on?

Jaromir