Re: Revisiting uvm_loan() for 'direct' write pipes
2018-05-25 23:19 GMT+02:00 Jason Thorpe :
> BTW, I was thinking about this, and I think you need to also handle the case
> where you try the ubc_uiomove_direct() case, but then *fall back* onto the
> non-direct case if some magic error is returned.
> ...
> These cache flushes could be potentially very expensive.

I don't think it would actually make sense performance-wise to do this - the decision to fall back would have to be made within the pmap_direct_process() function, which runs only after all the preparation steps have already been done. If we abort the operation there and call the non-direct code path, it could end up even slower than the cache flush.

For UBC the fallback would not be too terrible to code and debug, but a fallback would be very messy e.g. for the uvm_loan() in sys_pipe.c.

I don't know what to do for such archs, but I guess it would need to be evaluated whether the direct map could be used there in a way which actually improves performance for the general case.

2018-05-26 16:52 GMT+02:00 Thor Lancelot Simon :
> In this case, could we do better by gathering the IPIs so the cost was
> amortized over many pages?

Yes, we could gather the IPIs for this particular TLB flush. It seems that on my system gathering at least 3-4 pages would be enough to match the speed. There is a limit to how far this could be scaled though - right now we have only 16 TLB IPI slots, which would be enough for maybe one parallel pipe setup. I've noticed DragonflyBSD did some work on asynchronous TLB IPIs; maybe it would be possible to look there for inspiration.

I do plan some tweaks for the TLB flush IPIs. I want to change the x86 pmap to not send TLB flush IPIs to idle processors, as was discussed earlier for kern/53124. This would reduce the performance impact of the IPIs when there are many idle processors. It also seems ACPI wakeup takes time linear in the number of CPUs - something we'd want to fix as well.

That said, regardless of which way we slice it, anything which triggers IPIs or TLB flushes inherently just does not scale on MP systems. If uvm_loan() is to be used as a high-performance zero-copy optimization, it simply must avoid them. Anything which triggers an interrupt on all active processors for every processed page simply can't perform.

Jaromir
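To make the gathering idea concrete, here is a minimal sketch of what batching the shootdowns per loan could look like; the pmap_tlb_batch_* and pmap_page_protect_deferred() names are assumptions for illustration, not existing NetBSD interfaces:

/*
 * Hypothetical sketch only: downgrade a whole run of loaned pages
 * to R/O, queueing the PTE changes, and issue a single shootdown
 * IPI at the end instead of one synchronous IPI per page.
 */
static void
loan_protect_batched(struct vm_page **pgs, int npages)
{
	struct pmap_tlb_batch b;	/* assumed accumulator type */
	int i;

	pmap_tlb_batch_init(&b);
	for (i = 0; i < npages; i++) {
		/* queue the R/O downgrade, defer the remote flush */
		pmap_page_protect_deferred(pgs[i], VM_PROT_READ, &b);
	}
	/* one IPI (bounded by the available TLB IPI slots) per batch */
	pmap_tlb_batch_fini(&b);
}

With something along these lines, the IPI cost would be paid once per uvm_loan() call rather than once per page, which is roughly the 3-4 page break-even mentioned above.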
Re: Revisiting uvm_loan() for 'direct' write pipes
On Fri, May 25, 2018 at 10:01:15PM +0200, Jaromír Doleček wrote:
> 2018-05-21 21:49 GMT+02:00 Jaromír Doleček :
> > It turned out uvm_loan() incurs most of the overhead. I'm still trying
> > to figure out exactly what makes it so much slower than uiomove().
>
> I've now pinned the problem down to the pmap_page_protect(...,
> VM_PROT_READ) call: that code does page table manipulations and triggers
> synchronous IPIs. So it is basically the same problem as the UBC code in
> uvm_bio.c.

There's always going to be some critical size beneath which the cost of the MMU manipulations (or, these days, the interprocessor communication to cause other CPUs to do _their_ MMU manipulations) outweighs the benefit of avoiding copies.

This problem's been known all the way as far back as Mach on the VAX, where they discovered that for typical message sizes to/from the microkernel, mapping instead of copying was definitely a lose.

In this case, could we do better by gathering the IPIs so the cost was amortized over many pages?

--
Thor Lancelot Simon                                  t...@panix.com
"The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."
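As a rough illustration of such a critical size, the pipe path could keep a copy/loan cut-off along these lines; the threshold value and helper names below are assumptions for the sketch, not existing or measured code:

#define PIPE_LOAN_MIN	(64 * 1024)	/* assumed break-even size */

static int
pipe_write_choose(struct pipe *wpipe, struct uio *uio)
{
	/*
	 * Sketch only: below the break-even size, the PTE downgrades
	 * and shootdown IPIs cost more than the copy they avoid, so
	 * stay on the bounce-buffer path.
	 */
	if (uio->uio_resid < PIPE_LOAN_MIN)
		return pipe_write_copy(wpipe, uio);	/* 16 KB buffer */
	return pipe_write_direct(wpipe, uio);		/* uvm_loan() path */
}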
Re: Revisiting uvm_loan() for 'direct' write pipes
> On May 25, 2018, at 1:01 PM, Jaromír Doleček wrote:
>
> So, I'm actually thinking of changing uvm_loan() to not enforce R/O
> mappings and to leave page attributes unchanged. It would require
> the caller to deal with possible COW or PG_RDONLY if they need to do
> writes. In other words, allow using the 'loan' mechanics also for
> writes, and eventually use this also for UBC writes to replace the
> global PG_BUSY lock there.

There are important reasons why the current API revokes write permission on the pages that it loans out. uvm_loan() returns a page array, so technically the kernel could map them read-write if it wanted to. But the idea is that everyone has a read-only-but-COW view, so that if the owner of the page modifies it, it gets a new copy of the page and the loanee keeps a stable view. It was designed to mirror the semantics of e.g. write(2).

In the case of the pipe code, it appears the sender blocks until the receiver has read the entire loaned region, so this is mitigated a bit (in theory, with your suggested change, it could be possible for some other thread in the sending process to scribble the loaned buffer, but this wouldn’t really be any different than in the non-direct-but-blocked case, because you never know how much has been uiomove()’d into the kernel at any given point, only that once the write() call returns, the kernel owns whatever data was written and expects it not to change).

I suppose I don’t really object to the change so long as there’s an explicit behavior choice in the loan API.

Huh, I just noticed that sosend() has the page loaning support disabled in the MULTIPROCESSOR case. That’s a bummer.

-- thorpej
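For context, a simplified sketch of how a caller like the pipe code obtains the loaned pages today; the variable names, array bound, and length handling are assumptions rather than the exact sys_pipe.c code, and the read-only-but-COW view described above is what gives the loanee its stable snapshot:

	struct vm_page *pgs[PIPE_DIRECT_MAXPAGES];	/* assumed bound */
	vaddr_t va = (vaddr_t)uio->uio_iov->iov_base;
	vaddr_t base = trunc_page(va);
	vsize_t len = round_page(va + xfer_len) - base;	/* xfer_len assumed */
	int error;

	/*
	 * Loan the sender's pages into a page array.  Today uvm_loan()
	 * write-protects the source mappings, so a later write by the
	 * sender triggers COW and the loanee keeps a stable snapshot.
	 */
	error = uvm_loan(&curproc->p_vmspace->vm_map, base, len,
	    pgs, UVM_LOAN_TOPAGE);
	if (error)
		return error;	/* fall back to the copying path */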
Re: Revisiting uvm_loan() for 'direct' write pipes
> On May 21, 2018, at 12:49 PM, Jaromír Doleček wrote:
>
> Mostly since I want to
> have at least one other consumer of the interface before I consider it
> final, to make sure it covers the general use cases.

BTW, I was thinking about this, and I think you need to also handle the case where you try the ubc_uiomove_direct() case, but then *fall back* onto the non-direct case if some magic error is returned.

The idea here is that some CPUs will have to flush caches if the cache index of the direct mapping of the page is incompatible with the cache index of any of the other mappings of the page (in the case of virtually-indexed caches). These cache flushes could be potentially very expensive.

Either that, or have a global “opt-in to PMAP_DIRECT I/O” that affects all of the code paths you want to support, that can be called conditionally from machine-dependent code.

-- thorpej
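A sketch of the suggested fallback shape; the sentinel error value is made up here for illustration, and the ubc_uiomove_direct() signature is assumed to mirror ubc_uiomove():

	/*
	 * Sketch only: the MD code returns a distinguished error (the
	 * name EPMAPNODIRECT is invented here) when using the direct
	 * map would force an expensive cache flush, and the caller
	 * retries via the ordinary mapped path.
	 */
	error = ubc_uiomove_direct(uobj, uio, todo, advice, flags);
	if (error == EPMAPNODIRECT)
		error = ubc_uiomove(uobj, uio, todo, advice, flags);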
Re: Revisiting uvm_loan() for 'direct' write pipes
2018-05-21 21:49 GMT+02:00 Jaromír Doleček :
> It turned out uvm_loan() incurs most of the overhead. I'm still trying
> to figure out exactly what makes it so much slower than uiomove().

I've now pinned the problem down to the pmap_page_protect(..., VM_PROT_READ) call: that code does page table manipulations and triggers synchronous IPIs. So it is basically the same problem as the UBC code in uvm_bio.c.

If I comment out the pmap_page_protect() in uvm_loan.c and hence do not change the vm_page attributes, the uvm_loan() + direct map pipe variant manages about 13 GB/s, compared to about 12 GB/s for the regular pipe. An 8% speedup is not much, but as a bonus it removes all the KVA limits. Since it should scale well, it should be possible to lower the direct threshold and shrink the fixed in-kernel pipe buffer to save kernel memory.

So, I'm actually thinking of changing uvm_loan() to not enforce R/O mappings and to leave page attributes unchanged. It would require the caller to deal with possible COW or PG_RDONLY if they need to do writes. In other words, allow using the 'loan' mechanics also for writes, and eventually use this also for UBC writes to replace the global PG_BUSY lock there.

WYT?

Jaromir
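To show where the cost comes from, the downgrade in the loan path looks roughly like the call below; the UVM_LOAN_NOPROTECT flag is a made-up name for the proposed opt-out, not an existing flag:

	/*
	 * In the loan path, this R/O downgrade is what does the page
	 * table manipulation and triggers the synchronous shootdown
	 * IPIs on MP.  A hypothetical opt-out flag would let callers
	 * that handle COW/PG_RDONLY themselves skip it.
	 */
	if ((flags & UVM_LOAN_NOPROTECT) == 0)
		pmap_page_protect(pg, VM_PROT_READ);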
Revisiting uvm_loan() for 'direct' write pipes
Hello,

I've been playing a little with revisiting kern/sys_pipe.c to take advantage of the direct map, in order to avoid pmap_enter() et al. via the new uvm_direct_process() interface. Mostly since I want to have at least one other consumer of the interface before I consider it final, to make sure it covers the general use cases.

I've managed to get it working. However, the loan mechanism is way slower than just copying the data via the intermediate 16 KB pipe buffer - I get about 12 GB/s on the regular pipe, and about 4 GB/s on the 'direct' pipe, even when using huge input buffers like 1 MB at a time. For the regular pipe, CPU usage is about the same for producer and consumer, around 50%; for the 'direct' pipe, the producer (the one who does the loan) is at about 80% and the consumer at 5%.

It turned out uvm_loan() incurs most of the overhead. I'm still trying to figure out exactly what makes it so much slower than uiomove().

Any hints on which parts I should concentrate on?

Jaromir
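For reference, the throughput numbers above come from pushing large buffers through a pipe; a minimal userland benchmark of that general shape (not the actual test program used) could look like this:

#include <sys/wait.h>
#include <err.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)			/* 1 MB writes */
#define TOTAL	(64UL * 1024 * 1024 * 1024)	/* 64 GB through the pipe */

int
main(void)
{
	static char buf[CHUNK];
	int fds[2];
	unsigned long done;
	ssize_t n;
	struct timespec t0, t1;
	double secs;

	if (pipe(fds) == -1)
		err(1, "pipe");

	switch (fork()) {
	case -1:
		err(1, "fork");
	case 0:				/* child: consumer, drains the pipe */
		close(fds[1]);
		while ((n = read(fds[0], buf, sizeof(buf))) > 0)
			continue;
		_exit(0);
	default:			/* parent: producer, times the transfer */
		close(fds[0]);
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (done = 0; done < TOTAL; done += CHUNK)
			if (write(fds[1], buf, CHUNK) != CHUNK)
				err(1, "write");
		close(fds[1]);
		wait(NULL);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		secs = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%.2f GB/s\n", TOTAL / secs / 1e9);
	}
	return 0;
}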