On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <t...@linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >> This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plumming away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless, it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> address cache line by cache line.
>>
>> There is a crossover point where wbinvd is better than a clwb loop
>> that needs to be determined.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in a
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume that DAX using application on CPU 0-3 and some other unrelated
> workload on CPU4-7. That flush will
>
> - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)
>
> - Drain the cache for CPU4-7 for no reason if done with wbinvd()
>
> - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few micro seconds here. Assume that
> CPU4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back then when the graphic folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
> - are tracking where it was used and never flush on innocent cpus
>
> - one can design his application in a way that it uses different
>   processes so cross CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>
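For concreteness, the "clwb loop" in question is just a cache-line-granular
write-back walk over the range that may have been dirtied, followed by a
fence. A minimal userspace-flavoured sketch, assuming 64-byte cache lines
and the _mm_clwb()/_mm_sfence() intrinsics (build with -mclwb); the
function and constant names are made up here, and this is not the code from
the patch set, only the shape of the alternative to wbinvd():

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE_SIZE 64       /* assumed line size, not queried from CPUID */

static void flush_dax_range(void *addr, size_t len)
{
        /* Align down to the first cache line covering the range. */
        uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE_SIZE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        if (!len)
                return;

        /* Write back (without invalidating) every line in [addr, addr + len). */
        for (; p < end; p += CACHELINE_SIZE)
                _mm_clwb((void *)p);

        /* Order the write-backs before whatever "data is persistent" step follows. */
        _mm_sfence();
}

The cost of this loop scales with the amount of dirty data, which is exactly
what creates the crossover point against a fixed-cost, but cache- and
CPU-hostile, wbinvd().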
Thanks for that explanation. Peter had alluded to it at KS, but I indeed
did not know that it was as horrible as milliseconds of latency, hmm...

One other mitigation that follows on from Dave's plan for per-inode DAX
control is to also track when an inode has a writable DAX mmap
established (rough sketch below). With that we could have a REQ_DAX flag
to augment REQ_FLUSH and potentially reduce the violence committed on the
cache. In an earlier thread I also recall an idea to have an mmap flag
that an app can use to say "yes, I'm doing a writable DAX mapping, but
I'm taking care of the cache myself".

We could track innocent cpus, but I'm thinking that would require a core
change to write-protect pages when a thread migrates?

In general I feel there's a limit to how much hardware workaround is
reasonable to do in the core kernel versus waiting for the platform to
offer better options...

Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.
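To make the inode tracking idea a bit more concrete, here is a rough
sketch. The names (writable_dax_mmaps, dax_writable_mmap_open/close,
dax_cache_flush_needed) are invented for illustration, a real version
would track per inode rather than in one global counter, and none of this
is in the posted patches:

#include <stdatomic.h>
#include <stdbool.h>

/* Writable DAX mmaps currently established; per inode in reality,
 * a single global counter here only for brevity. */
static atomic_int writable_dax_mmaps;

/* Hypothetical hook in the mmap path of a DAX-capable file. */
static void dax_writable_mmap_open(void)
{
        atomic_fetch_add(&writable_dax_mmaps, 1);
}

/* Hypothetical hook when such a mapping is torn down. */
static void dax_writable_mmap_close(void)
{
        atomic_fetch_sub(&writable_dax_mmaps, 1);
}

/* On REQ_FLUSH (or a hypothetical REQ_DAX-augmented flush): only pay for
 * the cache flush (clwb loop or wbinvd) when a writable DAX mapping
 * could actually have dirtied CPU caches. */
static bool dax_cache_flush_needed(void)
{
        return atomic_load(&writable_dax_mmaps) > 0;
}

That still says nothing about which CPUs actually dirtied lines, which is
where the write-protect-on-migrate idea, or simply turning DAX off, would
have to come in.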