As v4.3 Linux now sports no direct usage of MTRR calls anymore, the exported symbols mtrr_add() are no longer directly available to drivers, they must the PAT compatible arch_phys_wc_add() from then on. This is a huge win for Linux on the x86 front for a few non-Xen related reasons, one being that in the future on x86 Linux we may soon be able to flip the switch of default ioremap_nocache() from UC- to UC for PAT systems. On Xen however this has other series of benefits, the biggest one in particular was that we didn't have to end up implementing MTRR hypervisor call support from Linux guests out. Another side benefit of that is that long term none of the Linux MTRR x86 code is ever called on Linux Xen guests as Xen guests go blessed with the MTRR MSR unset. If you look at the Linux MTRR code its a huge convoluted mess, which includes calling stop_machine() on each CPU during bootup, resume, and CPU online, and of course whenever an MTRR is would have been set.
I'd like to review some last concerns and notes. First, long ago Toshi had mentioned that even if the kernel does not use MTRR directly the BIOS may have, and in some cases this is very likely. Its only recently became clear why, he notes that as far as he can tell, BIOS can *only* use MTRR to specify a UC cache attribute on x86. I recently asked for confirmation on this [0], given that as I see it future BIOSes can deprecate MTRR it'd be beneficial to Linux as we avoid all that convoluted MTRR code, and thereby also enabling parity in functionality with Linux Xen guests in so far as MTRR. Toshi notes this is likely not possible though even in the future, so we'll see why. In the meantime this also means we should consider as a last measure for MTRR consideration on Xen at the very least BIOS use of MTRR, so we can ensure we translate the respective setup done by the BIOS to guests, even if they only use PAT. I started looking at what Xen does and I have a few notes on the existing hypervisor implementation and I'd like to review with you and confirm the behavior and see if we need anything else. Keir Fraser long ago in 2009 committed a change on Xen which clips the available RAM to Xen on the MTRRs [1], and later made it only applicable to Intel system [2] while also enabling a boot parameter [no-]e820-mtrr-clip to either force enable this for any system or force disable this clipping. Reviewing that logic it would seem that its trying to confirm that all MTRRs are set to WB by default, and if so, it leaves the MTRR ranges as part of the e820 memory given to a guest. If the default is not WB, it iterates over the variable ranges (Linux set_num_var_ranges() on Intel uses rdmsr(MSR_MTRRcap, config, dummy); as with the xen implementation for the mtrr_cap) and tries to look for the highest WB range (and notes "overlapping UC/WT ranges dominate"?), and once it has that it trims the memory up to the highest WB MTRR, if WB was not default. This seems to look very similar to what Linux does on mtrr_cleanup(), only, if I understand this correctly, Linux disregards this cleanup also if the number of variable WB MTRRs + number of variable UC entries matches the total number of variable ranges (see Linux mtrr_need_cleanup()). I wonder if this needs fixing / updating, but also since it seems we trim the MTRRs if any UC MTRR is encountered up to the last WB MTRR, I wonder if this would suffice to address BIOS UC concerns. If anything and I understood this, it would seem this just discards UC MTRRs from the guests. Fan control was mentioned as one example use of UC by the BIOS, is the BIOS then in full control of the fan when this is done? This seems to only be done for Intel as well matching Linux' mtrr_cleanup(), do we not need such considerations for non-Intel CPUs? Toshi noted a while ago as well that if BIOS/firmware enables MTRR but the kernel does not have it enabled one issue might have been any MTRRs set up by the BIOS and ensuring the mapping is respected, in particular UC settings, this concern is raised above. Another issue though is that the kernel would be "unable to verify if a large page mapping is aligned with MTRRs" [3], so mtrr_type_lookup() would have to return a valid type as it runs on the platform. For Linux this means we'd have to implement a get_mtrr() call for Xen... only to use the hypercall XENPF_read_memtype. I looked at this prospect but is rather odd, given MTRRs would be disabled on the MSR, I'm rather afraid of hacking on stuff on top of Linux's MTRR to address this requirement, we currently do not support enabling just one MTRR hook if MTRRs are disabled but you're a guest and we need to ask the hypervisor for the MTRR type... Supporting this for Xen to me seems like a terrible idea. Instead I wonder if we can address this concern on Xen by simply now tacking off all MTRRs completely from the memory given out to for the e820 map to guests, keeping all MTRRs internal to the hypervisor. Or does the cleanup solution in place already suffice? I would hope this is reasonable given that all drivers, on Linux at least, now don't use MTRR directly (with the exception of ivtv, a legacy driver, and ipath, on its way out of the kernel). With regards to the Linux cleanup code, some discussion on this also had stalled a while ago. Back in August 2015 Stewards proposed extending the mtrr_cleanup() code for Linux to support systems with more than 4 GiB of RAM where it seems mtrr_cleaup() will fail with large memory configurations because it limits chunk_size to 2GB, meaning each MTRR can only cover 2GB of memory [4]. He noted that some systems with say 256GiB or RAM may have ten variable MTRRs, it may not be possible to use MTRRs to cover all of memory. He extended the MTRR chunk size to support larger memory -- but since we no longer use MTRR on kernel drivers upstream I raised the question if we even need the MTRR cleanup code anymore. If we don't want to interfere with BIOS MTRR setup can we just ignore MTRRs on Linux on the e820 map as well? Or do we need that for some kernel interfaces? If so which are they? [0] http://lkml.kernel.org/r/20160315232916.gj1...@wotan.suse.de [1[ http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=522b335995907366ff995a36a8098bc6b1e4cdf1 [2] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=9d85142e675142191f64d73aa40791a03a9f7389 [3] http://lkml.kernel.org/r/1441322474.3277.78.ca...@hpe.com [4] http://lkml.kernel.org/r/55e47b4d.1050...@gmail.com [5] http://lkml.kernel.org/r/CAB=ne6x3ix5psp2u6owrav73cfp+jbh+ct0ek8bnvw1ft-5...@mail.gmail.com Luis