Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared
I thought you basically create an OperationRegion of SystemMemory type, and off you go. Maybe the OSPM in Linux is clever and protects some memory, I wouldn't know. I investigated this now, and it looks like acpi is using ioremap_cache(). We can hook into that and force non sharing. It's probably safe to assume that this is not used on real IO devices. I think there are still some other BIOS mappings that use just plain ioremap() though. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
But that was due to performance problems in hot paths. Nothing of this applies here. It applies because a new API that individual driver authors is being proposed and that's an ongoing maintenance burden that might be mitigated by hiding that implementation detail from leaf drivers. Right now we're only talking about 2 places to change, and none of those are actually in individual drivers, but in the virtio generic code and in the MSI code. While there might be drivers in the future that do it directly it will be always the exception, normal drivers don't have to deal with this. For me it both seems very straight forward and simple (but then I'm biased) You seem to be having a difficult time iterating this proposal toward consensus. I don't think the base principles are being contested as much as the semantics, scope, and need for the proposed API that is in the purview of all leaf driver developers. Right now there is no leaf driver changed at all. I'd rather see more concerted efforts focused/limited core changes rather than leaf driver changes until there is a clearer definition of hardened. A hardened driver is a driver that - Had similar security (not API) oriented review of its IO operations (mainly MMIO access, but also PCI config space) as a non privileged user interface (like a ioctl). That review should be focused on memory safety. - Had some fuzzing on these IO interfaces using to be released tools. What is the intersection of ioremap() users that are outside of the proposed probe authorization regime AND want confidential computing support? Right now it's zero I believe. That is there is other low level code that sets memory shared, but it's not using ioremap, but some other mechanisms. are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early. I can see a point of having user space overrides though, but we need to have a sane kernel default that works early. Right, as I suggested [1], just enough early authorization to bootstrap/debug initramfs and then that can authorize the remainder. But how do you debug the kernel then? Making early undebuggable seems just bad policy to me. I am not proposing making the early undebuggable. That's the implication of moving the policy into initrd. If only initrd can authorize then it won't be possible to authorize before initrd, thus the early console won't work. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote: On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote: Especially in this case where the virtio use case being opted-in is *already* in a path that has been authorized by the device-filter policy engine. That's a good point. Andi, how about setting a per-device flag if its ID has been allowed and then making pci_iomap create a shared mapping transparently? Yes for pci_iomap we could do that. If someone uses raw ioremap without a device it won't work, but I don't think that's the case for virtio at least. I suppose we could solve that problem if it actually happens. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared
Interesting. VT-d tradeoffs ... what are they? The connection to the device is not encrypted and also not authenticated. This is different that even talking to the (untrusted) host through shared memory where you at least still have a common key. Allowing hypervisor to write into BIOS looks like it will trivially lead to code execution, won't it? This is not about BIOS code executing. While the guest firmware runs it is protected of course. This is for BIOS structures like ACPI tables that are mapped by Linux. While AML can run byte code it can normally not write to arbitrary memory. The risk is more that all the Linux code dealing with this hasn't been hardened to deal with malicious input. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/12/2021 12:13 PM, Dan Williams wrote: On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena wrote: I suspect the true number is even higher because that doesn't include IO inside calls to other modules and indirect pointers, correct? Actually everything should be included. Smatch has cross-function db and I am using it for getting the call chains and it follows function pointers. Also since I am starting from a list of individual read IOs, every single base read IO in drivers/* should be covered as far as I can see. But if it uses some weird IO wrappers then the actual list might be higher. Why analyze individual IO calls? I thought the goal here was to disable entire classes of ioremap() users? This is everything that would need to be moved somewhere else if we didn't disable the entire classes of ioremap users. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/12/2021 11:36 AM, Reshetova, Elena wrote: The 5.15 tree has something like ~2.4k IO accesses (including MMIO and others) in init functions that also register drivers (thanks Elena for the number) To provide more numbers on this. What I can see so far from a smatch-based analysis, we have 409 __init style functions (.probe & builtin/module_ _platform_driver_probe excluded) for 5.15 with allyesconfig. The number of distinct individual IO reads (MSRs included) is much higher than 2.4k and on the range of 30k because quite often there are more than a single IO read in the same source function. The full list of accesses and the possible call paths is huge for manually looking at, but here is the list of the 409 functions if anyone wants to take a look: Thanks Elena. I suspect the true number is even higher because that doesn't include IO inside calls to other modules and indirect pointers, correct? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/11/2021 10:31 PM, Christoph Hellwig wrote: On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote: The reason we have trouble is that it's not clear what does the API mean outside the realm of TDX. If we really, truly want an API that says "ioremap and it's a hardened driver" then I guess ioremap_hardened_driver is what you want. Yes. And why would be we ioremap the BIOS anyway? It is not I/O memory in any of the senses we generally use ioremap for. I/O memory is anything outside the kernel memory map. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
The "better safe-than-sorry" argument is hard to build consensus around. The spectre mitigations ran into similar problems where the community rightly wanted to see the details and instrument the problematic paths rather than blanket sprinkle lfence "just to be safe". But that was due to performance problems in hot paths. Nothing of this applies here. In this case the rules about when a driver is suitably "hardened" are vague and the overlapping policy engines are confusing. What is confusing exactly? For me it both seems very straight forward and simple (but then I'm biased) The policy is: - Have an allow list at driver registration. - Have an additional opt-in for MMIO mappings (and potentially config space, but that's not currently there) to cover init calls completely. I'd rather see more concerted efforts focused/limited core changes rather than leaf driver changes until there is a clearer definition of hardened. A hardened driver is a driver that - Had similar security (not API) oriented review of its IO operations (mainly MMIO access, but also PCI config space) as a non privileged user interface (like a ioctl). That review should be focused on memory safety. - Had some fuzzing on these IO interfaces using to be released tools. Right now it's only three virtio drivers (console, net, block) Really it's no different than what we do for every new unprivileged user interface. I.e. instead of jumping to the assertion that fixing up these init-path vulnerabilities are too big to fix, dig to the next level to provide more evidence that per-driver opt-in is the only viable option. For example, how many of these problematic paths are built-in to the average kernel config? I don't think arguments from "the average kernel config" (if such a thing even exists) are useful. That's would be just hand waving. A strawman might be to add a sprinkling error exits in the module_init() of the problematic drivers, and only fail if the module is built-in, and let modprobe policy handle the rest. That would be already hundreds of changes. I have no idea how such a thing could be maintained or sustained either. Really I don't even see how these alternatives can be considered. Tree sweeps should always be last resort. They're a pain for everyone. But here they're casually thrown around as alternatives to straight forward one or two line changes. Default policy in user space just seems to be a bad idea here. Who should know if a driver is hardened other than the kernel? Maintaining the list somewhere else just doesn't make sense to me. I do not understand the maintenance burden correlation of where the policy is driven vs where the list is maintained? All the hardening and auditing happens in the kernel tree. So it seems the natural place to store the result is in the kernel tree. But there's no single package for initrd, so you would need custom configurations for all the supported distros. Also we're really arguing about a list that currently has three entries. Even if I agreed with the contention that out-of-tree userspace would have a hard time tracking the "hardened" driver list there is still an in-tree userspace path to explore. E.g. perf maintains lists of things tightly coupled to the kernel, this authorized device list seems to be in the same category of data. You mean the event list? perf is in the kernel tree, so it's maintained together with the kernel. But we don't have a kernel initrd. Also there is the more practical problem that some devices are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early. I can see a point of having user space overrides though, but we need to have a sane kernel default that works early. Right, as I suggested [1], just enough early authorization to bootstrap/debug initramfs and then that can authorize the remainder. But how do you debug the kernel then? Making early undebuggable seems just bad policy to me. And if you fix if for the console why not add the two more entries for virtio net and block too? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared
I mean ... it's already wide spread. I meant wide spread usage with confidential guests. If we support it with TDX it will be used with TDX. It has some security trade offs. The main reason to use TDX is security. Also when people take the VT-d tradeoffs they might be ok with the BIOS trade offs too. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared
Presumably bios code is in arch/x86 and drivers/acpi, right? Up to 200 calls the majority of which is likely private ... Yes. I don't have better ideas but the current setup will just result in people making their guests vulnerable whenever they want to allow device pass-through. Yes that's true. For current TDX our target is virtual devices only. But if pass through usage will be really wide spread we may need to revisit. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
Because it does not end with I/O operations, that's a trivial example. module unloading is famous for being racy: I just re-read that part of virtio drivers and sure enough we have bugs there, this is after they have presumably been audited, so a TDX guest is better off just disabling hot-unplug completely, and hotplug isn't far behind. These all shouldn't matter for a confidential guest. The only way it can be attacked is through IO, everything else is protected by hardware. Also it would all require doing something at the guest level, which we assume is not malicious. Malicious filesystems can exploit many linux systems unless you take pains to limit what is mounted and how. That's expected to be handled by authenticated dmcrypt and similar. Hardening at this level has been done for many years. Networking devices tend to get into the default namespaces and can do more or less whatever CAP_NET_ADMIN can. Etc. Networking should be already hardened, otherwise you would have much worse problems today. hange in your subsystem here. Well I commented on the API patch, not the virtio patch. If it's a way for a driver to say "I am hardened and audited" then I guess it should at least say so. This is handled by the central allow list. We intentionally didn't want each driver to declare itself, but have a central list where changes will get more scrutiny than random driver code. But then there are the additional opt-ins for the low level firewall. These are in the API. I don't see how it could be done at the driver level, unless you want to pass in a struct device everywhere? How about creating a defconfig that makes sense for TDX then? TDX can be used in many different ways, I don't think a defconfig is practical. In theory you could do some Kconfig dependency (at the pain point of having separate kernel binariees), but why not just do it at run time then if you maintain the list anyways. That's much easier and saner for everyone. In the past we usually always ended up with runtime mechanism for similar things anyways. Also it turns out that the filter mechanisms are needed for some arch drivers which are not even configurable, so alone it's probably not enough, I guess they aren't really needed though right, or you won't try to filter them? We're addressing most of them with the device filter for platform drivers. But since we cannot stop them doing ioremap IO in their init code they also need the low level firewall. Some others that cannot be addressed have explicit disables. So make them configurable? Why not just fix the runtime? It's much saner for everyone. Proposing to do things at build time sounds like we're in Linux 0.99 days. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/11/2021 12:58 AM, Christoph Hellwig wrote: Just as last time: This does not make any sense. ioremap is shared by definition. It's not necessarily shared with the host for confidential computing: for example BIOS mappings definitely should not be shared, but they're using ioremap today. But if you have a better term please propose something. I tried to clarify it with "shared_host", but I don't know a better term. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared
The connection is quite unfortunate IMHO. Can't there be an option that unbreaks drivers *without* opening up security holes by making BIOS shared? That would require new low level APIs that distinguish both cases, and a tree sweep. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
To which Andi replied One problem with removing the ioremap opt-in is that it's still possible for drivers to get at devices without going through probe. To which Greg replied: https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/ If there are in-kernel PCI drivers that do not do this, they need to be fixed today. Can you guys resolve the differences here? I addressed this in my other mail, but we may need more discussion. And once they are resolved, mention this in the commit log so I don't get to re-read the series just to find out nothing changed in this respect? I frankly do not believe we are anywhere near being able to harden an arbitrary kernel config against attack. Why not? Device filter and the opt-ins together are a fairly strong mechanism. And it's not that they're a lot of code or super complicated either. You're essentially objecting to a single line change in your subsystem here. How about creating a defconfig that makes sense for TDX then? TDX can be used in many different ways, I don't think a defconfig is practical. In theory you could do some Kconfig dependency (at the pain point of having separate kernel binariees), but why not just do it at run time then if you maintain the list anyways. That's much easier and saner for everyone. In the past we usually always ended up with runtime mechanism for similar things anyways. Also it turns out that the filter mechanisms are needed for some arch drivers which are not even configurable, so alone it's probably not enough, Anyone deviating from that better know what they are doing, this API tweaking is just putting policy into the kernel ... Hardening drivers is kernel policy. It cannot be done anywhere else. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/9/2021 1:39 PM, Dan Williams wrote: On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin wrote: On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: From: Andi Kleen For Confidential VM guests like TDX, the host is untrusted and hence the devices emulated by the host or any data coming from the host cannot be trusted. So the drivers that interact with the outside world have to be hardened by sharing memory with host on need basis with proper hardening fixes. For the PCI driver case, to share the memory with the host add pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. Signed-off-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan So I proposed to make all pci mappings shared, eliminating the need to patch drivers. To which Andi replied One problem with removing the ioremap opt-in is that it's still possible for drivers to get at devices without going through probe. To which Greg replied: https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/ If there are in-kernel PCI drivers that do not do this, they need to be fixed today. Can you guys resolve the differences here? I agree with you and Greg here. If a driver is accessing hardware resources outside of the bind lifetime of one of the devices it supports, and in a way that neither modrobe-policy nor device-authorization -policy infrastructure can block, that sounds like a bug report. The 5.15 tree has something like ~2.4k IO accesses (including MMIO and others) in init functions that also register drivers (thanks Elena for the number) Some are probably old drivers that could be fixed, but it's quite a few legitimate cases. For example for platform or ISA drivers that's the only way they can be implemented because they often have no other enumeration mechanism. For PCI drivers it's rarer, but also still can happen. One example that comes to mind here is the x86 Intel uncore drivers, which support a mix of MSR, ioremap and PCI config space accesses all from the same driver. This particular example can (and should be) fixed in other ways, but similar things also happen in other drivers, and they're not all broken. Even for the broken ones they're usually for some crufty old devices that has very few users, so it's likely untestable in practice. My point is just that the ecosystem of devices that Linux supports is messy enough that there are legitimate exceptions from the "First IO only in probe call only" rule. And we can't just fix them all. Even if we could it would be hard to maintain. Using a "firewall model" hooking into a few strategic points like we're proposing here is much saner for everyone. Now we can argue about the details. Right now what we're proposing has some redundancies: it has both a device model filter and low level filter for ioremap (this patch and some others). The low level filter is for catching issues that don't clearly fit into the "enumeration<->probe" model. You could call that redundant, but I would call it defense in depth or better safe than sorry. In theory it would be enough to have the low level opt-in only, but that would have the drawback that is something gets enumerated after all you would have all kind of weird device driver failures and in some cases even killed guests. So I think it makes sense to have Fix those drivers instead of sprinkling ioremap_shared in select places and with unclear rules about when a driver is allowed to do "shared" mappings. Only add it when the driver has been audited and hardened. But I agree we need on a documented process for this. I will work on some documentation for a proposal. But essentially I think it should be some variant of what Elena has outlined in her talk at Security Summit. https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf That is using extra auditing/scrutiny at review time, supported with some static code analysis that points to the interaction points, and code needs to be fuzzed explicitly. However short term it's only three virtio drivers, so this is not a urgent problem. Let the new device-authorization mechanism (with policy in userspace) Default policy in user space just seems to be a bad idea here. Who should know if a driver is hardened other than the kernel? Maintaining the list somewhere else just doesn't make sense to me. Also there is the more practical problem that some devices are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early. I can see a point of having user space overrides though, but we need to have a
Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest
On 10/2/2021 4:14 AM, Greg Kroah-Hartman wrote: On Sat, Oct 02, 2021 at 07:04:28AM -0400, Michael S. Tsirkin wrote: On Fri, Oct 01, 2021 at 08:49:28AM -0700, Andi Kleen wrote: Do you have a list of specific drivers and kernel options that you feel you now "trust"? For TDX it's currently only virtio net/block/console But we expect this list to grow slightly over time, but not at a high rate (so hopefully <10) Well there are already >10 virtio drivers and I think it's reasonable that all of these will be used with encrypted guests. The list will grow. What is keeping "all" drivers from being on this list? It would be too much work to harden them all, and it would be pointless because all these drivers are never legitimately needed in a virtualized environment which only virtualize a very small number of devices. How exactly are you determining what should, and should not, be allowed? Everything that has had reasonable effort at hardening can be added. But if someone proposes to add a driver that should trigger additional scrutiny in code review. We should also request them to do some fuzzing. It's a bit similar to someone trying to add a new syscall interface. That also triggers much additional scrutiny for good reasons and people start fuzzing it. How can drivers move on, or off, of it over time? Adding something is submitting a patch to the allow list. I'm not sure the "off" case would happen, unless the driver is completely removed, or maybe it has some unfixable security problem. But that is all rather unlikely. And why not just put all of that into userspace and have it pick and choose? That should be the end-goal here, you don't want to encode policy like this in the kernel, right? How would user space know what drivers have been hardened? This is really something that the kernel needs to determine. I don't think we can outsource it to anyone else. Also BTW of course user space can still override it, but really the defaults should be a kernel policy. There's also the additional problem that one of the goals of confidential guest is to just move existing guest virtual images into them without much changes. So it's better for such a case if as much as possible of the policy is in the kernel. But that's more a secondary consideration, the first point is really the important part. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices
Forget about trust for the moment. Let's say the goal is to prevent the kernel from creating any bindings other that those in some small "allowed" set. To fully specify one of the allowed bindings, you would have to provide both a device ID and a driver name. But in practice this isn't necessary, since a device with a given ID will bind to only one driver in almost all cases, and hence giving just the device ID is enough. So to do what they want, all that's needed is to forbid any bindings except where the device ID is "allowed". Or to put it another way, where the device's authorized flag (which can be initialized based on the device ID) is set. (The opposite approach, in which the drivers are "allowed" rather than the device IDs, apparently has already been discussed and rejected. I'm not convinced that was a good decision, but...) Does this seem like a fair description of the situation? Yes. That's roughly what the patchkit under discussion implements. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest
On 10/1/2021 12:03 AM, Greg Kroah-Hartman wrote: On Thu, Sep 30, 2021 at 12:04:05PM -0700, Kuppuswamy, Sathyanarayanan wrote: On 9/30/21 8:23 AM, Greg Kroah-Hartman wrote: On Thu, Sep 30, 2021 at 08:18:18AM -0700, Kuppuswamy, Sathyanarayanan wrote: On 9/30/21 6:36 AM, Dan Williams wrote: And in particular, not all virtio drivers are hardened - I think at this point blk and scsi drivers have been hardened - so treating them all the same looks wrong. My understanding was that they have been audited, Sathya? Yes, AFAIK, it has been audited. Andi also submitted some patches related to it. Andi, can you confirm. What is the official definition of "audited"? In our case (Confidential Computing platform), the host is an un-trusted entity. So any interaction with host from the drivers will have to be protected against the possible attack from the host. For example, if we are accessing a memory based on index value received from host, we have to make sure it does not lead to out of bound access or when sharing the memory with the host, we need to make sure only the required region is shared with the host and the memory is un-shared after use properly. You have not defined the term "audited" here at all in any way that can be reviewed or verified by anyone from what I can tell. You have only described a new model that you wish the kernel to run in, one in which it does not trust the hardware at all. That is explicitly NOT what the kernel has been designed for so far, It has been already done for a few USB/TB drivers, but yes not for the majority of the kernel. and if you wish to change that, lots of things need to be done outside of simply running some fuzzers on a few random drivers. The goal is to do similar work as USB/TB did, but do it for a small set of virtio drivers and use a custom allow list for those for the specific secure guest cases. (there are some other goals, but let's not discuss them here for now) For one example, how do you ensure that the memory you are reading from hasn't been modified by the host between writing to it the last time you did? It's similar techniques as we do on user space accesses. For example if you bound check some value the code needs to ensure it is cached in private memory, not reread from MMIO or shared memory. Of course that's a good idea anyways for performance because MMIO is slow. In the concrete cases of virtio the main problem was the free list in shared memory, but that has been addressed now. Do you have a list of specific drivers and kernel options that you feel you now "trust"? For TDX it's currently only virtio net/block/console But we expect this list to grow slightly over time, but not at a high rate (so hopefully <10) If so, how long does that trust last for? Until someonen else modifies that code? What about modifications to functions that your "audited" code touches? Who is doing this auditing? How do you know the auditing has been done correctly? Who has reviewed and audited the tools that are doing the auditing? Where is the specification that has been agreed on how the auditing must be done? And so on... Well, I mean we already have a similar situation with user space APIs. So it's not a new problem. For those we've done it for many years, with audits and extra fuzzing. There are people working on the audit and fuzzing today. How exactly it will be ensured long term is still be worked out, but I expect we can work out something. I feel like there are a lot of different things all being mixed up here into one "oh we want this to happen!" type of thread. Agreed. The thread ended up about a lot of stuff which is outside the scope of the patches. Please let's just stick to the one request that I had here, which was to move the way that busses are allowed to authorize the devices they wish to control into a generic way instead of being bus-specific logic. Any requests outside of that type of functionality are just that, outside the scope of this patchset and should get their own patch series and discussion. Yes that's the intention. This patch kit is only about controlling what devices can enumerate. Also please let's avoid the "trusted" term. It's really misleading and confusing in the context of confidential computing. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices
If all you want to do is prevent someone from loading a bunch of drivers that you have identified as unhardened, why not just use a modprobe blacklist? That wouldn't help for builtin drivers, we cannot control initcalls. This LWN article has more details on the background. https://lwn.net/Articles/865918/ -Andi Am I missing something? Alan Stern ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest
On 9/30/2021 12:04 PM, Kuppuswamy, Sathyanarayanan wrote: On 9/30/21 8:23 AM, Greg Kroah-Hartman wrote: On Thu, Sep 30, 2021 at 08:18:18AM -0700, Kuppuswamy, Sathyanarayanan wrote: On 9/30/21 6:36 AM, Dan Williams wrote: And in particular, not all virtio drivers are hardened - I think at this point blk and scsi drivers have been hardened - so treating them all the same looks wrong. My understanding was that they have been audited, Sathya? Yes, AFAIK, it has been audited. Andi also submitted some patches related to it. Andi, can you confirm. What is the official definition of "audited"? In our case (Confidential Computing platform), the host is an un-trusted entity. So any interaction with host from the drivers will have to be protected against the possible attack from the host. For example, if we are accessing a memory based on index value received from host, we have to make sure it does not lead to out of bound access or when sharing the memory with the host, we need to make sure only the required region is shared with the host and the memory is un-shared after use properly. Elena can share more details, but it was achieved with static analysis and fuzzing. Here is a presentation she is sharing about the work at the Linux Security Summit: https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf Andi, can talk more about the specific driver changes that came out of this effort. The original virtio was quite easy to exploit because it put its free list into the shared ring buffer. We had a patchkit to harden virtio originally, but after some discussion we instead switched to Jason Wang's patchkit to move the virtio metadata into protected memory, which fixed near all of the issues. These patches have been already merged. There is one additional patch to limit the virtio modes. There's an ongoing effort to audit (mostly finished I believe) and fuzz the three virtio drivers (fuzzing is still ongoing). There was also a range of changes outside virtio for code outside the device model. Most of it was just disabling it though. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest
On 9/30/2021 8:18 AM, Kuppuswamy, Sathyanarayanan wrote: On 9/30/21 6:36 AM, Dan Williams wrote: And in particular, not all virtio drivers are hardened - I think at this point blk and scsi drivers have been hardened - so treating them all the same looks wrong. My understanding was that they have been audited, Sathya? Yes, AFAIK, it has been audited. Andi also submitted some patches related to it. Andi, can you confirm. We also authorize the virtio at PCI ID level. And currently we allow console, block and net virtio PCI devices. { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_NET) }, { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_BLOCK) }, { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_CONSOLE) }, The only drivers that are being audited and fuzzed are these three virtio drivers (in addition to some other x86 code outside the driver model) -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices
I don't think the current mitigations under discussion here are about keeping the system working. In fact most encrypted VM configs tend to stop booting as a preferred way to handle security issues. Maybe we should avoid the "trusted" term here. We're only really using it because USB is using it and we're now using a common framework like Greg requested. But I don't think it's the right way to think about it. We usually call the drivers "hardened". The requirement for a hardened driver is that all interactions through MMIO/port/config space IO/MSRs are sanitized and do not cause memory safety issues or other information leaks. Other than that there is no requirement on the functionality. In particular DOS is ok since a malicious hypervisor can decide to not run the guest at any time anyways. Someone loading an malicious driver inside the guest would be out of scope. If an attacker can do that inside the guest you already violated the security mechanisms and there are likely easier ways to take over the guest or leak data. The goal of the device filter mechanism is to prevent loading unhardened drivers that could be exploited without them being themselves malicious. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices
On 9/30/2021 10:23 AM, Greg Kroah-Hartman wrote: On Thu, Sep 30, 2021 at 10:17:09AM -0700, Andi Kleen wrote: The "it" that I referred to is the claim that no driver should be touching hardware in their module init call. Andi seems to think such drivers are worth working around with a special remap API. Andi is wrong. While overall it's a small percentage of the total, there are still quite a few drivers that do touch hardware in init functions. Sometimes for good reasons -- they need to do some extra probing to discover something that is not enumerated -- sometimes just because it's very old legacy code that predates the modern driver model. Are any of them in the kernel today? PCI drivers should not be messing with this, we have had well over a decade to fix that up. It's not just PCI driver, it's every driver that can do io port / MMIO / MSR / config space accesses. Maybe read the excellent article from Jon on this: https://lwn.net/Articles/865918/ The legacy drivers could be fixed, but nobody really wants to touch them anymore and they're impossible to test. Pointers to them? For example if you look over old SCSI drivers in drivers/scsi/*.c there is a substantial number that has a module init longer than just registering a driver. As a single example look at drivers/scsi/BusLogic.c There were also quite a few platform drivers like this. The drivers that probe something that is not enumerated in a standard way have no choice, it cannot be implemented in a different way. PCI devices are not enumerated in a standard way??? The pci devices are enumerated in a standard way, but typically the driver also needs something else outside PCI that needs some other probing mechanism. So instead we're using a "firewall" the prevents these drivers from doing bad things by not allowing ioremap access unless opted in, and also do some filtering on the IO ports The device filter is still the primary mechanism, the ioremap filtering is just belts and suspenders for those odd cases. That's horrible, don't try to protect the kernel from itself. Just fix the drivers. I thought we had already established this last time we discussed it. That's completely impractical. We cannot harden thousands of drivers, especially since it would be all wasted work since nobody will ever need them in virtual guests. Even if we could harden them how would such a work be maintained long term? Using a firewall and filtering mechanism is much saner for everyone. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices
The "it" that I referred to is the claim that no driver should be touching hardware in their module init call. Andi seems to think such drivers are worth working around with a special remap API. Andi is wrong. While overall it's a small percentage of the total, there are still quite a few drivers that do touch hardware in init functions. Sometimes for good reasons -- they need to do some extra probing to discover something that is not enumerated -- sometimes just because it's very old legacy code that predates the modern driver model. The legacy drivers could be fixed, but nobody really wants to touch them anymore and they're impossible to test. The drivers that probe something that is not enumerated in a standard way have no choice, it cannot be implemented in a different way. So instead we're using a "firewall" the prevents these drivers from doing bad things by not allowing ioremap access unless opted in, and also do some filtering on the IO ports The device filter is still the primary mechanism, the ioremap filtering is just belts and suspenders for those odd cases. If you want we can send an exact list, we did some analysis using a patched smatch tool. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
Hmm, yes that's true. I guess we can make it default to opt-in for pci_iomap. It only really matters for device less ioremaps. OK. And same thing for other things with device, such as devm_platform_ioremap_resource. If we agree on all that, this will basically remove virtio changes from the picture ;) Hi we revisited this now. One problem with removing the ioremap opt-in is that it's still possible for drivers to get at devices without going through probe. For example they can walk the PCI device list. Some drivers do that for various reasons. So if we remove the opt-in we would need to audit and possibly fix all that, which would be potentially a lot of churn. That's why I think it's better to keep the opt-in. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
And we've been avoiding that drivers can self declare auditing, we've been trying to have a separate centralized list so that it's easier to enforce and avoids any cut'n'paste mistakes. -Andi Now I'm confused. What is proposed here seems to be basically that, drivers need to declare auditing by replacing ioremap with ioremap_shared. Auditing is declared on the device model level using a central allow list. Can we not have an init call allow list instead of, or in addition to, a device allow list? That would be quite complicated and intrusive. In fact I'm not even sure how to do maintain something like this. There are a lot of needed initcalls, they would all need to be marked. How can we distinguish them? It would be a giant auditing project. And of course how would you prevent it from bitrotting? Basically it would be hundreds of changes all over the tree, just to avoid two changes in virtio and MSI. Approach of just stopping the initcalls from doing bad things is much less intrusive. But this cannot do anything to initcalls that run before probe, Can't we extend module_init so init calls are validated against the allow list? See above. Also the problem isn't really with modules (we rely on udev not loading them), but with builtin initcalls that's why an extra level of defense of ioremap opt-in is useful. OK even assuming this, why is pci_iomap opt-in useful? That never happens before probe - there's simply no pci_device then. Hmm, yes that's true. I guess we can make it default to opt-in for pci_iomap. It only really matters for device less ioremaps. It looks suspiciously like drivers self-declaring auditing to me which we both seem to agree is undesirable. What exactly is the difference? Just allow listing the ioremaps is not self declaration because the device will still not initialize due to the central device filter. If you want to use it that has to be changed. It's just an additional safety net to contain code running before probe. Or are you just trying to disable anything that runs before probe? Well anything that could do dangerous host interactions (like processing ioremap data) A lot of things are harmless and can be allowed, or already blocked elsewhere (e.g. we have a IO port filter). This just handles the ioremap/MMIO case. In that case I don't see a reason to touch pci drivers though. These should be fine with just the device model list. That won't stop initcalls. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/30/2021 1:59 PM, Michael S. Tsirkin wrote: Or we can add _audited to the name. ioremap_shared_audited? But it's not the mapping that has to be done in handled special way. It's any data we get from device, not all of it coming from IO, e.g. there's DMA and interrupts that all have to be validated. Wouldn't you say that what is really wanted is just not running unaudited drivers in the first place? Yes. And we've been avoiding that drivers can self declare auditing, we've been trying to have a separate centralized list so that it's easier to enforce and avoids any cut'n'paste mistakes. -Andi Now I'm confused. What is proposed here seems to be basically that, drivers need to declare auditing by replacing ioremap with ioremap_shared. Auditing is declared on the device model level using a central allow list. But this cannot do anything to initcalls that run before probe, that's why an extra level of defense of ioremap opt-in is useful. But it's not the primary mechanism to declare a driver audited, that's the allow list. The ioremap is just another mechanism to avoid having to touch a lot of legacy drivers. If we agree on that then the original proposed semantics of "ioremap_shared" may be acceptable? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/29/2021 3:26 PM, Michael S. Tsirkin wrote: On Sun, Aug 29, 2021 at 09:17:53AM -0700, Andi Kleen wrote: Also I changing this single call really that bad? It's not that we changing anything drastic here, just give the low level subsystem a better hint about the intention. If you don't like the function name, could make it an argument instead? My point however is that the API should say that the driver has been audited, We have that status in the struct device. If you want to tie the ioremap to that we could define a ioremap_device() with a device argument and decide based on that. Or we can add _audited to the name. ioremap_shared_audited? not that the mapping has been done in some special way. For example the mapping can be in some kind of wrapper, not directly in the driver. However you want the driver validated, not the wrapper. Here's an idea: I don't think magic differences of API behavior based on some define are a good idea. That's easy to miss. That's a "COME FROM" in API design. Also it wouldn't handle the case that a driver has both private and shared ioremaps, e.g. for BIOS structures. And we've been avoiding that drivers can self declare auditing, we've been trying to have a separate centralized list so that it's easier to enforce and avoids any cut'n'paste mistakes. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
All this makes sense but ioremap is such a random place to declare driver has been audited, and it's baked into the binary with no way for userspace to set policy. Again all we will end up with is gradual replacement of all ioremap calls with ioremap_shared as people discover a given driver does not work in a VM. No the device filter will prevent that. They would need to submit patches to the central list. Or they can override it at the command line, but there is already an option to force all ioremaps to be shared. So if you just want some driver to run without caring about security you can already do that without modifying it. Besides the shared concept only makes sense for virtual devices, so if you don't have a device model for a device this will never happen either. So I don't think your scenario of all ioremaps becoming shared will ever happen. How are you going to know driver has actually been audited? what the quality of the audit was? did the people doing the auditing understand what they are auditing for? No way, right? First the primary purpose of the ioremap policy is to avoid messing with all the legacy drivers (which is 99+% of the code base) How to handle someone maliciously submitting a driver to the kernel is a completely different problem. To some degree there is trust of course. If someone says they audited and a maintainer trusts them with their statement, but they actually didn't there is a problem, but it's larger than just TDX. But in such a case the community code audit will also focus on such areas, and people interested in confidential computing security will also start taking a look. And we're also working on fuzzing, so these drivers will be fuzzed at some point, so mistakes will be eventually found. But in any case the ioremap policy is mainly to prevent having to handle this for all legacy drivers. I would rather change the few drivers that are actually needed, than all the other drivers. That's really the fundamental problem this is addressing: how to get reasonable security with all the legacy drivers out of the box without touching them. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
Let's be frank, even without encryption disabling most drivers - especially weird ones that poke at hardware before probe - is far safer than keeping them, but one loses a bunch of features. Usually we don't lose features at all. None of the legacy drivers are needed on a guest (or even a modern native system). It's all just all for old hardware. Maybe in 20+ years it can be all removed, but we can't wait that long. IOW all this hardening is nice but which security/feature tradeoff to take it a policy decision, not something kernel should do imho. There's no mechanism to push this kind of policy to user space. Users don't have control what initcalls run. At the time they execute there isn't even any user space yet. Even if they could somehow control them it's very unlikely they would understand them and make an informed decision. Doing it at build time is not feasible either since we want to run on standard distribution kernels. For modules we have a policy mechanism (prevent udev probing by preventing enumeration), and that is implemented, but only handling modules is not enough. The compiled in drivers have to be handled too, otherwise you have gaping holes in the protection. We don't prevent users manually loading modules that might probe, but that is a policy decision that users actually sensibly make in user space. Also I changing this single call really that bad? It's not that we changing anything drastic here, just give the low level subsystem a better hint about the intention. If you don't like the function name, could make it an argument instead? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/24/2021 1:31 PM, Bjorn Helgaas wrote: On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote: On 8/24/2021 11:55 AM, Bjorn Helgaas wrote: [+cc Rajat; I still don't know what "shared memory with a hypervisor in a confidential guest" means, A confidential guest is a guest which uses memory encryption to isolate itself from the host. It doesn't trust the host. But it still needs to communicate with the host for IO, so it has some special memory areas that are explicitly marked shared. These are used to do IO with the host. All their usage needs to be carefully hardened to avoid any security attacks on the guest, that's why we want to limit this interaction only to a small set of hardened drivers. For MMIO, the set is currently only virtio and MSI-X. Good material for the commit log next time around. Thanks! This is all in the patch intro too, which should make it into the merge commits. I don't think we can reexplain the basic concepts for every individual patch in a large patch kit. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/24/2021 11:55 AM, Bjorn Helgaas wrote: [+cc Rajat; I still don't know what "shared memory with a hypervisor in a confidential guest" means, A confidential guest is a guest which uses memory encryption to isolate itself from the host. It doesn't trust the host. But it still needs to communicate with the host for IO, so it has some special memory areas that are explicitly marked shared. These are used to do IO with the host. All their usage needs to be carefully hardened to avoid any security attacks on the guest, that's why we want to limit this interaction only to a small set of hardened drivers. For MMIO, the set is currently only virtio and MSI-X. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
I see. Hmm. It's a bit of a random thing to do it at the map time though. E.g. DMA is all handled transparently behind the DMA API. Hardening is much more than just replacing map with map_shared and I suspect what you will end up with is basically vendors replacing map with map shared to make things work for their users and washing their hands. That concept exists too. There is a separate allow list for the drivers. So just adding shared to a driver is not enough, until it's also added to the allowlist Users can of course chose to disable the allowlist, but they need to understand the security implications. I would say an explicit flag in the driver that says "hardened" and refusing to init a non hardened one would be better. We have that too (that's the device filtering) But the problem is that device filtering just stops the probe functions, not the initcalls, and lot of legacy drivers do MMIO interactions before going into probe. In some cases it's unavoidable because of the device doesn't have a separate enumeration mechanism it needs some kind of probing to even check for its existence And since we don't want to change all of them it's far safer to make the ioremap opt-in. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/24/2021 12:07 AM, Christoph Hellwig wrote: On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote: On 8/23/21 4:56 PM, Michael S. Tsirkin wrote: Add a new variant of pci_iomap for mapping all PCI resources of a devices as shared memory with a hypervisor in a confidential guest. Signed-off-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan I'm a bit puzzled by this part. So why should the guest*not* map pci memory as shared? And if the answer is never (as it seems to be) then why not just make regular pci_iomap DTRT? It is in the context of confidential guest (where VMM is un-trusted). So we don't want to make all PCI resource as shared. It should be allowed only for hardened drivers/devices. Well, assuming the host can do any damage when mapped shared that also means not mapping it shared will completely break the drivers. There are several cases: - We have driver filtering active to protect you against attacks from the host against unhardened drivers. In this case the drivers not working is the intended behavior. - There is an command allow list override for some new driver, but the driver is hardened and shared The other drivers will still not work, but that's also the intended behavior - Driver filtering is disabled or the allow list override is used to enable some non hardened/enabled driver There is a command line option to override the ioremap sharing default, it will allow all drivers to do ioremap. We would really prefer to make it more finegrained, but it's not possible in this case. Other drivers are likely attackable. - Driver filtering is disabled (allowing attacks on the drivers) and the command line option for forced sharing is set. All drivers initialize and can talk to the host through MMIO. Lots of unhardened drivers are likely attackable. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}
On 8/23/2021 6:04 PM, Dan Williams wrote: On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan wrote: On 8/23/21 4:56 PM, Michael S. Tsirkin wrote: Add a new variant of pci_iomap for mapping all PCI resources of a devices as shared memory with a hypervisor in a confidential guest. Signed-off-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan I'm a bit puzzled by this part. So why should the guest*not* map pci memory as shared? And if the answer is never (as it seems to be) then why not just make regular pci_iomap DTRT? It is in the context of confidential guest (where VMM is un-trusted). So we don't want to make all PCI resource as shared. It should be allowed only for hardened drivers/devices. That's confusing, isn't device authorization what keeps unaudited drivers from loading against untrusted devices? I'm feeling like Michael that this should be a detail that drivers need not care about explicitly, in which case it does not need to be exported because the detail can be buried in lower levels. We originally made it default (similar to AMD), but it during code audit we found a lot of drivers who do ioremap early outside the probe function. Since it would be difficult to change them all we made it opt-in, which ensures that only drivers that have been enabled can talk with the host at all and can't be attacked. That made the problem of hardening all these drivers a lot more practical. Currently we only really need virtio and MSI-X shared, so for changing two places in the tree you avoid a lot of headache elsewhere. Note there is still a command line option to override if you want to allow and load other drivers. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc
Why is it safe to make pci_iomap_wc_range() support IO ports when it didn't before? That might be desirable, but I think it *is* a functional change here. None of the callers use it to map IO ports, so it will be a no-op for them. But you're right, it's a (minor) functional change. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
For most Linux drivers, a report that a misbehaving device can corrupt host memory is a bug, not a feature. If a USB device can corrupt kernel memory, that's a serious bug. If a USB-C device can corrupt kernel memory, that's also a serious bug, although, sadly, we probably have lots of these bugs. If a Firewire device can corrupt kernel memory, news at 11. If a Bluetooth or WiFi peer can corrupt kernel memory, people write sonnets about it and give it clever names. Why is virtio special? Well for most cases it's pointless because they don't have any memory protection anyways. Why break compatibility if it does not buy you anything? Anyways if you want to enable the restricted mode for something else, it's easy to do. The cases where it matters seem to already work on it, like the user space virtio ring. My changes for boundary checking are enabled unconditionally anyways, as well as the other patchkits. This one: int arch_has_restricted_virtio_memory_access(void) +{ + return is_tdx_guest(); +} I'm looking at a fairly recent kernel, and I don't see anything for s390 wired up in vring_use_dma_api. It's not using vring_use_dma_api, but enforces the DMA API at virtio ring setup time, same as SEV/TDX. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
We do not need an increasing pile of kludges Do you mean disabling features is a kludge? If yes I disagree with that characterization. to make TDX and SEV “secure”. We need the actual loaded driver to be secure. The virtio architecture is full of legacy nonsense, and there is no good reason for SEV and TDX to be a giant special case. I don't know where you see a "giant special case". Except for the limited feature negotiation all the changes are common, and the disabling of features (which is not new BTW, but already done e.g. with forcing DMA API in some cases) can be of course used by all these other technologies too. But it just cannot be done by default for everything because it would break compatibility. So every technology with such requirements has to explicitly opt-in. As I said before, real PCIe (Thunderbolt/USB-C or anything else) has the exact same problem. The fact that TDX has encrypted memory is, at best, a poor proxy for the actual condition. The actual condition is that the host does not trust the device to implement the virtio protocol correctly. Right they can do similar limitations of feature sets. But again it cannot be default. TDX and SEV use the arch hook to enforce DMA API, so that part is also solved. Can you point me to the code you’re referring to? See 4/8 in this patch kit. It uses an existing hook which is already used in tree by s390. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
Tell that to every crypto downgrade attack ever. That's exactly what this patch addresses. I see two credible solutions: 1. Actually harden the virtio driver. That's exactly what this patchkit, and the alternative approaches, like Jason's, are doing. 2. Have a new virtio-modern driver and use it for modern use cases. Maybe rename the old driver virtio-legacy or virtio-insecure. They can share code. In most use cases the legacy driver is not insecure because there is no memory protection anyways. Yes maybe such a split would be a good idea for maintenance and maybe performance reasons, but at least from the security perspective I don't see any need for it. Another snag you may hit: virtio’s heuristic for whether to use proper DMA ops or to bypass them is a giant kludge. I’m very slightly optimistic that getting the heuristic wrong will make the driver fail to operate but won’t allow the host to take over the guest, but I’m not really convinced. And I wrote that code! A virtio-modern mode probably should not have a heuristic, and the various iommu-bypassing modes should be fixed to work at the bus level, not the device level TDX and SEV use the arch hook to enforce DMA API, so that part is also solved. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 6/3/2021 10:33 AM, Andy Lutomirski wrote: On 6/2/21 5:41 PM, Andi Kleen wrote: Only allow split mode when in a protected guest. Followon patches harden the split mode code paths, and we don't want an malicious host to force anything else. Also disallow indirect mode for similar reasons. I read this as "the virtio driver is buggy. Let's disable most of the buggy code in one special case in which we need a driver without bugs. In all the other cases (e.g. hardware virtio device connected over USB-C), driver bugs are still allowed." My understanding is most of the other modes (except for split with separate descriptors) are obsolete and just there for compatibility. As long as they're deprecated they won't harm anyone. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
Ok, but what I meant is this, if we don't read from the descriptor ring, and validate all the other metadata supplied by the device (used id and len). Then there should be no way for the device to suppress the dma flags to write to the indirect descriptor table. Or do you have an example how it can do that? I don't. If you can validate everything it's probably ok The only drawback is even more code to audit and test. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 6/8] dma: Add return value to dma_unmap_page
What it looks like to me is abusing SWIOTLB's internal housekeeping to keep track of virtio-specific state. The DMA API does not attempt to validate calls in general since in many cases the additional overhead would be prohibitive. It has always been callers' responsibility to keep track of what they mapped and make sure sync/unmap calls match, and there are many, many, subtle and not-so-subtle ways for things to go wrong if they don't. If virtio is not doing a good enough job of that, what's the justification for making it the DMA API's problem? In this case it's not prohibitive at all. Just adding a few error returns, and checking the overlap (which seems to have been already solved anyways) I would argue the error returns are good practice anyways, so that API users can check that something bad happening and abort. The DMA API was never very good at proper error handling, but there's no reason at all to continue being bad it forever. AFAIK the rest just works anyways, so it's not really a new problem to be solved. A new callback is used to avoid changing all the IOMMU drivers. Nit: presumably by "IOMMU drivers" you actually mean arch DMA API backends? Yes Furthermore, AFAICS it's still not going to help against exfiltrating guest memory by over-unmapping the original SWIOTLB slot *without* going past the end of the whole buffer, That would be just exfiltrating data that is already shared, unless I'm misunderstanding you. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
I agree, but I want to know why indirect descriptor needs to be disabled. The table can't be wrote by the device since it's not coherent swiotlb mapping. I had all kinds of problems with uninitialized entries in the indirect table. So I gave up on it and concluded it would be too difficult to secure. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
It looks to me all the evils came from the fact that we depends on the descriptor ring. So the checks in this patch could is unnecessary if we don't even read from the descriptor ring which could be manipulated by the device. This is what my series tries to achieve: https://www.spinics.net/lists/kvm/msg241825.html I would argue that you should boundary check in any case. It was always a bug to not have boundary checks in such a data structure with multiple users, trust or not. But yes your patch series is interesting and definitely makes sense for TDX too. Best would be to have both I guess, and always check the boundaries everywhere. So what's the merge status of your series? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks
On 6/2/2021 6:48 PM, Konrad Rzeszutek Wilk wrote: On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote: swiotlb currently only uses the start address of a DMA to check if something is in the swiotlb or not. But with virtio and untrusted hosts the host could give some DMA mapping that crosses the swiotlb boundaries, potentially leaking or corrupting data. Add size checks to all the swiotlb checks and reject any DMAs that cross the swiotlb buffer boundaries. I seem to be only CC-ed on this and #7, so please bear with me. You weren't cc'ed originally so if you get partial emails it must be through some list. But could you explain to me why please: commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12) Author: Martin Radev Date: Tue Jan 12 16:07:29 2021 +0100 swiotlb: Validate bounce size in the sync/unmap path does not solve the problem as well? Thanks. I missed that patch, race condition. One major difference of my patch is that it supports an error return, which allows virtio to error out. This is important in virtio because otherwise you'll end up with uninitialized memory on the target without any indication. This uninitialized memory could be an potential attack vector on the guest memory, e.g. if the attacker finds some way to echo it out again. But the error return could be added to your infrastructure too and what would make this patch much shorter. I'll take a look at that. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: Virtio hardening for TDX
Note that it's probably needed by other cases as well: 1) Other encrypted VM technology 2) VDUSE[1] 3) Smart NICs Right. I don't see any reason why these shouldn't work. You may just need to add the enable for the lockdown, but you can reuse the basic infrastructure. We have already had discussions and some patches have been posted[2][3][4]. Thanks. Yes [2] is indeed an alternative. We considered this at some point, but since we don't care about DOS in our case it seemed simpler to just harden the existing code. But yes if it's there it's useful for TDX too. FWIW I would argue that the descriptor boundary checking should be added in any case, security case or separated metadata or not, because it can catch bugs and is very cheap. Checking boundaries is good practice. [4] would be an independent issue, that's something we didn't catch. Also the swiotlb hardening implemented in this patchkit doesn't seem to be in any of the other patches. So I would say my patches are mostly orthogonal to these patches below and not conflicting, even though they address a similar problem space. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
So we will see huge performance regression without indirect descriptor. We need to consider to address this. A regression would be when some existing case would be slower. That's not the case because the behavior for the existing cases does not change. Anyways when there are performance problems they can be addressed, but first is to make it secure. -Andi Thanks break; case VIRTIO_RING_F_EVENT_IDX: break; @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev) case VIRTIO_F_ACCESS_PLATFORM: break; case VIRTIO_F_RING_PACKED: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_F_ORDER_PLATFORM: break; + clear: default: /* We don't understand this bit. */ __virtio_clear_bit(vdev, i); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 5/8] dma: Use size for swiotlb boundary checks
swiotlb currently only uses the start address of a DMA to check if something is in the swiotlb or not. But with virtio and untrusted hosts the host could give some DMA mapping that crosses the swiotlb boundaries, potentially leaking or corrupting data. Add size checks to all the swiotlb checks and reject any DMAs that cross the swiotlb buffer boundaries. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 13 ++--- drivers/xen/swiotlb-xen.c | 11 ++- include/linux/dma-mapping.h | 4 ++-- include/linux/swiotlb.h | 8 +--- kernel/dma/direct.c | 8 kernel/dma/direct.h | 8 kernel/dma/mapping.c| 4 ++-- net/xdp/xsk_buff_pool.c | 2 +- 8 files changed, 30 insertions(+), 28 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..7ef13198721b 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, __iommu_dma_unmap(dev, dma_addr, size); - if (unlikely(is_swiotlb_buffer(phys))) + if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys, } iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask); - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys)) + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size)) swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs); return iova; } @@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(phys, size, dir); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_cpu(dev, phys, size, dir); } @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev, return; phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_device(dev, phys, size, dir); if (!dev_is_dma_coherent(dev)) @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir); - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_cpu(dev, sg_phys(sg), sg->length, dir); } @@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device *dev, return; for_each_sg(sgl, sg, nelems, i) { - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_device(dev, sg_phys(sg), sg->length, dir); - if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_device(sg_phys(sg), sg->length, dir); } diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 24d11861ac7d..333846af8d35 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t p, size_t size) return 0; } -static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) +static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr, +size_t size) { unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr)); unsigned long xen_pfn = bfn_to_local_pfn(bfn); @@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) * in our domain. Therefore _only_ check address within our domain. */ if (pfn_valid(PFN_DOWN(paddr))) - return is_swiotlb_buffer(paddr); + return is_swiotlb_buffer(paddr, size); return 0; } @@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, } /* NOTE: We use dev_addr here, not paddr! */ - if (is_xen_swiotlb_buffer(hwdev, dev_addr)) + if (is_xen_swiotlb_buffer(hwdev, dev_addr, size)) swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); } @@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr, xen_dma_sync_for_cpu(dev, dma_addr, size, dir); } - if (is_xen_swiotlb
[PATCH v1 3/8] virtio: Harden split buffer detachment
Harden the split buffer detachment path by adding boundary checking. Note that when this fails we may fail to unmap some swiotlb mapping, which could result in a leak and a DOS. But that's acceptable because an malicious host can DOS us anyways. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index d37ff5a0ff58..1e9aa1e95e1b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq) return needs_kick; } -static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, -void **ctx) +static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, + void **ctx) { unsigned int i, j; __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); + /* We'll leak DMA mappings when this happens, but nothing +* can be done about that. In the worst case the host +* could DOS us, but it can of course do that anyways. +*/ + if (!inside_split_ring(vq, head)) + return -EIO; + /* Clear data ptr. */ vq->split.desc_state[head].data = NULL; @@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, while (vq->split.vring.desc[i].flags & nextflag) { vring_unmap_one_split(vq, &vq->split.vring.desc[i]); i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next); + if (!inside_split_ring(vq, i)) + return -EIO; vq->vq.num_free++; } @@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, /* Free the indirect table, if any, now that it's unmapped. */ if (!indir_desc) - return; + return 0; len = virtio32_to_cpu(vq->vq.vdev, vq->split.vring.desc[head].len); @@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, } else if (ctx) { *ctx = vq->split.desc_state[head].indir_desc; } + return 0; } static inline bool more_used_split(const struct vring_virtqueue *vq) @@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, void *ret; unsigned int i; u16 last_used; + int err; START_USE(vq); @@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, /* detach_buf_split clears data, so grab it now. */ ret = vq->split.desc_state[i].data; - detach_buf_split(vq, i, ctx); + err = detach_buf_split(vq, i, ctx); + if (err) { + END_USE(vq); + return NULL; + } + vq->last_used_idx++; /* If we expect an interrupt for the next entry, tell host * by writing event index and flush out the write before @@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; detach_buf_split(vq, i, NULL); + /* Don't need to check for error because nothing is returned */ vq->split.avail_idx_shadow--; vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow); -- 2.25.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 8/8] virtio: Error out on endless free lists
Error out with a warning when the free list loops longer than the maximum size while freeing descriptors. While technically we don't care about DOS it is still better to abort it early. We ran into this problem while fuzzing the virtio interactions where the fuzzed code would get stuck for a long time. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 244a5b62d85c..96adaa4c5404 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -685,6 +685,11 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, if (!inside_split_ring(vq, i)) return -EIO; vq->vq.num_free++; + if (WARN_ONCE(vq->vq.num_free > + vq->split.queue_size_in_bytes / + sizeof(struct vring_desc), + "Virtio freelist corrupted")) + return -EIO; } vring_unmap_one_split(vq, &vq->split.vring.desc[i]); -- 2.25.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 1/8] virtio: Force only split mode with protected guest
When running under TDX the virtio host is untrusted. The bulk of the kernel memory is encrypted and protected, but the virtio ring is in special shared memory that is shared with the untrusted host. This means virtio needs to be hardened against any attacks from the host through the ring. Of course it's impossible to prevent DOS (the host can chose at any time to stop doing IO), but there should be no buffer overruns or similar that might give access to any private memory in the guest. virtio has a lot of modes, most are difficult to harden. The best for hardening seems to be split mode without indirect descriptors. This also simplifies the hardening job because it's only a single code path. Only allow split mode when in a protected guest. Followon patches harden the split mode code paths, and we don't want an malicious host to force anything else. Also disallow indirect mode for similar reasons. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71e16b53e9c1..f35629fa47b1 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #ifdef DEBUG @@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev) unsigned int i; for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) { + + /* +* In protected guest mode disallow packed or indirect +* because they ain't hardened. +*/ + switch (i) { case VIRTIO_RING_F_INDIRECT_DESC: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_RING_F_EVENT_IDX: break; @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev) case VIRTIO_F_ACCESS_PLATFORM: break; case VIRTIO_F_RING_PACKED: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_F_ORDER_PLATFORM: break; + clear: default: /* We don't understand this bit. */ __virtio_clear_bit(vdev, i); -- 2.25.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX
In virtio the host decides whether the guest uses the DMA API or not using the strangely named VIRTIO_F_ACCESS_PLATFORM bit (which really indicates whether the DMA API is used or not) For hardened virtio on TDX we want to enforce that that swiotlb is always used, which requires using the DMA API. While IO wouldn't really work without the swiotlb, it might be possible that an attacker forces swiotlbless IO to manipulate memory in the guest. So we want to force the DMA API (which then forces swiotlb), but without relying on the host. There is already an arch_has_restricted_memory_acces hook for this, which is currently used only by s390. Enable the config option for the hook for x86 and enable it for TDX. Signed-off-by: Andi Kleen --- arch/x86/Kconfig | 1 + arch/x86/mm/mem_encrypt_common.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1531a0f905ed..3d804fce31b9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST select X86_X2APIC select SECURITY_LOCKDOWN_LSM select X86_MEM_ENCRYPT_COMMON + select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS help Provide support for running in a trusted domain on Intel processors equipped with Trusted Domain eXtenstions. TDX is a new Intel diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c index 24c9117547b4..2244d1f033ab 100644 --- a/arch/x86/mm/mem_encrypt_common.c +++ b/arch/x86/mm/mem_encrypt_common.c @@ -9,6 +9,7 @@ #include #include +#include #include /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */ @@ -37,3 +38,9 @@ void __init mem_encrypt_init(void) amd_mem_encrypt_init(); } +int arch_has_restricted_virtio_memory_access(void) +{ + return is_tdx_guest(); +} +EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access); + -- 2.25.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 7/8] virtio: Abort IO when descriptor points outside forced swiotlb
Now that we have a return value for unmapping DMA mappings that are outside the forced swiotlb, use that to abort the IO operation. This prevents the host from subverting a read to access some data in the guest address space, which it might then get access somehow in another IO operation. It can subvert reads to point to other reads or other writes, but since it controls IO it can do that anyways. This is only done for the split code path, which is the only one supported with confidential guests. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 1e9aa1e95e1b..244a5b62d85c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -365,29 +365,31 @@ static int vring_mapping_error(const struct vring_virtqueue *vq, * Split ring specific functions - *_split(). */ -static void vring_unmap_one_split(const struct vring_virtqueue *vq, +static int vring_unmap_one_split(const struct vring_virtqueue *vq, struct vring_desc *desc) { u16 flags; + int ret; if (!vq->use_dma_api) - return; + return 0; flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); if (flags & VRING_DESC_F_INDIRECT) { - dma_unmap_single(vring_dma_dev(vq), + ret = dma_unmap_single(vring_dma_dev(vq), virtio64_to_cpu(vq->vq.vdev, desc->addr), virtio32_to_cpu(vq->vq.vdev, desc->len), (flags & VRING_DESC_F_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE); } else { - dma_unmap_page(vring_dma_dev(vq), + ret = dma_unmap_page(vring_dma_dev(vq), virtio64_to_cpu(vq->vq.vdev, desc->addr), virtio32_to_cpu(vq->vq.vdev, desc->len), (flags & VRING_DESC_F_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE); } + return ret; } static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, @@ -609,6 +611,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, break; if (!inside_split_ring(vq, i)) break; + /* +* Ignore unmapping errors since +* we're aborting anyways. +*/ vring_unmap_one_split(vq, &desc[i]); i = virtio16_to_cpu(_vq->vdev, desc[i].next); } @@ -671,7 +677,10 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, i = head; while (vq->split.vring.desc[i].flags & nextflag) { - vring_unmap_one_split(vq, &vq->split.vring.desc[i]); + int ret; + ret = vring_unmap_one_split(vq, &vq->split.vring.desc[i]); + if (ret) + return ret; i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next); if (!inside_split_ring(vq, i)) return -EIO; @@ -878,6 +887,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) continue; /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; + /* Ignore unmap errors because there is nothing to abort */ detach_buf_split(vq, i, NULL); /* Don't need to check for error because nothing is returned */ vq->split.avail_idx_shadow--; -- 2.25.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Virtio hardening for TDX
[v1: Initial post] With confidential computing like TDX the guest doesn't trust the host anymore. The host is allowed to DOS of course, but it is not allowed to read or write any guest memory not explicitely shared with it. This has implication for virtio. Traditionally virtio didn't assume the other side of the communication channel is malicious, and therefore didn't do any boundary checks in virtio ring data structures. This patchkit does hardening for virtio. In a TDX like model the only host memory accesses allowed are in the virtio ring, as well as the (forced) swiotlb buffer. This patch kit does various changes to ensure there can be no access outside these two areas. It is possible for the host to break the communication, but this should result in a IO error on the guest, but no memory safety violations. virtio is quite complicated with many modes. To simplify the task we enforce that virtio is only in split mode without indirect descriptors, when running as a TDX guest. We also enforce use of the DMA API. Then these code paths are hardened against any corruptions on the ring. This patchkit has components in three subsystems: - Hardening changes to virtio, all in the generic virtio-ring - Hardening changes to kernel/dma swiotlb to harden swiotlb against malicious pointers. It requires an API change which needed a tree sweep. - A single x86 patch to enable the arch_has_restricted_memory_access for TDX It depends on Sathya's earlier patchkit that adds the basic infrastructure for TDX. This is only needed for the "am I running in TDX" part. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v1 2/8] virtio: Add boundary checks to virtio ring
In protected guest mode we don't trust the host. This means we need to make sure the host cannot subvert us through virtio communication. In general it can corrupt our virtio data and cause a DOS, but it should not be able to access any data that is not explicitely under IO. Also boundary checking so that the free list (which is accessible to the host) cannot point outside the virtio ring. Note it could still contain loops or similar, but these should only cause an DOS, not a memory corruption or leak. When we detect any out of bounds descriptor trigger an IO error. We also use a WARN() (in case it was a software bug instead of an attack). This implies that a malicious host can flood the guest kernel log, but that's only a DOS and acceptable in the threat model. This patch only hardens the initial consumption of the free list, the freeing comes later. Any of these errors can cause DMA memory leaks, but there is nothing we can do about that and that would be just a DOS. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 46 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index f35629fa47b1..d37ff5a0ff58 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, return desc; } +/* assumes no indirect mode */ +static inline bool inside_split_ring(struct vring_virtqueue *vq, +unsigned index) +{ + return !WARN(index >= vq->split.vring.num, + "desc index %u out of bounds (%u)\n", + index, vq->split.vring.num); +} + static inline int virtqueue_add_split(struct virtqueue *_vq, struct scatterlist *sgs[], unsigned int total_sg, @@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, unsigned int i, n, avail, descs_used, prev, err_idx; int head; bool indirect; + int io_err; START_USE(vq); @@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, } for (; n < (out_sgs + in_sgs); n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, dma_addr_t addr = vring_map_single( vq, desc, total_sg * sizeof(struct vring_desc), DMA_TO_DEVICE); + io_err = -ENOMEM; if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* We're using some buffers from the free list. */ vq->vq.num_free -= descs_used; + io_err = -EIO; + if (!inside_split_ring(vq, head)) + goto unmap_release; + /* Update free pointer */ if (indirect) vq->free_head = virtio16_to_cpu(_vq->vdev, @@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* Put entry in available array (but don't update avail->idx until they * do sync). */ avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1); + + if (avail >= vq->split.vring.num) + goto unmap_release; + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head); /* Descriptors and available array need to be set before we expose the @@ -576,6 +607,8 @@ static inline int virtqueue_add_split(struct virtque
[PATCH v1 6/8] dma: Add return value to dma_unmap_page
In some situations when we know swiotlb is forced and we have to deal with untrusted hosts, it's useful to know if a mapping was in the swiotlb or not. This allows us to abort any IO operation that would access memory outside the swiotlb. Otherwise it might be possible for a malicious host to inject any guest page in a read operation. While it couldn't directly access the results of the read() inside the guest, there might scenarios where data is echoed back with a write(), and that would then leak guest memory. Add a return value to dma_unmap_single/page. Most users of course will ignore it. The return value is set to EIO if we're in forced swiotlb mode and the buffer is not inside the swiotlb buffer. Otherwise it's always 0. A new callback is used to avoid changing all the IOMMU drivers. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 17 +++-- include/linux/dma-map-ops.h | 3 +++ include/linux/dma-mapping.h | 7 --- kernel/dma/mapping.c| 6 +- 4 files changed, 23 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7ef13198721b..babe46f2ae3a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr, iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist); } -static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, +static int __iommu_dma_unmap_swiotlb_check(struct device *dev, + dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, phys = iommu_iova_to_phys(domain, dma_addr); if (WARN_ON(!phys)) - return; + return -EIO; __iommu_dma_unmap(dev, dma_addr, size); if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); + else if (swiotlb_force == SWIOTLB_FORCE) + return -EIO; + return 0; } static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, @@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, return dma_handle; } -static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir); - __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs); + return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir, + attrs); } /* @@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *s int i; for_each_sg(sg, s, nents, i) - __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s), + __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); } @@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = { .mmap = iommu_dma_mmap, .get_sgtable= iommu_dma_get_sgtable, .map_page = iommu_dma_map_page, - .unmap_page = iommu_dma_unmap_page, + .unmap_page_check = iommu_dma_unmap_page_check, .map_sg = iommu_dma_map_sg, .unmap_sg = iommu_dma_unmap_sg, .sync_single_for_cpu= iommu_dma_sync_single_for_cpu, diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d53a96a3d64..0ed0190f7949 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -69,6 +69,9 @@ struct dma_map_ops { u64 (*get_required_mask)(struct device *dev); size_t (*max_mapping_size)(struct device *dev); unsigned long (*get_merge_boundary)(struct device *dev); + int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle, + size_t size, enum dma_data_direction dir, + unsigned long attrs); }; #ifdef CONFIG_DMA_OPS diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 37fbd12bd4ab..25b8382f8601 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir, unsigned long attrs); -void dma_unmap_page_attrs(struct device *dev,
Re: [PATCH 08/62] x86/boot/compressed/64: Add IDT Infrastructure
Joerg Roedel writes: > + addq$8, %rsp > + > + /* > + * Make sure we return to __KERNEL_CS - the CS selector on > + * the IRET frame might still be from an old BIOS GDT > + */ > + movq$__KERNEL_CS, 8(%rsp) This doesn't make sense. Either it's running on the correct CS before the exception or not. Likely there's some other problem here that you patched over with this hack. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] Remove 32-bit Xen PV guest support
Juergen Gross writes: > The long term plan has been to replace Xen PV guests by PVH. The first > victim of that plan are now 32-bit PV guests, as those are used only > rather seldom these days. Xen on x86 requires 64-bit support and with > Grub2 now supporting PVH officially since version 2.04 there is no > need to keep 32-bit PV guest support alive in the Linux kernel. > Additionally Meltdown mitigation is not available in the kernel running > as 32-bit PV guest, so dropping this mode makes sense from security > point of view, too. Normally we have a deprecation period for feature removals like this. You would make the kernel print a warning for some releases, and when no user complains you can then remove. If a user complains you can't. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86, kbuild: revert macrofying inline assembly code
Masahiro Yamada writes: > Revert the following 9 commits: FWIW the -Wa additional also broke LTO builds because it doesn't really support -Wa for individual files. So I'm glad they got reverted. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 0/4] Enable LBR for the guest
> On the other side, it seems that the (guest) kernel driver also works > without > the above being supported, should we change it to report error and stop > using the PMU features when the check of the above two fails (at > intel_pmu_init())? You could add the extra check for the LBR code yes, although it already checks if the LBR MSRs work, so in practice it's likely already covered. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest
> 1) vCPU context switching and guest side task switching are not identical. > That is, when the vCPU is scheduled out, the guest task on the vCPU may not guest task lifetime has nothing to do with this. It's completely independent of what you do here on the VCPU level. > run out its time slice yet, so the task will continue to run when the vCPU > is > scheduled in by the host (lbr wasn't save by the guest task when the vCPU is > scheduled out in this case). > > It is possible to have the vCPU which runs the guest task (in use of lbr) > scheduled > out, followed by a new host task being scheduled in on the pCPU to run. > It is not guaranteed that the new host task does not use the LBR feature on > the > pCPU. Sure it may use the LBR, and the normal perf context switch will switch it and everything works fine. It's like any other per-task LBR user. > > 2) Sometimes, people may want this usage: "perf record -b > ./qemu-system-x86_64 ...", > which will need lbr to be used in KVM as well. In this obscure case you can disable LBR support for the guest. The common case is far more important. It sounds like you didn't do any performance measurements. I expect the performance of your current solution to be terrible. e.g. a normal perf PMI does at least 1 MSR reads and 4+ MSR writes for a single counter. With multiple counters it gets worse. For each of those you'll need to exit. Adding something to the entry/exit list is similar to the cost of doing explicit RD/WRMSRs. On Skylake we have 32*3=96 MSRs for the LBRs. So with the 5 exits and entries, you're essentually doing 5*2*96=18432 extra MSR accesses for each PMI. MSR access is 100+ cycles at least, for writes it is far more expensive. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 0/4] Enable LBR for the guest
On Mon, Sep 25, 2017 at 12:44:52PM +0800, Wei Wang wrote: > This patch series enables the Last Branch Recording feature for the > guest. Instead of trapping each LBR stack MSR access, the MSRs are > passthroughed to the guest. Those MSRs are switched (i.e. load and > saved) on VMExit and VMEntry. > > Test: > Try "perf record -b ./test_program" on guest. I don't see where you expose the PERF capabilities MSR? That's normally needed for LBR too to report the version number. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest
> +static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx) > +{ > + int i; > + struct perf_lbr_stack lbr_stack; > + > + perf_get_lbr_stack(&lbr_stack); > + > + add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0); > + add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0); > + > + for (i = 0; i < lbr_stack.lbr_nr; i++) { > + add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0); > + add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0); > + if (lbr_stack.lbr_info) > + add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0, > + 0); > + } That will be really expensive and add a lot of overhead to every entry/exit. perf can already context switch the LBRs on task context switch. With that you can just switch LBR_SELECT, which is *much* cheaper because there are far less context switches than exit/entries. It implies that when KVM is running it needs to prevent perf from enabling LBRs in the context of KVM, but that should be straight forward. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH -tip v3 13/23] x86/trap: Use NOKPROBE_SYMBOL macro in trap.c
On Wed, Nov 20, 2013 at 04:22:21AM +, Masami Hiramatsu wrote: > Use NOKPROBE_SYMBOL macro to protect functions from kprobes > instead of __kprobes annotation in trap.c. > This also applies __always_inline annotation for some cases, > because NOKPROBE_SYMBOL() will inhibit inlining by referring > the symbol address. NOKPROBE_SYMBOL seems to add a reference from some variable to the function? With LTO we can optimize away unused functions, but not when there are references to the symbol. So this would likely prevent optimizations with LTO. I prefer a simpler "__kprobe" annotation. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sat, Jun 01, 2013 at 01:28:00PM -0700, Jeremy Fitzhardinge wrote: > On 06/01/2013 01:14 PM, Andi Kleen wrote: > > FWIW I use the paravirt spinlock ops for adding lock elision > > to the spinlocks. > > Does lock elision still use the ticketlock algorithm/structure, or are > they different? If they're still basically ticketlocks, then it seems > to me that they're complimentary - hle handles the fastpath, and pv the > slowpath. It uses the ticketlock algorithm/structure, but: - it needs to know that the lock is free with an own operation - it has an additional field for strong adaptation state (but that field is independent of the low level lock implementation, so can be used with any kind of lock) So currently it inlines the ticket lock code into its own. Doing pv on the slow path would be possible, but would need some additional (minor) hooks I think. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks. This needs to be done at the top level (so the level you're removing) However I don't like the pv mechanism very much and would be fine with using an static key hook in the main path like I do for all the other lock types. It also uses interrupt ops patching, for that it would be still needed though. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3] x86: use a read-only IDT alias on all CPUs
Ingo Molnar writes: > > This looks very nice to me now. Peter, any objections? it seems pointless without randomized main kernel text location, because the IDT will be still at a known per kernel fixed writable location in the direct mapping. As long as such randomization is not there it just wastes a TLB entry. -Andi -- a...@linux.intel.com -- Speaking for myself only ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
> How is the compiler meant to optimise away "cond" if it's a function > call? Inlines can be optimized away. These tests are usually inlines. > What did I miss? If nothing, then I will revert this particular change > and Rafael will need to be sure his patch is not using VM_BUG_ON to call > a function with side-effects. Do you have an example where the code is actually different, or are you just speculating? FWIW for my config both generates the same code: size vmlinux-andi-vmbug vmlinux-vmbug-nothing textdata bss dec hex filename 118097041457352 1159168 14426224 dc2070 vmlinux-andi-vmbug 118097041457352 1159168 14426224 dc2070 vmlinux-vmbug-nothing -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
> I'm not fully sure what putback_balloon_page does, but if it just tests > a bit (without non variable test_bit) it should be ok. ^ should be "without variable ..." -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
> > What shocked me actually is that VM_BUG_ON code is executed on > !CONFIG_DEBUG_VM builds and has been since 2.6.36 due to commit [4e60c86bd: > gcc-4.6: mm: fix unused but set warnings]. I thought the whole point of > VM_BUG_ON was to avoid expensive and usually unnecessary checks. Andi, > was this deliberate? The idea was that the compiler optimizes it away anyways. I'm not fully sure what putback_balloon_page does, but if it just tests a bit (without non variable test_bit) it should be ok. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote: > On Sat, 31 Mar 2012, Andi Kleen wrote: > > > > So if a guest exits due to an external event it's easy to inspect the > > > state of that guest and avoid to schedule away when it was interrupted > > > in a spinlock held section. That guest/host shared state needs to be > > > > On a large system under high contention sleeping can perform surprisingly > > well. pthread mutexes have a tendency to beat kernel spinlocks there. > > So avoiding sleeping locks completely (that is what pv locks are > > essentially) is not necessarily that good. > > Care to back that up with numbers and proper trace evidence instead of > handwaving? E.g. my plumbers presentations on lock and mm scalability from last year has some graphs that show this very clearly, plus some additional data on the mutexes. This compares to the glibc futex locks, which perform much better than the kernel mutex locks on larger systems under higher contention Given your tone I will not supply an URL. I'm sure you can find it if you need it. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
> So if a guest exits due to an external event it's easy to inspect the > state of that guest and avoid to schedule away when it was interrupted > in a spinlock held section. That guest/host shared state needs to be On a large system under high contention sleeping can perform surprisingly well. pthread mutexes have a tendency to beat kernel spinlocks there. So avoiding sleeping locks completely (that is what pv locks are essentially) is not necessarily that good. Your proposal is probably only a good idea on low contention and relatively small systems. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: netlink versus pid namespaces
On Fri, Oct 15, 2010 at 07:19:03PM +0400, Alexey Kuznetsov wrote: > Hello! > > > netlink uses pids (or really tids I hope?) to address sockets > > associated with processes. > > Not really. It uses port number which is called "pid" occasionally. Bad name. > Autobind function simply selects tgid of calling process as the first guess. Thanks for the clarification, Alexey. I guess I should have read more code :/ -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
netlink versus pid namespaces
Hi, I have been trying to figure out how pid namespaces interact with netlink. netlink uses pids (or really tids I hope?) to address sockets associated with processes. The netlink code passes around pids without caring much about the pid namespace. It does pass around some information about the network namespace, but that doesn't help here because the pid namespace is not necessarily related to the net namespace. When the netlink consumer runs in kernel (like rtnetlink) and happens to run in the same process context while receiving and processing the data it should do the right thing because it has the same pid namespace. If it runs in some other process that is not guaranteed and it may actually send the reply back to the wrong pid. When a process receives netlink in user space and it isn't in the same pid space as the sender it is unlikely that the reply gets back. Anything I'm missing here? Does netlink need to be extended? Or perhaps forbid passing netlink between name spaces? Thanks, -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Wed, Nov 04, 2009 at 03:17:36PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote: > > On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote: > > > On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote: > > > > > Fine? > > > > > > > > I cannot say -- are there paths that could drop the device beforehand? > > > > > > Do you mean drop the mm reference? > > > > No the reference to the device, which owns the mm for you. > > The device is created when file is open and destroyed > when file is closed. So I think the fs code handles the > reference counting for me: it won't call file cleanup > callback while some userspace process has the file open. > Right? Yes. But the semantics when someone inherits such a fd through exec or through file descriptor passing would be surely "interesting" You would still do IO on the old VM. I guess it would be a good way to confuse memory accounting schemes or administrators @) It would be all saner if this was all a single atomic step. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote: > > > Fine? > > > > I cannot say -- are there paths that could drop the device beforehand? > > Do you mean drop the mm reference? No the reference to the device, which owns the mm for you. > > > (as in do you hold a reference to it?) > > By design I think I always have a reference to mm before I use it. > > This works like this: > ioctl SET_OWNER - calls get_task_mm, I think this gets a reference to mm > ioctl SET_BACKEND - checks that SET_OWNER was run, starts virtqueue > ioctl RESET_OWNER - stops virtqueues, drops the reference to mm > file close - stops virtqueues, if we still have it then drops mm > > This is why I think I can call use_mm/unuse_mm while virtqueue is running, > safely. > Makes sense? Do you protect against another thread doing RESET_OWNER in parallel while RESET_OWNER runs? -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
> Fine? I cannot say -- are there paths that could drop the device beforehand? (as in do you hold a reference to it?) -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
"Michael S. Tsirkin" writes: Haven't really read the whole thing, just noticed something at a glance. > +/* Expects to be always run from workqueue - which acts as > + * read-size critical section for our kind of RCU. */ > +static void handle_tx(struct vhost_net *net) > +{ > + struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX]; > + unsigned head, out, in, s; > + struct msghdr msg = { > + .msg_name = NULL, > + .msg_namelen = 0, > + .msg_control = NULL, > + .msg_controllen = 0, > + .msg_iov = vq->iov, > + .msg_flags = MSG_DONTWAIT, > + }; > + size_t len, total_len = 0; > + int err, wmem; > + size_t hdr_size; > + struct socket *sock = rcu_dereference(vq->private_data); > + if (!sock) > + return; > + > + wmem = atomic_read(&sock->sk->sk_wmem_alloc); > + if (wmem >= sock->sk->sk_sndbuf) > + return; > + > + use_mm(net->dev.mm); I haven't gone over all this code in detail, but that isolated reference count use looks suspicious. What prevents the mm from going away before you increment, if it's not the current one? -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 00/54] [Announce] Microsoft Hyper-V drivers for Linux
Greg KH writes: > > It's the "normal" way of cleaning up code. Start with a known-working > piece of stuff, and incrementally clean it up. I have to agree with David. This is essentially unreviewable in the original state. It would be good if you could do all the obvious cleanups first and then post a consolidated series. Most "obvious" cleanups should not change the binary, so you can easily check they are correct by comparing the assembler code. -Andi -- a...@linux.intel.com -- Speaking for myself only. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
Anthony Liguori <[EMAIL PROTECTED]> writes: > > What we would rather do in KVM, is have the VFs appear in the host as > standard network devices. We would then like to back our existing PV > driver to this VF directly bypassing the host networking stack. A key > feature here is being able to fill the VF's receive queue with guest > memory instead of host kernel memory so that you can get zero-copy > receive traffic. This will perform just as well as doing passthrough > (at least) and avoid all that ugliness of dealing with SR-IOV in the > guest. But you shift a lot of ugliness into the host network stack again. Not sure that is a good trade off. Also it would always require context switches and I believe one of the reasons for the PV/VF model is very low latency IO and having heavyweight switches to the host and back would be against that. -Andi -- [EMAIL PROTECTED] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] KVM PV Guest: Implement paravirtualized DMA
Amit Shah <[EMAIL PROTECTED]> writes: > + > +static struct page *page; > +static unsigned long page_gfn; Bad variable names > + > +const struct dma_mapping_ops *orig_dma_ops; I suspect real dma ops stacking will need some further thought than your simple hacks > + > + match = find_matching_pt_dev(&pt_devs_head, &pv_pci_info); > + if (match) { > + r = match->is_pv; > + goto out; > + } > + > + memcpy(page_address(page), &pv_pci_info, sizeof(pv_pci_info)); Note that on 32bit page_address() might be not mapped. > + > + npages = get_order(size) + 1; Are you sure that's correct? It looks quite bogus. order is a 2 logarithm, normally npages = 1 << order if you want npages from order the correct need 1 << order Haven't read further, but to be honest the code doesn't seem to be anywhere near merging quality. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: PV DMA for PCI passthrough devices for KVM
Amit Shah <[EMAIL PROTECTED]> writes: > This patchset implements PVDMA for handling DMA requests from > devices assigned to the guest from the host machine. You forgot to post a high level design overview of how this works, what it is good for, what are the design trade offs etc.? Include that in the first patch. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86 DMA: Handle devices assigned to the guest by the host
Amit Shah <[EMAIL PROTECTED]> writes: > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c > index 388b113..678cafb 100644 > --- a/arch/x86/kernel/pci-dma.c > +++ b/arch/x86/kernel/pci-dma.c > @@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, > dma_addr_t *dma_handle, > memset(memory, 0, size); > if (!mmu) { > *dma_handle = bus; > + if (unlikely(dma_ops->is_pv_device) && > + unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) { First double unlikely in a condition is useless. Just drop them. And then ->is_xyz() in a generic vops interface is about as ugly and non generic as you can get. dma_alloc_coherent is not performance critical, so you should rather change the interface that ->alloc_coherent is always called and the other handlers handle the !mmu case correctly. In fact they need that already I guess (e.g. on DMAR there is not really a nommu case) -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] dm-band: The I/O bandwidth controller: Document
Ryo Tsuruta <[EMAIL PROTECTED]> writes: > Here is the document of dm-band. Could you please address in the document how the intended use cases/feature set etc. differs from CFQ2 io priorities? Thanks, -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/10] add missing parameter for lookup_address
On Fri, Jan 18, 2008 at 12:26:13PM -0800, Chris Wright wrote: > * Glauber de Oliveira Costa ([EMAIL PROTECTED]) wrote: > > lookup_address() receives two parameters, but efi_64.c call > > is passing only one. It's actually preventing the tree from compiling > > > > Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]> > > Good catch, I know I don't test with CONFIG_EFI=y Ah that came probably from the CPA patchset which added the parameter. Sorry for that. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/8] unify paravirt parts of system.h
> It probably is safe to remove... but we currently support '2.8.95 > kernel loads/resumes 2.6.24 image'... which would break if 2.8 uses > cr8. No it won't. 2.8 would just restore some random useless value. If 2.8 wants to use CR8 it would have to re-initialize it -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 19/19] unify set_tss_desc
> +static inline void pack_tss(tss_desc *tss, unsigned long addr, > +unsigned size, unsigned entry) > +{ > +#ifdef CONFIG_X86_64 > + set_tssldt_descriptor(tss, > + addr, entry, size); > +#else > + pack_descriptor(tss, (unsigned long)addr, > + size, > + 0x80 | entry, 0); It would be better if you just unified that first instead of ifdef'ing. I guess i386 could get a set_tssldt_descriptor() like wrapper around pack_descriptor(). -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/19] unify desc_struct
> +/* > + * FIXME: Acessing the desc_struct through its fields is more elegant, > + * and should be the one valid thing to do. However, a lot of open code > + * still touches the a and b acessors, and doing this allow us to do it > + * incrementally. We keep the signature as a struct, rather than an union, > + * so we can get rid of it transparently in the future -- glommer > + */ > +#define raw_desc_struct struct { unsigned int a, b; } > +#define detailed_desc_struct \ > + struct { \ > + u16 limit0; \ > + u16 base0; \ > + unsigned base1 : 8, type : 4, s : 1, dpl : 2, p : 1;\ > + unsigned limit : 4, avl : 1, l : 1, d : 1, g : 1, base2 :8;\ > + } The standard clean way to do this is with a anonymous union. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/8] unify paravirt parts of system.h
On Tue, Dec 04, 2007 at 09:18:33PM +0200, Avi Kivity wrote: > Glauber de Oliveira Costa wrote: >> This patch moves the i386 control registers manipulation functions, >> wbinvd, and clts functions to system.h. They are essentially the same >> as in x86_64, except for the cr8 register, which we add. >> >> + >> +static inline unsigned long native_read_cr8(void) >> +{ >> +unsigned long cr8; >> +asm volatile("mov %%cr8,%0" : "=r" (cr8), "=m" (__force_order)); >> +return cr8; >> +} >> + >> > > There is no cr8 register on i386. This had better be protected by an > #ifdef. > > (you're likely not getting an error since it's a static inline, so the asm > is never emitted) Linux never uses that register. The only user is suspend save/restore, but that' bogus because it wasn't ever initialized by Linux in the first place. It could be probably all safely removed. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/10] allow sched clock to be overridden by paravirt
On Tue, Dec 04, 2007 at 09:09:57AM -0200, Glauber de Oliveira Costa wrote: > This patch turns the sched_clock into native_sched_clock. > sched clock becomes a weak symbol, which can then give its > place to a paravirt definition. I ended up regretting that solution on i386, it would be better to solve that more clearly with ifdefs. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24
> > Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid > > upsetting Xen This patch was the wrong fix anyways as discussed; David Chinner finally came up with a better one. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 24/24] make vsmp a paravirt client
> If vsmp is selected, PARAVIRT will be too, and the interrupt code will > be patched. > the vsmp option triggers a select statement. > > the ifdef only exists because, as I said, the code itself will be always > compiled in, to avoid an ifdef in setup_64.c. So it's just a taking it > from here, putting it there issue. Peeking around needlessly in pci config space at each boot just to avoid an ifdef is not an good idea. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 24/24] make vsmp a paravirt client
> The vsmp_64.c file is now compiled unconditionally, according to which > me and kiran agreed to. The detection code is always run, but will only > trigger when a suitable box is found. Accordingly, the paravirt structs > are only touched when PARAVIRT is on. Otherwise, we don't even have the > symbols. That seems dumb. What good is it if it doesn't patch the interrupt code? -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 24/24] make vsmp a paravirt client
On Fri, Nov 09, 2007 at 04:43:05PM -0200, Glauber de Oliveira Costa wrote: > This patch makes vsmp a paravirt client. It now uses the whole > infrastructure provided by pvops. When we detect we're running > a vsmp box, we change the irq-related paravirt operations (and so, > it have to happen quite early), and the patching function The PARAVIRT ifdefs look wrong. Surely you don't need them at all because it cannot work at all without paravirt. Also you got some white space damage. And the "EM64T based comment" is wrong because there are AMD based vSMPs too. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24
> Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid > upsetting Xen This should be probably done unconditionally because it's a undefined dangerous condition everywhere. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.
On Tuesday 18 September 2007 23:34, Rusty Russell wrote: > On Tue, 2007-09-18 at 12:57 +0200, Andi Kleen wrote: > > On Friday 14 September 2007 07:21, Rusty Russell wrote: > > > It's pretty widely used, > > > > Is it? By whom? > > Hi Andi, > > Please stop asking for facts! It's was easy claim to make, and hard to > disprove 8) > > > > and the distributions will turn it on. > > > > That's no reason to make it default y. Please undo that. default y > > is near always a bad idea. > > How about a "select" based on Xen, lguest or VMI? There's no other > reason to enable it, after all. I did an patch to do that recently because the current setup is indeed unobvious. But I had to drop it again because it ended up with Kconfig warnings. about undefined symbols on x86-64. The problem is that lguest is visible in Kconfig for all architectures and it warns if you select something that doesn't exist on all architectures. The only workaround would have been to define PARAVIRT for all architectures, which I considered too ugly. I think Sam stated recently he wanted to remove that warning but it needed some infrastructure work. > > Also I would still consider it experimental. > > After 9 months in mainline and three kernel versions, Well it changed a lot each release. > I'd hope not. > It's been pretty damn stable (ok, you broke it once, but maybe that's > because you consider it experimental). Is there a significant user base? At least the Xen port seems to have specific requirements and essentially only work on xen-unstable (?) [or at least some very new Xen version] which probably very few people use. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.
> Actually if I understand the functionality of paravirt correctly that is > not correct. I believe that will turn on the paravirt bits which allow > it to run under things such as VMI or Xen. You don't understand it correctly then. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.
> Why is making something default y a bad idea? > Those most likely to care can turn it off. Is there a harmful effect > from leaving it on if its not being used? Running yes "" | make oldconfig to upgrade kernel configs is standard practice and you definitely don't want to have all kinds of random new unnecessary features be turned on then. Besides paravirt by itself is pretty useless; you need typically quite complex other options set to do any meaningfull virtualization. The only reason to use default y is in options that are not user visible and have a reasonable default or things that cause direct boot failures when upgrading old configurations. That all doesn't apply here. Besides it's bad taste and taste is very important. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.
On Friday 14 September 2007 07:21, Rusty Russell wrote: > It's pretty widely used, Is it? By whom? > and the distributions will turn it on. That's no reason to make it default y. Please undo that. default y is near always a bad idea. Also I would still consider it experimental. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] Fix preemptible lazy mode bug
>I agree. The patch is a nop. I just got overly paranoid. The whole >thing is just very prone to bugs. So do we need a patch for .23 or not? >; it does > seem perfectly acceptable though for the mm code to use kmap or vmap > (not kmap_atomic) internally somewhere in the pagetable code. i386 does it all the time for highmem pagetables in fact. -Andi ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization