Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-14 Thread Andi Kleen




I thought you basically create an OperationRegion of SystemMemory type,
and off you go. Maybe the OSPM in Linux is clever and protects
some memory, I wouldn't know.



I investigated this now, and it looks like acpi is using 
ioremap_cache(). We can hook into that and force non sharing. It's 
probably safe to assume that this is not used on real IO devices.


I think there are still some other BIOS mappings that use just plain 
ioremap() though.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




But that was due to performance problems in hot paths. Nothing of this
applies here.

It applies because a new API that individual driver authors is being
proposed and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.


Right now we're only talking about 2 places to change, and none of those 
are actually in individual drivers, but in the virtio generic code and 
in the MSI code.


While there might be drivers in the future that do it directly it will 
be always the exception, normal drivers don't have to deal with this.





For me it both seems very straight forward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

Right now there is no leaf driver changed at all.



I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.

A hardened driver is a driver that

- Had similar security (not API) oriented review of its IO operations
(mainly MMIO access, but also PCI config space) as a non privileged user
interface (like a ioctl). That review should be focused on memory safety.

- Had some fuzzing on these IO interfaces using to be released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?



Right now it's zero I believe.

That is there is other low level code that sets memory shared, but it's 
not using ioremap, but some other mechanisms.




are needed

for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.

But how do you debug the kernel then? Making early undebuggable seems
just bad policy to me.

I am not proposing making the early undebuggable.



That's the implication of moving the policy into initrd.


If only initrd can authorize then it won't be possible to authorize 
before initrd, thus the early console won't work.


-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote:

On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:

Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?


Yes for pci_iomap we could do that.

If someone uses raw ioremap without a device it won't work, but I don't 
think that's the case for virtio at least.


I suppose we could solve that problem if it actually happens.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Andi Kleen




Interesting. VT-d tradeoffs ... what are they?


The connection to the device is not encrypted and also not authenticated.

This is different that even talking to the (untrusted) host through 
shared memory where you at least still have a common key.



Allowing hypervisor to write into BIOS looks like it will
trivially lead to code execution, won't it?


This is not about BIOS code executing. While the guest firmware runs it 
is protected of course. This is for BIOS structures like ACPI tables 
that are mapped by Linux. While AML can run byte code it can normally 
not write to arbitrary memory.


The risk is more that all the Linux code dealing with this hasn't been 
hardened to deal with malicious input.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 12:13 PM, Dan Williams wrote:

On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:



I suspect the true number is even higher because that doesn't include IO
inside calls to other modules and indirect pointers, correct?

Actually everything should be included. Smatch has cross-function db and
I am using it for getting the call chains and it follows function pointers.
Also since I am starting from a list of individual read IOs, every single
base read IO in drivers/* should be covered as far as I can see. But if it uses
some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?


This is everything that would need to be moved somewhere else if we 
didn't disable the entire classes of ioremap users.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 11:36 AM, Reshetova, Elena wrote:

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
others) in init functions that also register drivers (thanks Elena for
the number)

To provide more numbers on this. What I can see so far from a smatch-based
analysis, we have 409 __init style functions (.probe & builtin/module_
_platform_driver_probe excluded) for 5.15 with allyesconfig. The number of
distinct individual IO reads (MSRs included) is much higher than 2.4k and on the
range of 30k because quite often there are more than a single IO read in the
same source function. The full list of accesses and the possible call paths is 
huge
for manually looking at, but here is the list of the 409 functions if anyone 
wants
to take a look:



Thanks Elena.


I suspect the true number is even higher because that doesn't include IO 
inside calls to other modules and indirect pointers, correct?



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/11/2021 10:31 PM, Christoph Hellwig wrote:

On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:

The reason we have trouble is that it's not clear what does the API mean
outside the realm of TDX.
If we really, truly want an API that says "ioremap and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.

Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
in any of the senses we generally use ioremap for.


I/O memory is anything outside the kernel memory map.

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe".


But that was due to performance problems in hot paths. Nothing of this 
applies here.



In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.


What is confusing exactly?

For me it both seems very straight forward and simple (but then I'm biased)

The policy is:

- Have an allow list at driver registration.

- Have an additional opt-in for MMIO mappings (and potentially config 
space, but that's not currently there) to cover init calls completely.




I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.


A hardened driver is a driver that

- Had similar security (not API) oriented review of its IO operations 
(mainly MMIO access, but also PCI config space) as a non privileged user 
interface (like a ioctl). That review should be focused on memory safety.


- Had some fuzzing on these IO interfaces using to be released tools.

Right now it's only three virtio drivers (console, net, block)

Really it's no different than what we do for every new unprivileged user 
interface.




I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config?


I don't think arguments from "the average kernel config" (if such a 
thing even exists) are useful. That's would be just hand waving.




A strawman might be to add a sprinkling error
exits in the module_init() of the problematic drivers, and only fail
if the module is built-in, and let modprobe policy handle the rest.



That would be already hundreds of changes. I have no idea how such a 
thing could be maintained or sustained either.


Really I don't even see how these alternatives can be considered. Tree 
sweeps should always be last resort. They're a pain for everyone. But 
here they're casually thrown around as alternatives to straight forward 
one or two line changes.








Default policy in user space just seems to be a bad idea here. Who
should know if a driver is hardened other than the kernel? Maintaining
the list somewhere else just doesn't make sense to me.

I do not understand the maintenance burden correlation of where the
policy is driven vs where the list is maintained?


All the hardening and auditing happens in the kernel tree. So it seems 
the natural place to store the result is in the kernel tree.


But there's no single package for initrd, so you would need custom 
configurations for all the supported distros.


Also we're really arguing about a list that currently has three entries.



  Even if I agreed
with the contention that out-of-tree userspace would have a hard time
tracking the "hardened" driver list there is still an in-tree
userspace path to explore. E.g. perf maintains lists of things tightly
coupled to the kernel, this authorized device list seems to be in the
same category of data.


You mean the event list? perf is in the kernel tree, so it's maintained 
together with the kernel.


But we don't have a kernel initrd.






Also there is the more practical problem that some devices are needed
for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.


But how do you debug the kernel then? Making early undebuggable seems 
just bad policy to me.


And if you fix if for the console why not add the two more entries for 
virtio net and block too?



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Andi Kleen




I mean ... it's already wide spread.



I meant wide spread usage with confidential guests.


If we support it with TDX
it will be used with TDX.


It has some security trade offs. The main reason to use TDX is security. 
Also when people take the VT-d tradeoffs they might be ok with the BIOS 
trade offs too.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-11 Thread Andi Kleen




Presumably bios code is in arch/x86 and drivers/acpi, right?
Up to 200 calls the majority of which is likely private ...


Yes.


I don't have better ideas but the current setup will just
result in people making their guests vulnerable whenever they
want to allow device pass-through.



Yes that's true. For current TDX our target is virtual devices only. But 
if pass through usage will be really wide spread we may need to revisit.



-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Andi Kleen




Because it does not end with I/O operations, that's a trivial example.
module unloading is famous for being racy: I just re-read that part of
virtio drivers and sure enough we have bugs there, this is after
they have presumably been audited, so a TDX guest is better off
just disabling hot-unplug completely, and hotplug isn't far behind.


These all shouldn't matter for a confidential guest. The only way it can 
be attacked is through IO, everything else is protected by hardware.



Also it would all require doing something at the guest level, which we 
assume is not malicious.




Malicious filesystems can exploit many linux systems unless
you take pains to limit what is mounted and how.


That's expected to be handled by authenticated dmcrypt and similar. 
Hardening at this level has been done for many years.




Networking devices tend to get into the default namespaces and can
do more or less whatever CAP_NET_ADMIN can.
Etc.



Networking should be already hardened, otherwise you would have much 
worse problems today.





hange in your subsystem here.
Well I commented on the API patch, not the virtio patch.
If it's a way for a driver to say "I am hardened
and audited" then I guess it should at least say so.



This is handled by the central allow list. We intentionally didn't want 
each driver to declare itself, but have a central list where changes 
will get more scrutiny than random driver code.


But then there are the additional opt-ins for the low level firewall. 
These are in the API. I don't see how it could be done at the driver 
level, unless you want to pass in a struct device everywhere?



How about creating a defconfig that makes sense for TDX then?

TDX can be used in many different ways, I don't think a defconfig is
practical.

In theory you could do some Kconfig dependency (at the pain point of having
separate kernel binariees), but why not just do it at run time then if you
maintain the list anyways. That's much easier and saner for everyone. In the
past we usually always ended up with runtime mechanism for similar things
anyways.

Also it turns out that the filter mechanisms are needed for some arch
drivers which are not even configurable, so alone it's probably not enough,


I guess they aren't really needed though right, or you won't try to
filter them?


We're addressing most of them with the device filter for platform 
drivers. But since we cannot stop them doing ioremap IO in their init 
code they also need the low level firewall.


Some others that cannot be addressed have explicit disables.



So make them configurable?


Why not just fix the runtime? It's much saner for everyone. Proposing to 
do things at build time sounds like we're in Linux 0.99 days.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Andi Kleen



On 10/11/2021 12:58 AM, Christoph Hellwig wrote:

Just as last time:  This does not make any sense.  ioremap is shared
by definition.


It's not necessarily shared with the host for confidential computing: 
for example BIOS mappings definitely should not be shared, but they're 
using ioremap today.


But if you have a better term please propose something. I tried to 
clarify it with "shared_host", but I don't know a better term.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-10 Thread Andi Kleen




The connection is quite unfortunate IMHO.
Can't there be an option
that unbreaks drivers *without* opening up security holes by
making BIOS shared?


That would require new low level APIs that distinguish both cases, and a 
tree sweep.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-10 Thread Andi Kleen




To which Andi replied
One problem with removing the ioremap opt-in is that
it's still possible for drivers to get at devices without going through 
probe.

To which Greg replied:
https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
If there are in-kernel PCI drivers that do not do this, they need to be
fixed today.

Can you guys resolve the differences here?



I addressed this in my other mail, but we may need more discussion.




And once they are resolved, mention this in the commit log so
I don't get to re-read the series just to find out nothing
changed in this respect?

I frankly do not believe we are anywhere near being able to harden
an arbitrary kernel config against attack.


Why not? Device filter and the opt-ins together are a fairly strong 
mechanism.


And it's not that they're a lot of code or super complicated either.

You're essentially objecting to a single line change in your subsystem here.



How about creating a defconfig that makes sense for TDX then?


TDX can be used in many different ways, I don't think a defconfig is 
practical.


In theory you could do some Kconfig dependency (at the pain point of 
having separate kernel binariees), but why not just do it at run time 
then if you maintain the list anyways. That's much easier and saner for 
everyone. In the past we usually always ended up with runtime mechanism 
for similar things anyways.


Also it turns out that the filter mechanisms are needed for some arch 
drivers which are not even configurable, so alone it's probably not enough,




Anyone deviating from that better know what they are doing,
this API tweaking is just putting policy into the kernel  ...


Hardening drivers is kernel policy. It cannot be done anywhere else.


-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-10 Thread Andi Kleen



On 10/9/2021 1:39 PM, Dan Williams wrote:

On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:

On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:

From: Andi Kleen 

For Confidential VM guests like TDX, the host is untrusted and hence
the devices emulated by the host or any data coming from the host
cannot be trusted. So the drivers that interact with the outside world
have to be hardened by sharing memory with host on need basis
with proper hardening fixes.

For the PCI driver case, to share the memory with the host add
pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.

Signed-off-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 


So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
 One problem with removing the ioremap opt-in is that
 it's still possible for drivers to get at devices without going 
through probe.

To which Greg replied:
https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
 If there are in-kernel PCI drivers that do not do this, they need to be
 fixed today.

Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report.


The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
others) in init functions that also register drivers (thanks Elena for 
the number)


Some are probably old drivers that could be fixed, but it's quite a few 
legitimate cases. For example for platform or ISA drivers that's the 
only way they can be implemented because they often have no other 
enumeration mechanism. For PCI drivers it's rarer, but also still can 
happen. One example that comes to mind here is the x86 Intel uncore 
drivers, which support a mix of MSR, ioremap and PCI config space 
accesses all from the same driver. This particular example can (and 
should be) fixed in other ways, but similar things also happen in other 
drivers, and they're not all broken. Even for the broken ones they're 
usually for some crufty old devices that has very few users, so it's 
likely untestable in practice.


My point is just that the ecosystem of devices that Linux supports is 
messy enough that there are legitimate exceptions from the "First IO 
only in probe call only" rule.


And we can't just fix them all. Even if we could it would be hard to 
maintain.


Using a "firewall model" hooking into a few strategic points like we're 
proposing here is much saner for everyone.


Now we can argue about the details. Right now what we're proposing has 
some redundancies: it has both a device model filter and low level 
filter for ioremap (this patch and some others). The low level filter is 
for catching issues that don't clearly fit into the 
"enumeration<->probe" model. You could call that redundant, but I would 
call it defense in depth or better safe than sorry. In theory it would 
be enough to have the low level opt-in only, but that would have the 
drawback that is something gets enumerated after all you would have all 
kind of weird device driver failures and in some cases even killed 
guests. So I think it makes sense to have




Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings.


Only add it when the driver has been audited and hardened.

But I agree we need on a documented process for this. I will work on 
some documentation for a proposal. But essentially I think it should be 
some variant of what Elena has outlined in her talk at Security Summit.


https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf

That is using extra auditing/scrutiny at review time, supported with 
some static code analysis that points to the interaction points, and 
code needs to be fuzzed explicitly.


However short term it's only three virtio drivers, so this is not a 
urgent problem.



Let the new
device-authorization mechanism (with policy in userspace)



Default policy in user space just seems to be a bad idea here. Who 
should know if a driver is hardened other than the kernel? Maintaining 
the list somewhere else just doesn't make sense to me.


Also there is the more practical problem that some devices are needed 
for booting. For example in TDX we can't print something to the console 
with this mechanism, so you would never get any output before the 
initrd. Just seems like a nightmare for debugging anything. There really 
needs to be an authorization mechanism that works reasonably early.


I can see a point of having user space overrides though, but we need to 
have a 

Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-02 Thread Andi Kleen



On 10/2/2021 4:14 AM, Greg Kroah-Hartman wrote:

On Sat, Oct 02, 2021 at 07:04:28AM -0400, Michael S. Tsirkin wrote:

On Fri, Oct 01, 2021 at 08:49:28AM -0700, Andi Kleen wrote:

   Do you have a list of specific drivers and kernel options that you
feel you now "trust"?

For TDX it's currently only virtio net/block/console

But we expect this list to grow slightly over time, but not at a high rate
(so hopefully <10)

Well there are already >10 virtio drivers and I think it's reasonable
that all of these will be used with encrypted guests. The list will
grow.

What is keeping "all" drivers from being on this list?


It would be too much work to harden them all, and it would be pointless 
because all these drivers are never legitimately needed in a virtualized 
environment which only virtualize a very small number of devices.



  How exactly are
you determining what should, and should not, be allowed?


Everything that has had reasonable effort at hardening can be added. But 
if someone proposes to add a driver that should trigger additional 
scrutiny in code review. We should also request them to do some fuzzing.


It's a bit similar to someone trying to add a new syscall interface. 
That also triggers much additional scrutiny for good reasons and people 
start fuzzing it.




   How can
drivers move on, or off, of it over time?


Adding something is submitting a patch to the allow list.

I'm not sure the "off" case would happen, unless the driver is 
completely removed, or maybe it has some unfixable security problem. But 
that is all rather unlikely.





And why not just put all of that into userspace and have it pick and
choose?  That should be the end-goal here, you don't want to encode
policy like this in the kernel, right?


How would user space know what drivers have been hardened? This is 
really something that the kernel needs to determine. I don't think we 
can outsource it to anyone else.


Also BTW of course user space can still override it, but really the 
defaults should be a kernel policy.


There's also the additional problem that one of the goals of 
confidential guest is to just move existing guest virtual images into 
them without much changes. So it's better for such a case if as much as 
possible of the policy is in the kernel. But that's more a secondary 
consideration, the first point is really the important part.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-10-01 Thread Andi Kleen




Forget about trust for the moment.  Let's say the goal is to prevent
the kernel from creating any bindings other that those in some small
"allowed" set.  To fully specify one of the allowed bindings, you
would have to provide both a device ID and a driver name.  But in
practice this isn't necessary, since a device with a given ID will
bind to only one driver in almost all cases, and hence giving just
the device ID is enough.

So to do what they want, all that's needed is to forbid any bindings
except where the device ID is "allowed".  Or to put it another way,
where the device's authorized flag (which can be initialized based on
the device ID) is set.

(The opposite approach, in which the drivers are "allowed" rather
than the device IDs, apparently has already been discussed and
rejected.  I'm not convinced that was a good decision, but...)

Does this seem like a fair description of the situation?


Yes. That's roughly what the patchkit under discussion implements.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-01 Thread Andi Kleen



On 10/1/2021 12:03 AM, Greg Kroah-Hartman wrote:

On Thu, Sep 30, 2021 at 12:04:05PM -0700, Kuppuswamy, Sathyanarayanan wrote:


On 9/30/21 8:23 AM, Greg Kroah-Hartman wrote:

On Thu, Sep 30, 2021 at 08:18:18AM -0700, Kuppuswamy, Sathyanarayanan wrote:


On 9/30/21 6:36 AM, Dan Williams wrote:

And in particular, not all virtio drivers are hardened -
I think at this point blk and scsi drivers have been hardened - so
treating them all the same looks wrong.

My understanding was that they have been audited, Sathya?

Yes, AFAIK, it has been audited. Andi also submitted some patches
related to it. Andi, can you confirm.

What is the official definition of "audited"?


In our case (Confidential Computing platform), the host is an un-trusted
entity. So any interaction with host from the drivers will have to be
protected against the possible attack from the host. For example, if we
are accessing a memory based on index value received from host, we have
to make sure it does not lead to out of bound access or when sharing the
memory with the host, we need to make sure only the required region is
shared with the host and the memory is un-shared after use properly.

You have not defined the term "audited" here at all in any way that can
be reviewed or verified by anyone from what I can tell.

You have only described a new model that you wish the kernel to run in,
one in which it does not trust the hardware at all.  That is explicitly
NOT what the kernel has been designed for so far,


It has been already done for a few USB/TB drivers, but yes not for the 
majority of the kernel.



  and if you wish to
change that, lots of things need to be done outside of simply running
some fuzzers on a few random drivers.


The goal is to do similar work as USB/TB did, but do it for a small set 
of virtio drivers and use a custom allow list for those for the specific 
secure guest cases.


(there are some other goals, but let's not discuss them here for now)




For one example, how do you ensure that the memory you are reading from
hasn't been modified by the host between writing to it the last time you
did?


It's similar techniques as we do on user space accesses. For example if 
you bound check some value the code needs to ensure it is cached in 
private memory, not reread from MMIO or shared memory. Of course that's 
a good idea anyways for performance because MMIO is slow.


In the concrete cases of virtio the main problem was the free list in 
shared memory, but that has been addressed now.





  Do you have a list of specific drivers and kernel options that you
feel you now "trust"?


For TDX it's currently only virtio net/block/console

But we expect this list to grow slightly over time, but not at a high 
rate (so hopefully <10)




If so, how long does that trust last for?  Until
someonen else modifies that code?  What about modifications to functions
that your "audited" code touches?  Who is doing this auditing?  How do
you know the auditing has been done correctly?  Who has reviewed and
audited the tools that are doing the auditing?  Where is the
specification that has been agreed on how the auditing must be done?
And so on...


Well, I mean we already have a similar situation with user space APIs. 
So it's not a new problem. For those we've done it for many years, with 
audits and extra fuzzing.


There are people working on the audit and fuzzing today. How exactly it 
will be ensured long term is still be worked out, but I expect we can 
work out something.




I feel like there are a lot of different things all being mixed up here
into one "oh we want this to happen!" type of thread.




Agreed. The thread ended up about a lot of stuff which is outside the 
scope of the patches.



   Please let's just
stick to the one request that I had here, which was to move the way that
busses are allowed to authorize the devices they wish to control into a
generic way instead of being bus-specific logic.

Any requests outside of that type of functionality are just that,
outside the scope of this patchset and should get their own patch series
and discussion.



Yes that's the intention. This patch kit is only about controlling what 
devices can enumerate.


Also please let's avoid the "trusted" term. It's really misleading and 
confusing in the context of confidential computing.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Andi Kleen




If all you want to do is prevent someone from loading a bunch of
drivers that you have identified as unhardened, why not just use a
modprobe blacklist?


That wouldn't help for builtin drivers, we cannot control initcalls.

This LWN article has more details on the background.

https://lwn.net/Articles/865918/

-Andi



Am I missing something?

Alan Stern

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-09-30 Thread Andi Kleen



On 9/30/2021 12:04 PM, Kuppuswamy, Sathyanarayanan wrote:



On 9/30/21 8:23 AM, Greg Kroah-Hartman wrote:
On Thu, Sep 30, 2021 at 08:18:18AM -0700, Kuppuswamy, Sathyanarayanan 
wrote:



On 9/30/21 6:36 AM, Dan Williams wrote:

And in particular, not all virtio drivers are hardened -
I think at this point blk and scsi drivers have been hardened - so
treating them all the same looks wrong.

My understanding was that they have been audited, Sathya?


Yes, AFAIK, it has been audited. Andi also submitted some patches
related to it. Andi, can you confirm.


What is the official definition of "audited"?



In our case (Confidential Computing platform), the host is an un-trusted
entity. So any interaction with host from the drivers will have to be
protected against the possible attack from the host. For example, if we
are accessing a memory based on index value received from host, we have
to make sure it does not lead to out of bound access or when sharing the
memory with the host, we need to make sure only the required region is
shared with the host and the memory is un-shared after use properly.

Elena can share more details, but it was achieved with static analysis
and fuzzing. Here is a presentation she is sharing about the work at the
Linux Security Summit:
https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf 



Andi, can talk more about the specific driver changes that came out of 
this

effort.


The original virtio was quite easy to exploit because it put its free 
list into the shared ring buffer.


We had a patchkit to harden virtio originally, but after some discussion 
we instead switched to Jason Wang's patchkit to move the virtio metadata 
into protected memory, which fixed near all of the issues. These patches 
have been already merged. There is one additional patch to limit the 
virtio modes.


There's an ongoing effort to audit (mostly finished I believe) and fuzz 
the three virtio drivers (fuzzing is still ongoing).


There was also a range of changes outside virtio for code outside the 
device model. Most of it was just disabling it though.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-09-30 Thread Andi Kleen



On 9/30/2021 8:18 AM, Kuppuswamy, Sathyanarayanan wrote:



On 9/30/21 6:36 AM, Dan Williams wrote:

And in particular, not all virtio drivers are hardened -
I think at this point blk and scsi drivers have been hardened - so
treating them all the same looks wrong.

My understanding was that they have been audited, Sathya?


Yes, AFAIK, it has been audited. Andi also submitted some patches
related to it. Andi, can you confirm.

We also authorize the virtio at PCI ID level. And currently we allow
console, block and net virtio PCI devices.

{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_NET) },
{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_BLOCK) },
{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, VIRTIO_TRANS_ID_CONSOLE) },

The only drivers that are being audited and fuzzed are these three 
virtio drivers (in addition to some other x86 code outside the driver model)


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Andi Kleen




I don't think the current mitigations under discussion here are about
keeping the system working. In fact most encrypted VM configs tend to
stop booting as a preferred way to handle security issues.


Maybe we should avoid the "trusted" term here. We're only really using 
it because USB is using it and we're now using a common framework like 
Greg requested. But I don't think it's the right way to think about it.


We usually call the drivers "hardened". The requirement for a hardened 
driver is that all interactions through MMIO/port/config space IO/MSRs 
are sanitized and do not cause memory safety issues or other information 
leaks. Other than that there is no requirement on the functionality. In 
particular DOS is ok since a malicious hypervisor can decide to not run 
the guest at any time anyways.


Someone loading an malicious driver inside the guest would be out of 
scope. If an attacker can do that inside the guest you already violated 
the security mechanisms and there are likely easier ways to take over 
the guest or leak data.


The goal of the device filter mechanism is to prevent loading unhardened 
drivers that could be exploited without them being themselves malicious.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Andi Kleen



On 9/30/2021 10:23 AM, Greg Kroah-Hartman wrote:

On Thu, Sep 30, 2021 at 10:17:09AM -0700, Andi Kleen wrote:

The "it" that I referred to is the claim that no driver should be
touching hardware in their module init call. Andi seems to think
such drivers are worth working around with a special remap API.

Andi is wrong.

While overall it's a small percentage of the total, there are still quite a
few drivers that do touch hardware in init functions. Sometimes for good
reasons -- they need to do some extra probing to discover something that is
not enumerated -- sometimes just because it's very old legacy code that
predates the modern driver model.

Are any of them in the kernel today?

PCI drivers should not be messing with this, we have had well over a
decade to fix that up.



It's not just PCI driver, it's every driver that can do io port / MMIO / 
MSR / config space accesses.



Maybe read the excellent article from Jon on this:

https://lwn.net/Articles/865918/





The legacy drivers could be fixed, but nobody really wants to touch them
anymore and they're impossible to test.

Pointers to them?


For example if you look over old SCSI drivers in drivers/scsi/*.c there 
is a substantial number that has a module init longer than just 
registering a driver. As a single example look at drivers/scsi/BusLogic.c


There were also quite a few platform drivers like this.





The drivers that probe something that is not enumerated in a standard way
have no choice, it cannot be implemented in a different way.

PCI devices are not enumerated in a standard way???


The pci devices are enumerated in a standard way, but typically the 
driver also needs something else outside PCI that needs some other 
probing mechanism.






So instead we're using a "firewall" the prevents these drivers from doing
bad things by not allowing ioremap access unless opted in, and also do some
filtering on the IO ports The device filter is still the primary mechanism,
the ioremap filtering is just belts and suspenders for those odd cases.

That's horrible, don't try to protect the kernel from itself.  Just fix
the drivers.


I thought we had already established this last time we discussed it.

That's completely impractical. We cannot harden thousands of drivers, 
especially since it would be all wasted work since nobody will ever need 
them in virtual guests. Even if we could harden them how would such a 
work be maintained long term? Using a firewall and filtering mechanism 
is much saner for everyone.


-Andi




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Andi Kleen




The "it" that I referred to is the claim that no driver should be
touching hardware in their module init call. Andi seems to think
such drivers are worth working around with a special remap API.

Andi is wrong.


While overall it's a small percentage of the total, there are still 
quite a few drivers that do touch hardware in init functions. Sometimes 
for good reasons -- they need to do some extra probing to discover 
something that is not enumerated -- sometimes just because it's very old 
legacy code that predates the modern driver model.


The legacy drivers could be fixed, but nobody really wants to touch them 
anymore and they're impossible to test.


The drivers that probe something that is not enumerated in a standard 
way have no choice, it cannot be implemented in a different way.


So instead we're using a "firewall" the prevents these drivers from 
doing bad things by not allowing ioremap access unless opted in, and 
also do some filtering on the IO ports The device filter is still the 
primary mechanism, the ioremap filtering is just belts and suspenders 
for those odd cases.


If you want we can send an exact list, we did some analysis using a 
patched smatch tool.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-09-24 Thread Andi Kleen




Hmm, yes that's true. I guess we can make it default to opt-in for
pci_iomap.

It only really matters for device less ioremaps.

OK. And same thing for other things with device, such as
devm_platform_ioremap_resource.
If we agree on all that, this will basically remove virtio
changes from the picture ;)


Hi we revisited this now. One problem with removing the ioremap opt-in 
is that it's still possible for drivers to get at devices without going 
through probe. For example they can walk the PCI device list. Some 
drivers do that for various reasons. So if we remove the opt-in we would 
need to audit and possibly fix all that, which would be potentially a 
lot of churn. That's why I think it's better to keep the opt-in.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-09-10 Thread Andi Kleen




And we've been avoiding that drivers can self declare auditing, we've been
trying to have a separate centralized list so that it's easier to enforce
and avoids any cut'n'paste mistakes.

-Andi

Now I'm confused. What is proposed here seems to be basically that,
drivers need to declare auditing by replacing ioremap with
ioremap_shared.

Auditing is declared on the device model level using a central allow list.

Can we not have an init call allow list instead of, or in addition to, a
device allow list?



That would be quite complicated and intrusive. In fact I'm not even sure 
how to do maintain something like this. There are a lot of needed 
initcalls, they would all need to be marked. How can we distinguish 
them? It would be a giant auditing project. And of course how would you 
prevent it from bitrotting?



Basically it would be hundreds of changes all over the tree, just to 
avoid two changes in virtio and MSI. Approach of just stopping the 
initcalls from doing bad things is much less intrusive.





But this cannot do anything to initcalls that run before probe,

Can't we extend module_init so init calls are validated against the
allow list?


See above.


Also the problem isn't really with modules (we rely on udev not loading 
them), but with builtin initcalls






that's why
an extra level of defense of ioremap opt-in is useful.

OK even assuming this, why is pci_iomap opt-in useful?
That never happens before probe - there's simply no pci_device then.



Hmm, yes that's true. I guess we can make it default to opt-in for 
pci_iomap.


It only really matters for device less ioremaps.



It looks suspiciously like drivers self-declaring auditing to me which
we both seem to agree is undesirable. What exactly is the difference?



Just allow listing the ioremaps is not self declaration because the 
device will still not initialize due to the central device filter. If 
you want to use it that has to be changed.


It's just an additional safety net to contain code running before probe.




Or are you just trying to disable anything that runs before probe?



Well anything that could do dangerous host interactions (like processing 
ioremap data) A lot of things are harmless and can be allowed, or 
already blocked elsewhere (e.g. we have a IO port filter). This just 
handles the ioremap/MMIO case.



In that case I don't see a reason to touch pci drivers though.
These should be fine with just the device model list.



That won't stop initcalls.


-Andi





___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-30 Thread Andi Kleen



On 8/30/2021 1:59 PM, Michael S. Tsirkin wrote:



Or we can add _audited to the name. ioremap_shared_audited?

But it's not the mapping that has to be done in handled special way.
It's any data we get from device, not all of it coming from IO, e.g.
there's DMA and interrupts that all have to be validated.
Wouldn't you say that what is really wanted is just not running
unaudited drivers in the first place?



Yes.





And we've been avoiding that drivers can self declare auditing, we've been
trying to have a separate centralized list so that it's easier to enforce
and avoids any cut'n'paste mistakes.

-Andi

Now I'm confused. What is proposed here seems to be basically that,
drivers need to declare auditing by replacing ioremap with
ioremap_shared.


Auditing is declared on the device model level using a central allow list.

But this cannot do anything to initcalls that run before probe, that's 
why an extra level of defense of ioremap opt-in is useful. But it's not 
the primary mechanism to declare a driver audited, that's the allow 
list. The ioremap is just another mechanism to avoid having to touch a 
lot of legacy drivers.


If we agree on that then the original proposed semantics of 
"ioremap_shared" may be acceptable?


-Andi



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-29 Thread Andi Kleen


On 8/29/2021 3:26 PM, Michael S. Tsirkin wrote:

On Sun, Aug 29, 2021 at 09:17:53AM -0700, Andi Kleen wrote:

Also I changing this single call really that bad? It's not that we changing
anything drastic here, just give the low level subsystem a better hint about
the intention. If you don't like the function name, could make it an
argument instead?

My point however is that the API should say that the
driver has been audited,


We have that status in the struct device. If you want to tie the ioremap 
to that we could define a ioremap_device() with a device argument and 
decide based on that.


Or we can add _audited to the name. ioremap_shared_audited?


not that the mapping has been
done in some special way. For example the mapping can be
in some kind of wrapper, not directly in the driver.
However you want the driver validated, not the wrapper.

Here's an idea:



I don't think magic differences of API behavior based on some define are 
a good idea.  That's easy to miss.


That's a "COME FROM" in API design.

Also it wouldn't handle the case that a driver has both private and 
shared ioremaps, e.g. for BIOS structures.


And we've been avoiding that drivers can self declare auditing, we've 
been trying to have a separate centralized list so that it's easier to 
enforce and avoids any cut'n'paste mistakes.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-29 Thread Andi Kleen

All this makes sense but ioremap is such a random place to declare
driver has been audited, and it's baked into the binary with no way for
userspace to set policy.

Again all we will end up with is gradual replacement of all ioremap
calls with ioremap_shared as people discover a given driver does not
work in a VM.


No the device filter will prevent that. They would need to submit 
patches to the central list.


Or they can override it at the command line, but there is already an 
option to force all ioremaps to be shared. So if you just want some 
driver to run without caring about security you can already do that 
without modifying it.


Besides the shared concept only makes sense for virtual devices, so if 
you don't have a device model for a device this will never happen either.


So I don't think your scenario of all ioremaps becoming shared will ever 
happen.




How are you going to know driver has actually been
audited? what the quality of the audit was? did the people doing the
auditing understand what they are auditing for?  No way, right?


First the primary purpose of the ioremap policy is to avoid messing with 
all the legacy drivers (which is 99+% of the code base)


How to handle someone maliciously submitting a driver to the kernel is a 
completely different problem. To some degree there is trust of course. 
If someone says they audited and a maintainer trusts them with their 
statement, but they actually didn't there is a problem, but it's larger 
than just TDX. But in such a case the community code audit will also 
focus on such areas, and people interested in confidential computing 
security will also start taking a look.


And we're also working on fuzzing, so these drivers will be fuzzed at 
some point, so mistakes will be eventually found.


But in any case the ioremap policy is mainly to prevent having to handle 
this for all legacy drivers.


I would rather change the few drivers that are actually needed, than all 
the other drivers.


That's really the fundamental problem this is addressing: how to get 
reasonable security with all the legacy drivers out of the box without 
touching them.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-29 Thread Andi Kleen




Let's be frank, even without encryption disabling most drivers -
especially weird ones that poke at hardware before probe -
is far safer than keeping them, but one loses a bunch of features.


Usually we don't lose features at all. None of the legacy drivers are 
needed on a guest (or even a modern native system). It's all just all 
for old hardware. Maybe in 20+ years it can be all removed, but we can't 
wait that long.



IOW all this hardening is nice but which security/feature tradeoff
to take it a policy decision, not something kernel should do
imho.


There's no mechanism to push this kind of policy to user space. Users 
don't have control what initcalls run. At the time they execute there 
isn't even any user space yet.


Even if they could somehow control them it's very unlikely they would 
understand them and make an informed decision.


Doing it at build time is not feasible either since we want to run on 
standard distribution kernels.


For modules we have a policy mechanism (prevent udev probing by 
preventing enumeration), and that is implemented, but only handling 
modules is not enough. The compiled in drivers have to be handled too, 
otherwise you have gaping holes in the protection. We don't prevent 
users manually loading modules that might probe, but that is a policy 
decision that users actually sensibly make in user space.


Also I changing this single call really that bad? It's not that we 
changing anything drastic here, just give the low level subsystem a 
better hint about the intention. If you don't like the function name, 
could make it an argument instead?


-Andi






___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Andi Kleen



On 8/24/2021 1:31 PM, Bjorn Helgaas wrote:

On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:

On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:

[+cc Rajat; I still don't know what "shared memory with a hypervisor
in a confidential guest" means,

A confidential guest is a guest which uses memory encryption to isolate
itself from the host. It doesn't trust the host. But it still needs to
communicate with the host for IO, so it has some special memory areas that
are explicitly marked shared. These are used to do IO with the host. All
their usage needs to be carefully hardened to avoid any security attacks on
the guest, that's why we want to limit this interaction only to a small set
of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.

Good material for the commit log next time around.  Thanks!


This is all in the patch intro too, which should make it into the merge 
commits.


I don't think we can reexplain the basic concepts for every individual 
patch in a large patch kit.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Andi Kleen



On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:

[+cc Rajat; I still don't know what "shared memory with a hypervisor
in a confidential guest" means,


A confidential guest is a guest which uses memory encryption to isolate 
itself from the host. It doesn't trust the host. But it still needs to 
communicate with the host for IO, so it has some special memory areas 
that are explicitly marked shared. These are used to do IO with the 
host. All their usage needs to be carefully hardened to avoid any 
security attacks on the guest, that's why we want to limit this 
interaction only to a small set of hardened drivers. For MMIO, the set 
is currently only virtio and MSI-X.


-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Andi Kleen




I see. Hmm. It's a bit of a random thing to do it at the map time
though. E.g. DMA is all handled transparently behind the DMA API.
Hardening is much more than just replacing map with map_shared
and I suspect what you will end up with is basically
vendors replacing map with map shared to make things work
for their users and washing their hands.


That concept exists too. There is a separate allow list for the drivers. 
So just adding shared to a driver is not enough, until it's also added 
to the allowlist


Users can of course chose to disable the allowlist, but they need to 
understand the security implications.





I would say an explicit flag in the driver that says "hardened"
and refusing to init a non hardened one would be better.



We have that too (that's the device filtering)

But the problem is that device filtering just stops the probe functions, 
not the initcalls, and lot of legacy drivers do MMIO interactions before 
going into probe. In some cases it's unavoidable because of the device 
doesn't have a separate enumeration mechanism it needs some kind of 
probing to even check for its existence And since we don't want to 
change all of them it's far safer to make the ioremap opt-in.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Andi Kleen



On 8/24/2021 12:07 AM, Christoph Hellwig wrote:

On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote:


On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:

Add a new variant of pci_iomap for mapping all PCI resources
of a devices as shared memory with a hypervisor in a confidential
guest.

Signed-off-by: Andi Kleen
Signed-off-by: Kuppuswamy 
Sathyanarayanan

I'm a bit puzzled by this part. So why should the guest*not*  map
pci memory as shared? And if the answer is never (as it seems to be)
then why not just make regular pci_iomap DTRT?

It is in the context of confidential guest (where VMM is un-trusted). So
we don't want to make all PCI resource as shared. It should be allowed
only for hardened drivers/devices.

Well, assuming the host can do any damage when mapped shared that also
means not mapping it shared will completely break the drivers.


There are several cases:

- We have driver filtering active to protect you against attacks from 
the host against unhardened drivers.


In this case the drivers not working is the intended behavior.

- There is an command allow list override for some new driver, but the 
driver is hardened and shared


The other drivers will still not work, but that's also the intended behavior

- Driver filtering is disabled or the allow list override is used to 
enable some non hardened/enabled driver


There is a command line option to override the ioremap sharing default, 
it will allow all drivers to do ioremap. We would really prefer to make 
it more finegrained, but it's not possible in this case. Other drivers 
are likely attackable.


- Driver filtering is disabled (allowing attacks on the drivers) and the 
command line option for forced sharing is set.


All drivers initialize and can talk to the host through MMIO. Lots of 
unhardened drivers are likely attackable.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-23 Thread Andi Kleen



On 8/23/2021 6:04 PM, Dan Williams wrote:

On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
 wrote:



On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:

Add a new variant of pci_iomap for mapping all PCI resources
of a devices as shared memory with a hypervisor in a confidential
guest.

Signed-off-by: Andi Kleen
Signed-off-by: Kuppuswamy 
Sathyanarayanan

I'm a bit puzzled by this part. So why should the guest*not*  map
pci memory as shared? And if the answer is never (as it seems to be)
then why not just make regular pci_iomap DTRT?

It is in the context of confidential guest (where VMM is un-trusted). So
we don't want to make all PCI resource as shared. It should be allowed
only for hardened drivers/devices.

That's confusing, isn't device authorization what keeps unaudited
drivers from loading against untrusted devices? I'm feeling like
Michael that this should be a detail that drivers need not care about
explicitly, in which case it does not need to be exported because the
detail can be buried in lower levels.


We originally made it default (similar to AMD), but it during code audit 
we found a lot of drivers who do ioremap early outside the probe 
function. Since it would be difficult to change them all we made it 
opt-in, which ensures that only drivers that have been enabled can talk 
with the host at all and can't be attacked. That made the problem of 
hardening all these drivers a lot more practical.


Currently we only really need virtio and MSI-X shared, so for changing 
two places in the tree you avoid a lot of headache elsewhere.


Note there is still a command line option to override if you want to 
allow and load other drivers.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc

2021-08-12 Thread Andi Kleen




Why is it safe to make pci_iomap_wc_range() support IO ports when it
didn't before?  That might be desirable, but I think it *is* a
functional change here.


None of the callers use it to map IO ports, so it will be a no-op for 
them. But you're right, it's a (minor) functional change.


-Andi




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andi Kleen




For most Linux drivers, a report that a misbehaving device can corrupt
host memory is a bug, not a feature.  If a USB device can corrupt kernel
memory, that's a serious bug.  If a USB-C device can corrupt kernel
memory, that's also a serious bug, although, sadly, we probably have
lots of these bugs.  If a Firewire device can corrupt kernel memory,
news at 11.  If a Bluetooth or WiFi peer can corrupt kernel memory,
people write sonnets about it and give it clever names.  Why is virtio
special?


Well for most cases it's pointless because they don't have any memory 
protection anyways.


Why break compatibility if it does not buy you anything?

Anyways if you want to enable the restricted mode for something else, 
it's easy to do. The cases where it matters seem to already work on it, 
like the user space virtio ring.


My changes for boundary checking are enabled unconditionally anyways, as 
well as the other patchkits.





This one:

int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}

I'm looking at a fairly recent kernel, and I don't see anything for s390
wired up in vring_use_dma_api.


It's not using vring_use_dma_api, but enforces the DMA API at virtio 
ring setup time, same as SEV/TDX.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andi Kleen



We do not need an increasing pile of kludges


Do you mean disabling features is a kludge?

If yes I disagree with that characterization.



to make TDX and SEV “secure”.  We need the actual loaded driver to be secure.  
The virtio architecture is full of legacy nonsense,
and there is no good reason for SEV and TDX to be a giant special case.


I don't know where you see a "giant special case". Except for the 
limited feature negotiation all the changes are common, and the 
disabling of features (which is not new BTW, but already done e.g. with 
forcing DMA API in some cases) can be of course used by all these other 
technologies too. But it just cannot be done by default for everything 
because it would break compatibility. So every technology with such 
requirements has to explicitly opt-in.





As I said before, real PCIe (Thunderbolt/USB-C or anything else) has the exact 
same problem.  The fact that TDX has encrypted memory is, at best, a poor proxy 
for the actual condition.  The actual condition is that the host does not trust 
the device to implement the virtio protocol correctly.


Right they can do similar limitations of feature sets. But again it 
cannot be default.







TDX and SEV use the arch hook to enforce DMA API, so that part is also
solved.


Can you point me to the code you’re referring to?


See 4/8 in this patch kit. It uses an existing hook which is already 
used in tree by s390.



-Andi



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andi Kleen



Tell that to every crypto downgrade attack ever.


That's exactly what this patch addresses.



I see two credible solutions:

1. Actually harden the virtio driver.
That's exactly what this patchkit, and the alternative approaches, like 
Jason's, are doing.


2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
rename the old driver virtio-legacy or virtio-insecure.  They can share code.


In most use cases the legacy driver is not insecure because there is no 
memory protection anyways.


Yes maybe such a split would be a good idea for maintenance and maybe 
performance reasons, but at least from the security perspective I don't 
see any need for it.




Another snag you may hit: virtio’s heuristic for whether to use proper DMA ops 
or to bypass them is a giant kludge. I’m very slightly optimistic that getting 
the heuristic wrong will make the driver fail to operate but won’t allow the 
host to take over the guest, but I’m not really convinced. And I wrote that 
code!  A virtio-modern mode probably should not have a heuristic, and the 
various iommu-bypassing modes should be fixed to work at the bus level, not the 
device level


TDX and SEV use the arch hook to enforce DMA API, so that part is also 
solved.



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andi Kleen



On 6/3/2021 10:33 AM, Andy Lutomirski wrote:

On 6/2/21 5:41 PM, Andi Kleen wrote:

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
an malicious host to force anything else. Also disallow
indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."


My understanding is most of the other modes (except for split with separate 
descriptors) are obsolete and just there for compatibility. As long as they're 
deprecated they won't harm anyone.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andi Kleen



Ok, but what I meant is this, if we don't read from the descriptor 
ring, and validate all the other metadata supplied by the device (used 
id and len). Then there should be no way for the device to suppress 
the dma flags to write to the indirect descriptor table.


Or do you have an example how it can do that?


I don't. If you can validate everything it's probably ok

The only drawback is even more code to audit and test.

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 6/8] dma: Add return value to dma_unmap_page

2021-06-03 Thread Andi Kleen




What it looks like to me is abusing SWIOTLB's internal housekeeping to 
keep track of virtio-specific state. The DMA API does not attempt to 
validate calls in general since in many cases the additional overhead 
would be prohibitive. It has always been callers' responsibility to 
keep track of what they mapped and make sure sync/unmap calls match, 
and there are many, many, subtle and not-so-subtle ways for things to 
go wrong if they don't. If virtio is not doing a good enough job of 
that, what's the justification for making it the DMA API's problem?


In this case it's not prohibitive at all. Just adding a few error 
returns, and checking the overlap (which seems to have been already 
solved anyways) I would argue the error returns are good practice 
anyways, so that API users can check that something bad happening and 
abort.  The DMA API was never very good at proper error handling, but 
there's no reason at all to continue being bad it forever.


AFAIK the rest just works anyways, so it's not really a new problem to 
be solved.





A new callback is used to avoid changing all the IOMMU drivers.


Nit: presumably by "IOMMU drivers" you actually mean arch DMA API 
backends?

Yes


 Furthermore, AFAICS it's still not going to help against exfiltrating 
guest memory by over-unmapping the original SWIOTLB slot *without* 
going past the end of the whole buffer,


That would be just exfiltrating data that is already shared, unless I'm 
misunderstanding you.


-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen





I agree, but I want to know why indirect descriptor needs to be 
disabled. The table can't be wrote by the device since it's not 
coherent swiotlb mapping.


I had all kinds of problems with uninitialized entries in the indirect 
table. So I gave up on it and concluded it would be too difficult to secure.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Andi Kleen



It looks to me all the evils came from the fact that we depends on the 
descriptor ring.


So the checks in this patch could is unnecessary if we don't even read 
from the descriptor ring which could be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html


I would argue that you should boundary check in any case. It was always 
a bug to not have boundary checks in such a data structure with multiple 
users, trust or not.


But yes your patch series is interesting and definitely makes sense for 
TDX too.


Best would be to have both I guess, and always check the boundaries 
everywhere.


So what's the merge status of your series?

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks

2021-06-02 Thread Andi Kleen



On 6/2/2021 6:48 PM, Konrad Rzeszutek Wilk wrote:

On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote:

swiotlb currently only uses the start address of a DMA to check if something
is in the swiotlb or not. But with virtio and untrusted hosts the host
could give some DMA mapping that crosses the swiotlb boundaries,
potentially leaking or corrupting data. Add size checks to all the swiotlb
checks and reject any DMAs that cross the swiotlb buffer boundaries.

I seem to be only CC-ed on this and #7, so please bear with me.
You weren't cc'ed originally so if you get partial emails it must be 
through some list.


But could you explain to me why please:

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12)
Author: Martin Radev 
Date:   Tue Jan 12 16:07:29 2021 +0100

 swiotlb: Validate bounce size in the sync/unmap path

does not solve the problem as well?


Thanks. I missed that patch, race condition.

One major difference of my patch is that it supports an error return, 
which allows virtio to error out. This is important in virtio because 
otherwise you'll end up with uninitialized memory on the target without 
any indication. This uninitialized memory could be an potential attack 
vector on the guest memory, e.g. if the attacker finds some way to echo 
it out again.


But the error return could be added to your infrastructure too and what 
would make this patch much shorter. I'll take a look at that.


-Andi





___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Virtio hardening for TDX

2021-06-02 Thread Andi Kleen



Note that it's probably needed by other cases as well:

1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs


Right. I don't see any reason why these shouldn't work. You may just 
need to add the enable for the lockdown, but you can reuse the basic 
infrastructure.




We have already had discussions and some patches have been 
posted[2][3][4].


Thanks.

Yes [2] is indeed an alternative. We considered this at some point, but 
since we don't care about DOS in our case it seemed simpler to just 
harden the existing code.  But yes if it's there it's useful for TDX too.


FWIW I would argue that the descriptor boundary checking should be added 
in any case, security case or separated metadata or not, because it can 
catch bugs and is very cheap. Checking boundaries is good practice.


[4] would be an independent issue, that's something we didn't catch.

Also the swiotlb hardening implemented in this patchkit doesn't seem to 
be in any of the other patches.


So I would say my patches are mostly orthogonal to these patches below 
and not conflicting, even though they address a similar problem space.


-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen


So we will see huge performance regression without indirect 
descriptor. We need to consider to address this.


A regression would be when some existing case would be slower.

That's not the case because the behavior for the existing cases does not 
change.


Anyways when there are performance problems they can be addressed, but 
first is to make it secure.


-Andi




Thanks



  break;
  case VIRTIO_RING_F_EVENT_IDX:
  break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct 
virtio_device *vdev)

  case VIRTIO_F_ACCESS_PLATFORM:
  break;
  case VIRTIO_F_RING_PACKED:
+    if (protected_guest_has(VM_MEM_ENCRYPT))
+    goto clear;
  break;
  case VIRTIO_F_ORDER_PLATFORM:
  break;
+    clear:
  default:
  /* We don't understand this bit. */
  __virtio_clear_bit(vdev, i);



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v1 5/8] dma: Use size for swiotlb boundary checks

2021-06-02 Thread Andi Kleen
swiotlb currently only uses the start address of a DMA to check if something
is in the swiotlb or not. But with virtio and untrusted hosts the host
could give some DMA mapping that crosses the swiotlb boundaries,
potentially leaking or corrupting data. Add size checks to all the swiotlb
checks and reject any DMAs that cross the swiotlb buffer boundaries.

Signed-off-by: Andi Kleen 
---
 drivers/iommu/dma-iommu.c   | 13 ++---
 drivers/xen/swiotlb-xen.c   | 11 ++-
 include/linux/dma-mapping.h |  4 ++--
 include/linux/swiotlb.h |  8 +---
 kernel/dma/direct.c |  8 
 kernel/dma/direct.h |  8 
 kernel/dma/mapping.c|  4 ++--
 net/xdp/xsk_buff_pool.c |  2 +-
 8 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..7ef13198721b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(phys, size)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(phys, size))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(phys, size))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(sg_phys(sg), sg->length))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device 
*dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(sg_phys(sg), sg->length))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
-
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
}
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 24d11861ac7d..333846af8d35 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t 
p, size_t size)
return 0;
 }
 
-static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
+static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr,
+size_t size)
 {
unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
unsigned long xen_pfn = bfn_to_local_pfn(bfn);
@@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(paddr, size);
return 0;
 }
 
@@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
}
 
/* NOTE: We use dev_addr here, not paddr! */
-   if (is_xen_swiotlb_buffer(hwdev, dev_addr))
+   if (is_xen_swiotlb_buffer(hwdev, dev_addr, size))
swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
 }
 
@@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, 
dma_addr_t dma_addr,
xen_dma_sync_for_cpu(dev, dma_addr, size, dir);
}
 
-   if (is_xen_swiotlb

[PATCH v1 3/8] virtio: Harden split buffer detachment

2021-06-02 Thread Andi Kleen
Harden the split buffer detachment path by adding boundary checking. Note
that when this fails we may fail to unmap some swiotlb mapping, which could
result in a leak and a DOS. But that's acceptable because an malicious host
can DOS us anyways.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d37ff5a0ff58..1e9aa1e95e1b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue 
*_vq)
return needs_kick;
 }
 
-static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
-void **ctx)
+static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+   void **ctx)
 {
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
 
+   /* We'll leak DMA mappings when this happens, but nothing
+* can be done about that. In the worst case the host
+* could DOS us, but it can of course do that anyways.
+*/
+   if (!inside_split_ring(vq, head))
+   return -EIO;
+
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
 
@@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
while (vq->split.vring.desc[i].flags & nextflag) {
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
+   if (!inside_split_ring(vq, i))
+   return -EIO;
vq->vq.num_free++;
}
 
@@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
 
/* Free the indirect table, if any, now that it's unmapped. */
if (!indir_desc)
-   return;
+   return 0;
 
len = virtio32_to_cpu(vq->vq.vdev,
vq->split.vring.desc[head].len);
@@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
} else if (ctx) {
*ctx = vq->split.desc_state[head].indir_desc;
}
+   return 0;
 }
 
 static inline bool more_used_split(const struct vring_virtqueue *vq)
@@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
void *ret;
unsigned int i;
u16 last_used;
+   int err;
 
START_USE(vq);
 
@@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
 
/* detach_buf_split clears data, so grab it now. */
ret = vq->split.desc_state[i].data;
-   detach_buf_split(vq, i, ctx);
+   err = detach_buf_split(vq, i, ctx);
+   if (err) {
+   END_USE(vq);
+   return NULL;
+   }
+
vq->last_used_idx++;
/* If we expect an interrupt for the next entry, tell host
 * by writing event index and flush out the write before
@@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct 
virtqueue *_vq)
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
detach_buf_split(vq, i, NULL);
+   /* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
-- 
2.25.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 8/8] virtio: Error out on endless free lists

2021-06-02 Thread Andi Kleen
Error out with a warning when the free list loops longer
than the maximum size while freeing descriptors. While technically
we don't care about DOS it is still better to abort it early.

We ran into this problem while fuzzing the virtio interactions
where the fuzzed code would get stuck for a long time.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 244a5b62d85c..96adaa4c5404 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,6 +685,11 @@ static int detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
if (!inside_split_ring(vq, i))
return -EIO;
vq->vq.num_free++;
+   if (WARN_ONCE(vq->vq.num_free >
+   vq->split.queue_size_in_bytes /
+   sizeof(struct vring_desc),
+   "Virtio freelist corrupted"))
+   return -EIO;
}
 
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
-- 
2.25.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen
When running under TDX the virtio host is untrusted. The bulk
of the kernel memory is encrypted and protected, but the virtio
ring is in special shared memory that is shared with the
untrusted host.

This means virtio needs to be hardened against any attacks from
the host through the ring. Of course it's impossible to prevent DOS
(the host can chose at any time to stop doing IO), but there
should be no buffer overruns or similar that might give access to
any private memory in the guest.

virtio has a lot of modes, most are difficult to harden.

The best for hardening seems to be split mode without indirect
descriptors. This also simplifies the hardening job because
it's only a single code path.

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
an malicious host to force anything else. Also disallow
indirect mode for similar reasons.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef DEBUG
@@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
unsigned int i;
 
for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
+
+   /*
+* In protected guest mode disallow packed or indirect
+* because they ain't hardened.
+*/
+
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_RING_F_EVENT_IDX:
break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
case VIRTIO_F_ACCESS_PLATFORM:
break;
case VIRTIO_F_RING_PACKED:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_F_ORDER_PLATFORM:
break;
+   clear:
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.25.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX

2021-06-02 Thread Andi Kleen
In virtio the host decides whether the guest uses the DMA
API or not using the strangely named VIRTIO_F_ACCESS_PLATFORM
bit (which really indicates whether the DMA API is used or not)

For hardened virtio on TDX we want to enforce that that swiotlb is
always used, which requires using the DMA API.  While IO wouldn't
really work without the swiotlb, it might be possible that an
attacker forces swiotlbless IO to manipulate memory in the guest.

So we want to force the DMA API (which then forces swiotlb),
but without relying on the host.

There is already an arch_has_restricted_memory_acces hook for
this, which is currently used only by s390. Enable
the config option for the hook for x86 and enable it for TDX.

Signed-off-by: Andi Kleen 
---
 arch/x86/Kconfig | 1 +
 arch/x86/mm/mem_encrypt_common.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1531a0f905ed..3d804fce31b9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select X86_MEM_ENCRYPT_COMMON
+   select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
help
  Provide support for running in a trusted domain on Intel processors
  equipped with Trusted Domain eXtenstions. TDX is a new Intel
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 24c9117547b4..2244d1f033ab 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
@@ -37,3 +38,9 @@ void __init mem_encrypt_init(void)
amd_mem_encrypt_init();
 }
 
+int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}
+EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
+
-- 
2.25.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 7/8] virtio: Abort IO when descriptor points outside forced swiotlb

2021-06-02 Thread Andi Kleen
Now that we have a return value for unmapping DMA mappings that
are outside the forced swiotlb, use that to abort the IO operation.

This prevents the host from subverting a read to access some
data in the guest address space, which it might then get access somehow in
another IO operation. It can subvert reads to point to other
reads or other writes, but since it controls IO it can do
that anyways.

This is only done for the split code path, which is the only
one supported with confidential guests.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e9aa1e95e1b..244a5b62d85c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -365,29 +365,31 @@ static int vring_mapping_error(const struct 
vring_virtqueue *vq,
  * Split ring specific functions - *_split().
  */
 
-static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+static int vring_unmap_one_split(const struct vring_virtqueue *vq,
  struct vring_desc *desc)
 {
u16 flags;
+   int ret;
 
if (!vq->use_dma_api)
-   return;
+   return 0;
 
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
if (flags & VRING_DESC_F_INDIRECT) {
-   dma_unmap_single(vring_dma_dev(vq),
+   ret = dma_unmap_single(vring_dma_dev(vq),
 virtio64_to_cpu(vq->vq.vdev, desc->addr),
 virtio32_to_cpu(vq->vq.vdev, desc->len),
 (flags & VRING_DESC_F_WRITE) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
-   dma_unmap_page(vring_dma_dev(vq),
+   ret = dma_unmap_page(vring_dma_dev(vq),
   virtio64_to_cpu(vq->vq.vdev, desc->addr),
   virtio32_to_cpu(vq->vq.vdev, desc->len),
   (flags & VRING_DESC_F_WRITE) ?
   DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
+   return ret;
 }
 
 static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
@@ -609,6 +611,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
break;
if (!inside_split_ring(vq, i))
break;
+   /*
+* Ignore unmapping errors since
+* we're aborting anyways.
+*/
vring_unmap_one_split(vq, &desc[i]);
i = virtio16_to_cpu(_vq->vdev, desc[i].next);
}
@@ -671,7 +677,10 @@ static int detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
i = head;
 
while (vq->split.vring.desc[i].flags & nextflag) {
-   vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+   int ret;
+   ret = vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+   if (ret)
+   return ret;
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
if (!inside_split_ring(vq, i))
return -EIO;
@@ -878,6 +887,7 @@ static void *virtqueue_detach_unused_buf_split(struct 
virtqueue *_vq)
continue;
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
+   /* Ignore unmap errors because there is nothing to abort */
detach_buf_split(vq, i, NULL);
/* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
-- 
2.25.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Virtio hardening for TDX

2021-06-02 Thread Andi Kleen
[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host
anymore. The host is allowed to DOS of course, but it is not allowed
to read or write any guest memory not explicitely shared with it.

This has implication for virtio. Traditionally virtio didn't assume
the other side of the communication channel is malicious, and therefore
didn't do any boundary checks in virtio ring data structures.

This patchkit does hardening for virtio.  In a TDX like model
the only host memory accesses allowed are in the virtio ring, 
as well as the (forced) swiotlb buffer.

This patch kit does various changes to ensure there can be no
access outside these two areas. It is possible for the host
to break the communication, but this should result in a IO
error on the guest, but no memory safety violations.

virtio is quite complicated with many modes. To simplify
the task we enforce that virtio is only in split mode without
indirect descriptors, when running as a TDX guest. We also
enforce use of the DMA API.

Then these code paths are hardened against any corruptions
on the ring.

This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable the arch_has_restricted_memory_access
for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. This is only needed for the "am I running in TDX" part.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Andi Kleen
In protected guest mode we don't trust the host.

This means we need to make sure the host cannot subvert us through
virtio communication. In general it can corrupt our virtio data
and cause a DOS, but it should not be able to access any data
that is not explicitely under IO.

Also boundary checking so that the free list (which is accessible
to the host) cannot point outside the virtio ring. Note it could
still contain loops or similar, but these should only cause an DOS,
not a memory corruption or leak.

When we detect any out of bounds descriptor trigger an IO error.
We also use a WARN() (in case it was a software bug instead of
an attack). This implies that a malicious host can flood
the guest kernel log, but that's only a DOS and acceptable
in the threat model.

This patch only hardens the initial consumption of the free list,
the freeing comes later.

Any of these errors can cause DMA memory leaks, but there is nothing
we can do about that and that would be just a DOS.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 46 
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f35629fa47b1..d37ff5a0ff58 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct 
virtqueue *_vq,
return desc;
 }
 
+/* assumes no indirect mode */
+static inline bool inside_split_ring(struct vring_virtqueue *vq,
+unsigned index)
+{
+   return !WARN(index >= vq->split.vring.num,
+   "desc index %u out of bounds (%u)\n",
+   index, vq->split.vring.num);
+}
+
 static inline int virtqueue_add_split(struct virtqueue *_vq,
  struct scatterlist *sgs[],
  unsigned int total_sg,
@@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
unsigned int i, n, avail, descs_used, prev, err_idx;
int head;
bool indirect;
+   int io_err;
 
START_USE(vq);
 
@@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
 
for (n = 0; n < out_sgs; n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_TO_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
}
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_FROM_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
dma_addr_t addr = vring_map_single(
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
+   io_err = -ENOMEM;
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* We're using some buffers from the free list. */
vq->vq.num_free -= descs_used;
 
+   io_err = -EIO;
+   if (!inside_split_ring(vq, head))
+   goto unmap_release;
+
/* Update free pointer */
if (indirect)
vq->free_head = virtio16_to_cpu(_vq->vdev,
@@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
+
+   if (avail >= vq->split.vring.num)
+   goto unmap_release;
+
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
 
/* Descriptors and available array need to be set before we expose the
@@ -576,6 +607,8 @@ static inline int virtqueue_add_split(struct virtque

[PATCH v1 6/8] dma: Add return value to dma_unmap_page

2021-06-02 Thread Andi Kleen
In some situations when we know swiotlb is forced and we have
to deal with untrusted hosts, it's useful to know if a mapping
was in the swiotlb or not. This allows us to abort any IO
operation that would access memory outside the swiotlb.

Otherwise it might be possible for a malicious host to inject
any guest page in a read operation. While it couldn't directly
access the results of the read() inside the guest, there
might scenarios where data is echoed back with a write(),
and that would then leak guest memory.

Add a return value to dma_unmap_single/page. Most users
of course will ignore it. The return value is set to EIO
if we're in forced swiotlb mode and the buffer is not inside
the swiotlb buffer. Otherwise it's always 0.

A new callback is used to avoid changing all the IOMMU drivers.

Signed-off-by: Andi Kleen 
---
 drivers/iommu/dma-iommu.c   | 17 +++--
 include/linux/dma-map-ops.h |  3 +++
 include/linux/dma-mapping.h |  7 ---
 kernel/dma/mapping.c|  6 +-
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7ef13198721b..babe46f2ae3a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, 
dma_addr_t dma_addr,
iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
-static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
+static int __iommu_dma_unmap_swiotlb_check(struct device *dev,
+   dma_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
unsigned long attrs)
 {
@@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
phys = iommu_iova_to_phys(domain, dma_addr);
if (WARN_ON(!phys))
-   return;
+   return -EIO;
 
__iommu_dma_unmap(dev, dma_addr, size);
 
if (unlikely(is_swiotlb_buffer(phys, size)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
+   else if (swiotlb_force == SWIOTLB_FORCE)
+   return -EIO;
+   return 0;
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, 
struct page *page,
return dma_handle;
 }
 
-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t 
dma_handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
-   __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
+   return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir,
+  attrs);
 }
 
 /*
@@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, 
struct scatterlist *s
int i;
 
for_each_sg(sg, s, nents, i)
-   __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s),
+   __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s),
sg_dma_len(s), dir, attrs);
 }
 
@@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.mmap   = iommu_dma_mmap,
.get_sgtable= iommu_dma_get_sgtable,
.map_page   = iommu_dma_map_page,
-   .unmap_page = iommu_dma_unmap_page,
+   .unmap_page_check   = iommu_dma_unmap_page_check,
.map_sg = iommu_dma_map_sg,
.unmap_sg   = iommu_dma_unmap_sg,
.sync_single_for_cpu= iommu_dma_sync_single_for_cpu,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..0ed0190f7949 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,9 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
unsigned long (*get_merge_boundary)(struct device *dev);
+   int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle,
+   size_t size, enum dma_data_direction dir,
+   unsigned long attrs);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 37fbd12bd4ab..25b8382f8601 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, 
dma_addr_t dma_addr)
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
-void dma_unmap_page_attrs(struct device *dev,

Re: [PATCH 08/62] x86/boot/compressed/64: Add IDT Infrastructure

2020-02-14 Thread Andi Kleen
Joerg Roedel  writes:
> + addq$8, %rsp
> +
> + /*
> +  * Make sure we return to __KERNEL_CS - the CS selector on
> +  * the IRET frame might still be from an old BIOS GDT
> +  */
> + movq$__KERNEL_CS, 8(%rsp)

This doesn't make sense. Either it's running on the correct CS
before the exception or not. Likely there's some other problem
here that you patched over with this hack.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/2] Remove 32-bit Xen PV guest support

2019-07-15 Thread Andi Kleen
Juergen Gross  writes:

> The long term plan has been to replace Xen PV guests by PVH. The first
> victim of that plan are now 32-bit PV guests, as those are used only
> rather seldom these days. Xen on x86 requires 64-bit support and with
> Grub2 now supporting PVH officially since version 2.04 there is no
> need to keep 32-bit PV guest support alive in the Linux kernel.
> Additionally Meltdown mitigation is not available in the kernel running
> as 32-bit PV guest, so dropping this mode makes sense from security
> point of view, too.

Normally we have a deprecation period for feature removals like this.
You would make the kernel print a warning for some releases, and when
no user complains you can then remove. If a user complains you can't.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] x86, kbuild: revert macrofying inline assembly code

2018-12-21 Thread Andi Kleen
Masahiro Yamada  writes:

> Revert the following 9 commits:

FWIW the -Wa additional also broke LTO builds because it doesn't really
support -Wa for individual files.

So I'm glad they got reverted.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 0/4] Enable LBR for the guest

2017-09-26 Thread Andi Kleen
> On the other side, it seems that the (guest) kernel driver also works
> without
> the above being supported, should we change it to report error and stop
> using the PMU features when the check of the above two fails (at
> intel_pmu_init())?

You could add the extra check for the LBR code yes, although it already checks
if the LBR MSRs work, so in practice it's likely already covered.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-26 Thread Andi Kleen
> 1) vCPU context switching and guest side task switching are not identical.
> That is, when the vCPU is scheduled out, the guest task on the vCPU may not

guest task lifetime has nothing to do with this. It's completely independent
of what you do here on the VCPU level.

> run out its time slice yet, so the task will continue to run when the vCPU
> is
> scheduled in by the host (lbr wasn't save by the guest task when the vCPU is
> scheduled out in this case).
> 
> It is possible to have the vCPU which runs the guest task (in use of lbr)
> scheduled
> out, followed by a new host task being scheduled in on the pCPU to run.
> It is not guaranteed that the new host task does not use the LBR feature on
> the
> pCPU.

Sure it may use the LBR, and the normal perf context switch
will switch it and everything works fine.

It's like any other per-task LBR user.

> 
> 2) Sometimes, people may want this usage: "perf record -b
> ./qemu-system-x86_64 ...",
> which will need lbr to be used in KVM as well.

In this obscure case you can disable LBR support for the guest.
The common case is far more important.

It sounds like you didn't do any performance measurements.
I expect the performance of your current solution to be terrible.

e.g. a normal perf PMI does at least 1 MSR reads and 4+ MSR writes
for a single counter. With multiple counters it gets worse.

For each of those you'll need to exit. Adding something
to the entry/exit list is similar to the cost of doing 
explicit RD/WRMSRs.

On Skylake we have 32*3=96 MSRs for the LBRs.

So with the 5 exits and entries, you're essentually doing
5*2*96=18432 extra MSR accesses for each PMI.

MSR access is 100+ cycles at least, for writes it is far more
expensive.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 0/4] Enable LBR for the guest

2017-09-25 Thread Andi Kleen
On Mon, Sep 25, 2017 at 12:44:52PM +0800, Wei Wang wrote:
> This patch series enables the Last Branch Recording feature for the
> guest. Instead of trapping each LBR stack MSR access, the MSRs are
> passthroughed to the guest. Those MSRs are switched (i.e. load and
> saved) on VMExit and VMEntry.
> 
> Test:
> Try "perf record -b ./test_program" on guest.

I don't see where you expose the PERF capabilities MSR? 

That's normally needed for LBR too to report the version
number.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-25 Thread Andi Kleen
> +static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
> +{
> + int i;
> + struct perf_lbr_stack lbr_stack;
> +
> + perf_get_lbr_stack(&lbr_stack);
> +
> + add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
> +
> + for (i = 0; i < lbr_stack.lbr_nr; i++) {
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
> + if (lbr_stack.lbr_info)
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
> +   0);
> + }

That will be really expensive and add a lot of overhead to every entry/exit.
perf can already context switch the LBRs on task context switch. With that
you can just switch LBR_SELECT, which is *much* cheaper because there
are far less context switches than exit/entries.

It implies that when KVM is running it needs to prevent perf from enabling
LBRs in the context of KVM, but that should be straight forward.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH -tip v3 13/23] x86/trap: Use NOKPROBE_SYMBOL macro in trap.c

2013-11-25 Thread Andi Kleen
On Wed, Nov 20, 2013 at 04:22:21AM +, Masami Hiramatsu wrote:
> Use NOKPROBE_SYMBOL macro to protect functions from kprobes
> instead of __kprobes annotation in trap.c.
> This also applies __always_inline annotation for some cases,
> because NOKPROBE_SYMBOL() will inhibit inlining by referring
> the symbol address.

NOKPROBE_SYMBOL seems to add a reference from some variable to the function?

With LTO we can optimize away unused functions, but not
when there are references to the symbol. So this would likely
prevent optimizations with LTO.

I prefer a simpler "__kprobe" annotation.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen
On Sat, Jun 01, 2013 at 01:28:00PM -0700, Jeremy Fitzhardinge wrote:
> On 06/01/2013 01:14 PM, Andi Kleen wrote:
> > FWIW I use the paravirt spinlock ops for adding lock elision
> > to the spinlocks.
> 
> Does lock elision still use the ticketlock algorithm/structure, or are
> they different?  If they're still basically ticketlocks, then it seems
> to me that they're complimentary - hle handles the fastpath, and pv the
> slowpath.

It uses the ticketlock algorithm/structure, but:

- it needs to know that the lock is free with an own operation
- it has an additional field for strong adaptation state
(but that field is independent of the low level lock implementation,
so can be used with any kind of lock)

So currently it inlines the ticket lock code into its own.

Doing pv on the slow path would be possible, but would need
some additional (minor) hooks I think.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen

FWIW I use the paravirt spinlock ops for adding lock elision
to the spinlocks.

This needs to be done at the top level (so the level you're removing)

However I don't like the pv mechanism very much and would 
be fine with using an static key hook in the main path
like I do for all the other lock types.

It also uses interrupt ops patching, for that it would 
be still needed though.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3] x86: use a read-only IDT alias on all CPUs

2013-04-11 Thread Andi Kleen
Ingo Molnar  writes:
>
> This looks very nice to me now. Peter, any objections?

it seems pointless without randomized main kernel text location, because
the IDT will be still at a known per kernel fixed writable location in
the direct mapping.

As long as such randomization is not there it just wastes a TLB entry.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages

2012-06-26 Thread Andi Kleen
> How is the compiler meant to optimise away "cond" if it's a function
> call?

Inlines can be optimized away.  These tests are usually inlines.

> What did I miss? If nothing, then I will revert this particular change
> and Rafael will need to be sure his patch is not using VM_BUG_ON to call
> a function with side-effects.

Do you have an example where the code is actually different,
or are you just speculating?

FWIW for my config both generates the same code:

size vmlinux-andi-vmbug vmlinux-vmbug-nothing 
   textdata bss dec hex filename
118097041457352 1159168 14426224 dc2070 vmlinux-andi-vmbug
118097041457352 1159168 14426224 dc2070 vmlinux-vmbug-nothing

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages

2012-06-26 Thread Andi Kleen
> I'm not fully sure what putback_balloon_page does, but if it just tests
> a bit (without non variable test_bit) it should be ok.
 ^

should be "without variable ..."

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages

2012-06-26 Thread Andi Kleen
> 
> What shocked me actually is that VM_BUG_ON code is executed on
> !CONFIG_DEBUG_VM builds and has been since 2.6.36 due to commit [4e60c86bd:
> gcc-4.6: mm: fix unused but set warnings]. I thought the whole point of
> VM_BUG_ON was to avoid expensive and usually unnecessary checks. Andi,
> was this deliberate?

The idea was that the compiler optimizes it away anyways.

I'm not fully sure what putback_balloon_page does, but if it just tests
a bit (without non variable test_bit) it should be ok.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-03-30 Thread Andi Kleen
On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote:
> On Sat, 31 Mar 2012, Andi Kleen wrote:
> 
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> > 
> > On a large system under high contention sleeping can perform surprisingly
> > well. pthread mutexes have a tendency to beat kernel spinlocks there.
> > So avoiding sleeping locks completely (that is what pv locks are
> > essentially) is not necessarily that good.
> 
> Care to back that up with numbers and proper trace evidence instead of
> handwaving?

E.g. my plumbers presentations on lock and mm scalability from last year has 
some 
graphs that show this very clearly, plus some additional data on the mutexes. 
This compares to the glibc futex locks, which perform much better than the 
kernel 
mutex locks on larger systems under higher contention

Given your tone I will not supply an URL. I'm sure you can find it if you
need it.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-03-30 Thread Andi Kleen
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention sleeping can perform surprisingly
well. pthread mutexes have a tendency to beat kernel spinlocks there.
So avoiding sleeping locks completely (that is what pv locks are
essentially) is not necessarily that good.

Your proposal is probably only a good idea on low contention
and relatively small systems.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: netlink versus pid namespaces

2010-10-15 Thread Andi Kleen
On Fri, Oct 15, 2010 at 07:19:03PM +0400, Alexey Kuznetsov wrote:
> Hello!
> 
> > netlink uses pids (or really tids I hope?) to address sockets
> > associated with processes.
> 
> Not really. It uses port number which is called "pid" occasionally. Bad name.
> Autobind function simply selects tgid of calling process as the first guess.

Thanks for the clarification, Alexey. I guess I should have read more
code :/

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


netlink versus pid namespaces

2010-10-15 Thread Andi Kleen
Hi,

I have been trying to figure out how pid namespaces interact
with netlink.

netlink uses pids (or really tids I hope?) to address sockets
associated with processes.

The netlink code passes around pids without caring much about 
the pid namespace.  It does pass around some information about the 
network namespace, but that doesn't help here because the pid
namespace is not necessarily related to the net namespace.

When the netlink consumer runs in kernel (like rtnetlink) and
happens to run in the same process context while receiving
and processing the data it should do the right thing because
it has the same pid namespace.

If it runs in some other process that is not guaranteed and
it may actually send the reply back to the wrong pid.

When a process receives netlink in user space and it isn't
in the same pid space as the sender it is unlikely that
the reply gets back.

Anything I'm missing here? 

Does netlink need to be extended? 
Or perhaps forbid passing netlink between name spaces?

Thanks,
-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
On Wed, Nov 04, 2009 at 03:17:36PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote:
> > On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
> > > > > Fine?
> > > > 
> > > > I cannot say -- are there paths that could drop the device beforehand?
> > > 
> > > Do you mean drop the mm reference?
> > 
> > No the reference to the device, which owns the mm for you.
> 
> The device is created when file is open and destroyed
> when file is closed. So I think the fs code handles the
> reference counting for me: it won't call file cleanup
> callback while some userspace process has the file open.
> Right?

Yes.

But the semantics when someone inherits such a fd through exec
or through file descriptor passing would be surely "interesting"
You would still do IO on the old VM.

I guess it would be a good way to confuse memory accounting schemes 
or administrators @)

It would be all saner if this was all a single atomic step.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
> > > Fine?
> > 
> > I cannot say -- are there paths that could drop the device beforehand?
> 
> Do you mean drop the mm reference?

No the reference to the device, which owns the mm for you.

> 
> > (as in do you hold a reference to it?)
> 
> By design I think I always have a reference to mm before I use it.
> 
> This works like this:
> ioctl SET_OWNER - calls get_task_mm, I think this gets a reference to mm
> ioctl SET_BACKEND - checks that SET_OWNER was run, starts virtqueue
> ioctl RESET_OWNER - stops virtqueues, drops the reference to mm
> file close - stops virtqueues, if we still have it then drops mm
> 
> This is why I think I can call use_mm/unuse_mm while virtqueue is running,
> safely.
> Makes sense?

Do you protect against another thread doing RESET_OWNER in parallel while
RESET_OWNER runs?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
> Fine?

I cannot say -- are there paths that could drop the device beforehand?
(as in do you hold a reference to it?)

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
"Michael S. Tsirkin"  writes:

Haven't really read the whole thing, just noticed something at a glance.

> +/* Expects to be always run from workqueue - which acts as
> + * read-size critical section for our kind of RCU. */
> +static void handle_tx(struct vhost_net *net)
> +{
> + struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> + unsigned head, out, in, s;
> + struct msghdr msg = {
> + .msg_name = NULL,
> + .msg_namelen = 0,
> + .msg_control = NULL,
> + .msg_controllen = 0,
> + .msg_iov = vq->iov,
> + .msg_flags = MSG_DONTWAIT,
> + };
> + size_t len, total_len = 0;
> + int err, wmem;
> + size_t hdr_size;
> + struct socket *sock = rcu_dereference(vq->private_data);
> + if (!sock)
> + return;
> +
> + wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> + if (wmem >= sock->sk->sk_sndbuf)
> + return;
> +
> + use_mm(net->dev.mm);

I haven't gone over all this code in detail, but that isolated reference count
use looks suspicious. What prevents the mm from going away before
you increment, if it's not the current one?

-Andi 

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 00/54] [Announce] Microsoft Hyper-V drivers for Linux

2009-07-20 Thread Andi Kleen
Greg KH  writes:
>
> It's the "normal" way of cleaning up code.  Start with a known-working
> piece of stuff, and incrementally clean it up. 

I have to agree with David. This is essentially unreviewable
in the original state. It would be good if you could
do all the obvious cleanups first and then post a consolidated
series.

Most "obvious" cleanups should not change the binary, so
you can easily check they are correct by comparing the assembler
code.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Andi Kleen
Anthony Liguori <[EMAIL PROTECTED]> writes:
>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices.  We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack.  A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic.  This will perform just as well as doing passthrough
> (at least) and avoid all that ugliness of dealing with SR-IOV in the
> guest.

But you shift a lot of ugliness into the host network stack again.
Not sure that is a good trade off.

Also it would always require context switches and I believe one
of the reasons for the PV/VF model is very low latency IO and having
heavyweight switches to the host and back would be against that.

-Andi

-- 
[EMAIL PROTECTED]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] KVM PV Guest: Implement paravirtualized DMA

2008-04-29 Thread Andi Kleen
Amit Shah <[EMAIL PROTECTED]> writes:
> +
> +static struct page *page;
> +static unsigned long page_gfn;
Bad variable names

> +
> +const struct dma_mapping_ops *orig_dma_ops;

I suspect real dma ops stacking will need some further thought than
your simple hacks
> +
> + match = find_matching_pt_dev(&pt_devs_head, &pv_pci_info);
> + if (match) {
> + r = match->is_pv;
> + goto out;
> + }
> +
> + memcpy(page_address(page), &pv_pci_info, sizeof(pv_pci_info));

Note that on 32bit page_address() might be not mapped.


> +
> + npages = get_order(size) + 1;

Are you sure that's correct? It looks quite bogus. order is a 2 logarithm,
normally npages = 1 << order

if you want  npages from order the correct  need 1 << order

Haven't read further, but to be honest the code doesn't seem to be anywhere
near merging quality.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: PV DMA for PCI passthrough devices for KVM

2008-04-29 Thread Andi Kleen
Amit Shah <[EMAIL PROTECTED]> writes:

> This patchset implements PVDMA for handling DMA requests from
> devices assigned to the guest from the host machine.

You forgot to post a high level design overview of how this works,
what it is good for, what are the design trade offs etc.? 

Include that in the first patch.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 DMA: Handle devices assigned to the guest by the host

2008-04-29 Thread Andi Kleen
Amit Shah <[EMAIL PROTECTED]> writes:
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 388b113..678cafb 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   memset(memory, 0, size);
>   if (!mmu) {
>   *dma_handle = bus;
> + if (unlikely(dma_ops->is_pv_device) &&
> + unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {

First double unlikely in a condition is useless. Just drop them.

And then ->is_xyz() in a generic vops interface is about as ugly
and non generic as you can get. dma_alloc_coherent is not performance
critical, so you should rather change the interface that ->alloc_coherent
is always called and the other handlers handle the !mmu case correctly.
In fact they need that already I guess (e.g. on DMAR there is not really
a nommu case) 

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] dm-band: The I/O bandwidth controller: Document

2008-01-23 Thread Andi Kleen
Ryo Tsuruta <[EMAIL PROTECTED]> writes:

> Here is the document of dm-band.

Could you please address in the document how the intended use
cases/feature set etc. differs from CFQ2 io priorities?

Thanks,

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/10] add missing parameter for lookup_address

2008-01-18 Thread Andi Kleen
On Fri, Jan 18, 2008 at 12:26:13PM -0800, Chris Wright wrote:
> * Glauber de Oliveira Costa ([EMAIL PROTECTED]) wrote:
> > lookup_address() receives two parameters, but efi_64.c call
> > is passing only one. It's actually preventing the tree from compiling
> > 
> > Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> 
> Good catch, I know I don't test with CONFIG_EFI=y

Ah that came probably from the CPA patchset which added the parameter.
Sorry for that.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/8] unify paravirt parts of system.h

2007-12-17 Thread Andi Kleen
> It probably is safe to remove... but we currently support '2.8.95
> kernel loads/resumes 2.6.24 image'... which would break if 2.8 uses
> cr8.

No it won't. 2.8 would just restore some random useless value.
If 2.8 wants to use CR8 it would have to re-initialize it

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 19/19] unify set_tss_desc

2007-12-13 Thread Andi Kleen
> +static inline void pack_tss(tss_desc *tss, unsigned long addr,
> +unsigned size, unsigned entry)
> +{
> +#ifdef CONFIG_X86_64
> + set_tssldt_descriptor(tss,
> +  addr, entry, size);
> +#else
> + pack_descriptor(tss, (unsigned long)addr,
> + size,
> + 0x80 | entry, 0);

It would be better if you just unified that first instead of ifdef'ing.
I guess i386 could get a set_tssldt_descriptor() like wrapper
around pack_descriptor(). 

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/19] unify desc_struct

2007-12-06 Thread Andi Kleen

> +/*
> + * FIXME: Acessing the desc_struct through its fields is more elegant,
> + * and should be the one valid thing to do. However, a lot of open code
> + * still touches the a and b acessors, and doing this allow us to do it
> + * incrementally. We keep the signature as a struct, rather than an union,
> + * so we can get rid of it transparently in the future -- glommer
> + */
> +#define raw_desc_struct struct { unsigned int a, b; }
> +#define detailed_desc_struct   \
> +  struct {   \
> + u16 limit0; \
> + u16 base0;  \
> + unsigned base1 : 8, type : 4, s : 1, dpl : 2, p : 1;\
> + unsigned limit : 4, avl : 1, l : 1, d : 1, g : 1, base2 :8;\
> +  }

The standard clean way to do this is with a anonymous union.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/8] unify paravirt parts of system.h

2007-12-04 Thread Andi Kleen
On Tue, Dec 04, 2007 at 09:18:33PM +0200, Avi Kivity wrote:
> Glauber de Oliveira Costa wrote:
>> This patch moves the i386 control registers manipulation functions,
>> wbinvd, and clts functions to system.h. They are essentially the same
>> as in x86_64, except for the cr8 register, which we add.
>>
>> +
>> +static inline unsigned long native_read_cr8(void)
>> +{
>> +unsigned long cr8;
>> +asm volatile("mov %%cr8,%0" : "=r" (cr8), "=m" (__force_order));
>> +return cr8;
>> +}
>> +
>>   
>
> There is no cr8 register on i386.  This had better be protected by an 
> #ifdef.
>
> (you're likely not getting an error since it's a static inline, so the asm 
> is never emitted)

Linux never uses that register. The only user is suspend save/restore, 
but that' bogus because it wasn't ever initialized by Linux in the first
place. It could be probably all safely removed.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/10] allow sched clock to be overridden by paravirt

2007-12-04 Thread Andi Kleen
On Tue, Dec 04, 2007 at 09:09:57AM -0200, Glauber de Oliveira Costa wrote:
> This patch turns the sched_clock into native_sched_clock.
> sched clock becomes a weak symbol, which can then give its
> place to a paravirt definition.

I ended up regretting that solution on i386, it would be better
to solve that more clearly with ifdefs.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24

2007-11-13 Thread Andi Kleen

> > Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid
> > upsetting Xen

This patch was the wrong fix anyways as discussed; David Chinner
finally came up with a better one.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 24/24] make vsmp a paravirt client

2007-11-13 Thread Andi Kleen

> If vsmp is selected, PARAVIRT will be too, and the interrupt code will
> be patched.
> the vsmp option triggers a select statement.
> 
> the ifdef only exists because, as I said, the code itself will be always
> compiled in, to avoid an ifdef in setup_64.c. So it's just a taking it
> from here, putting it there issue.

Peeking around needlessly in pci config space at each boot just to avoid an 
ifdef is not an good idea.

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 24/24] make vsmp a paravirt client

2007-11-13 Thread Andi Kleen

> The vsmp_64.c file is now compiled unconditionally, according to which
> me and kiran agreed to. The detection code is always run, but will only
> trigger when a suitable box is found. Accordingly, the paravirt structs
> are only touched when PARAVIRT is on. Otherwise, we don't even have the
> symbols.

That seems dumb. What good is it if it doesn't patch the interrupt code?

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 24/24] make vsmp a paravirt client

2007-11-12 Thread Andi Kleen
On Fri, Nov 09, 2007 at 04:43:05PM -0200, Glauber de Oliveira Costa wrote:
> This patch makes vsmp a paravirt client. It now uses the whole
> infrastructure provided by pvops. When we detect we're running
> a vsmp box, we change the irq-related paravirt operations (and so,
> it have to happen quite early), and the patching function

The PARAVIRT ifdefs look wrong. Surely you don't need them at all
because it cannot work at all without paravirt.

Also you got some white space damage.

And the "EM64T based comment" is wrong because there are AMD based 
vSMPs too.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24

2007-10-15 Thread Andi Kleen

> Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid
> upsetting Xen

This should be probably done unconditionally because it's a undefined
dangerous condition everywhere.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.

2007-09-18 Thread Andi Kleen
On Tuesday 18 September 2007 23:34, Rusty Russell wrote:
> On Tue, 2007-09-18 at 12:57 +0200, Andi Kleen wrote:
> > On Friday 14 September 2007 07:21, Rusty Russell wrote:
> > > It's pretty widely used,
> >
> > Is it? By whom?
>
> Hi Andi,
>
>   Please stop asking for facts!  It's was easy claim to make, and hard to
> disprove 8)
>
> > > and the distributions will turn it on.
> >
> > That's no reason to make it default y. Please undo that. default y
> > is near always a bad idea.
>
> How about a "select" based on Xen, lguest or VMI?  There's no other
> reason to enable it, after all.

I did an patch to do that recently because  the current setup
is indeed unobvious.

But I had to drop it again because 
it ended up with Kconfig warnings. about undefined symbols
on x86-64. The problem is that lguest
is visible in Kconfig for all architectures and it warns
if you select something that doesn't exist on all architectures.

The only workaround would have been to define PARAVIRT
for all architectures, which I considered too ugly.

I think Sam stated recently he wanted to remove that warning
but it needed some infrastructure work.

> > Also I would still consider it experimental.
>
> After 9 months in mainline and three kernel versions, 

Well it changed a lot each release.

> I'd hope not. 
> It's been pretty damn stable (ok, you broke it once, but maybe that's
> because you consider it experimental).

Is there a significant user base? 

At least the Xen port seems to have specific requirements
and essentially only work on xen-unstable (?) [or at least
some very new Xen version] which probably very few
people use.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.

2007-09-18 Thread Andi Kleen
 
> Actually if I understand the functionality of paravirt correctly that is
> not correct. I believe that will turn on the paravirt bits which allow
> it to run under things such as VMI or Xen.

You don't understand it correctly then.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.

2007-09-18 Thread Andi Kleen


>   Why is making something default y a bad idea? 
> Those most likely to care can turn it off. Is there a harmful effect
> from leaving it on if its not being used?

Running yes "" | make oldconfig to upgrade kernel configs is standard practice
and you definitely don't want to have all kinds of random new unnecessary 
features
be turned on then.

Besides paravirt by itself is pretty useless; you need typically quite
complex other options set to do any meaningfull virtualization.

The only reason to use default y is in options that are not user visible
and have a reasonable default or things that cause direct boot failures
when upgrading old configurations. That all doesn't apply here.

Besides it's bad taste and taste is very important.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] Time to make CONFIG_PARAVIRT non-experimental.

2007-09-18 Thread Andi Kleen
On Friday 14 September 2007 07:21, Rusty Russell wrote:
> It's pretty widely used,

Is it? By whom?

> and the distributions will turn it on. 

That's no reason to make it default y. Please undo that. default y
is near always a bad idea.

Also I would still consider it experimental.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Fix preemptible lazy mode bug

2007-09-05 Thread Andi Kleen
>I agree.  The patch is a nop.  I just got overly paranoid.  The whole
>thing is just very prone to bugs.
So do we need a patch for .23 or not?

>; it does
> seem perfectly acceptable though for the mm code to use kmap or vmap
> (not kmap_atomic) internally somewhere in the pagetable code.

i386 does it all the time for highmem pagetables in fact.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


  1   2   3   >