Re: [RFC PATCH v2 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
On Mon, 2016-01-04 at 14:07 -0700, Alex Williamson wrote:
> On Thu, 2015-12-31 at 16:50 +0800, Yongji Xie wrote:
> > The current vfio-pci implementation disallows mmap of the MSI-X
> > table in case the user gets to touch it directly.
> >
> > However, the EEH mechanism can ensure that a given PCI device
> > can only shoot the MSIs assigned to its PE. So we think it's
> > safe to expose the MSI-X table to userspace, because the
> > exposed MSI-X table can't be used to do harm to other memory
> > space.
> >
> > And with the MSI-X table mmapped, some performance issues caused
> > by PCI adapters having critical registers in the same page as
> > the MSI-X table can also be resolved.
> >
> > So this patch adds a Kconfig option, VFIO_PCI_MMAP_MSIX,
> > to support mmapping the MSI-X table.
> >
> > Signed-off-by: Yongji Xie
> > ---
> >  drivers/vfio/pci/Kconfig    |    4 ++++
> >  drivers/vfio/pci/vfio_pci.c |    6 ++++--
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > index 02912f1..67b0a2c 100644
> > --- a/drivers/vfio/pci/Kconfig
> > +++ b/drivers/vfio/pci/Kconfig
> > @@ -23,6 +23,10 @@ config VFIO_PCI_MMAP
> >  	depends on VFIO_PCI
> >  	def_bool y if !S390
> >
> > +config VFIO_PCI_MMAP_MSIX
> > +	depends on VFIO_PCI_MMAP
> > +	def_bool y if EEH
>
> Does CONFIG_EEH necessarily mean that EEH is enabled? Could the
> system not support EEH, or could EEH be disabled via kernel command
> line options?

EEH is definitely the wrong thing to test here anyway. What needs to be tested is that the PCI host bridge supports filtering of MSIs, so ideally this should be some kind of host bridge attribute set by the architecture backend.

This can happen with or without CONFIG_EEH, and you are right, CONFIG_EEH can be enabled and the machine not support it.

Any IODA bridge will support this.

Cheers,
Ben.
> > +
> >  config VFIO_PCI_INTX
> >  	depends on VFIO_PCI
> >  	def_bool y if !S390
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 09b3805..d536985 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -555,7 +555,8 @@ static long vfio_pci_ioctl(void *device_data,
> >  			    IORESOURCE_MEM && (info.size >= PAGE_SIZE ||
> >  			    pci_resource_page_aligned)) {
> >  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> > -				if (info.index == vdev->msix_bar) {
> > +				if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> > +				    info.index == vdev->msix_bar) {
> >  					ret = msix_sparse_mmap_cap(vdev, );
> >  					if (ret)
> >  						return ret;
> > @@ -967,7 +968,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
> >  		return -EINVAL;
> >
> > -	if (index == vdev->msix_bar) {
> > +	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> > +	    index == vdev->msix_bar) {
> >  		/*
> >  		 * Disallow mmaps overlapping the MSI-X table; users don't
> >  		 * get to touch this directly.  We could find somewhere
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
On Thu, 2015-12-17 at 14:41 -0700, Alex Williamson wrote:
> > So I think it is safe to mmap/passthrough the MSI-X table on the PPC64
> > platform. And I'm not sure whether other architectures can ensure
> > these two points.
>
> There is another consideration, which is the API exposed to the user.
> vfio currently enforces interrupt setup through ioctls by making the
> PCI mechanisms for interrupt programming inaccessible through the
> device regions. Ignoring that you are only focused on PPC64 with QEMU,
> does it make sense for the vfio API to allow a user to manipulate
> interrupt programming in a way that not only will not work, but in a
> way that we expect to fail and require error isolation to recover from?
> I can't say I'm fully convinced that a footnote in the documentation
> is sufficient for that. Thanks,

Well, one could argue that the "isolation" provided by qemu here is fairly weak anyway ;-) I mean... how do you know the device doesn't have a backdoor path into that table via some other MMIO registers anyway?

In any case, the HW isolation on platforms like pseries means that the worst the guest can do is shoot itself in the foot. Big deal.

On the other hand, not bothering with intercepting the table has benefits, such as reducing the memory region clutter, but also removing all the massive performance problems we see because adapters have critical registers in the same 64K page as the MSI-X table.

So I don't think there is any question here that we *need* that functionality on power. The filtering of the table by qemu doesn't provide any practical benefit, it just gets in the way.

Cheers,
Ben.
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Thu, 2015-11-19 at 23:38 +0000, David Woodhouse wrote:
> I understand that POWER and other platforms don't currently have a
> clean way to indicate that certain devices don't have translation. And I
> understand that we may end up with a *quirk* which ensures that the DMA
> API does the right thing (i.e. nothing) in certain cases.
>
> But we should *NOT* be involving the virtio device drivers in that
> quirk, in any way. And putting a feature bit in the virtio device
> itself doesn't seem at all sane either.
>
> Bear in mind that qemu-system-x86_64 currently has the *same* problem
> with assigned physical devices. It's claiming they're translated, and
> they're not.

It's not that clear but yeah... as I mentioned, I can't find a way to do that quirk that won't break when we want to actually use the iommu...

Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 11:27 +0100, Joerg Roedel wrote:
> You have the same problem when real PCIe devices appear that speak
> virtio. I think the only real (still not very nice) solution is to add a
> quirk to powerpc platform code that sets noop dma-ops for the existing
> virtio vendor/device-ids and add a DT property to opt out of that quirk.
>
> New vendor/device-ids (as for real devices) would just not be covered by
> the quirk and existing emulated devices continue to work.

Why would real devices use new vendor/device IDs? Also, there are other cases, such as using virtio between two partitions, which we could do under PowerVM... that would require proper iommu usage with existing IDs.

> The absence of the property just means that the quirk is in place and
> the system assumes no translation for virtio devices.

The only way that works forward for me (and possibly sparc & others, what about ARM?) is if we *change* something in virtio qemu at the same time as we add some kind of property. For example the ProgIf field or revision ID field. That way I can key on that change.

It's still tricky because I would have to somehow tell my various firmwares (SLOF, OpenBIOS, OPAL, ...) so they can create the appropriate property. It's still hacky, but it would be workable.

Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 14:43 +0200, Michael S. Tsirkin wrote:
> But not virtio-pci I think - that's broken for that usecase since we use
> weaker barriers than required for real IO, as these have measurable
> overhead. We could have a feature "is a real PCI device",
> that's completely reasonable.

Do we use weaker barriers on the Linux driver side? I didn't think so...

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
> > Does that work on powerpc on existing kernels?
>
> Anyway, here's another crazy idea: make the quirk assume that the
> IOMMU is bypassed if and only if the weak barriers bit is set on
> systems that are missing the new DT binding.

"New DT bindings" doesn't mean much... how do we change DT bindings on existing machines with a FW in flash? What about partition <-> partition virtio such as what we could do on PAPR systems? That would have the weak barrier bit.

Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> We could do it the other way around: on powerpc, if a PCI device is in
> that range and doesn't have the "bypass" property at all, then it's
> assumed to bypass the IOMMU. This means that everything that
> currently works continues working. If someone builds a physical
> virtio device or uses another system in PCIe target mode speaking
> virtio, then it won't work until they upgrade their firmware to set
> bypass=0. Meanwhile everyone using hypothetical new QEMU also gets
> bypass=0 and no ambiguity.
>
> vfio will presumably notice the bypass and correctly refuse to map any
> current virtio devices.
>
> Would that work?

That would be extremely strange from a platform perspective. Any device in that vendor/device range would bypass the iommu unless some new property "actually-works-like-a-real-pci-device" happens to exist in the device-tree, which we would then need to define somewhere and handle across at least 3 different platforms which get their device-tree from wildly different places.

Also, if tomorrow I create a PCI device that implements virtio-net and put it in a machine running IBM proprietary firmware (or Apple's or Sun's), it won't have that property...

This is not hypothetical. People are using virtio to do point-to-point communication between machines via PCIe today.

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> > What about partition <-> partition virtio such as what we could do on
> > PAPR systems. That would have the weak barrier bit.
>
> Is it partition <-> partition, bypassing IOMMU?

No.

> I think I'd settle for just something that doesn't regress
> non-experimental setups that actually work today and that allows new
> setups (x86 with fixed QEMU and maybe something more complicated on
> powerpc and/or sparc) to work in all cases.
>
> We could certainly just make powerpc and sparc continue bypassing the
> IOMMU until someone comes up with a way to fix it. I'll send out some
> patches that do that, and maybe that'll help this make progress.

But we haven't found a solution that works. All we have come up with is a quirk that will force bypass on virtio always and will not allow us to operate non-bypassing devices on either of those architectures in the future. I'm not too happy about this.

Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 10:45 +0100, Knut Omang wrote:
> Can something be done by means of PCIe capabilities?
> ATS (Address Translation Services) seems like a natural choice?

Euh, no... ATS is something else completely.

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, 2015-11-10 at 20:46 -0800, Andy Lutomirski wrote:
> Me neither. At least it wouldn't be a regression, but it's still
> crappy.
>
> I think that arm is fine, at least. I was unable to find an arm QEMU
> config that has any problems with my patches.

Ok, give me a few days for my headache & fever to subside and I'll see if I can find something better. David, no idea from your side? :-)

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
> static const struct pci_device_id virtio_pci_id_table[] = {
> 	{ PCI_DEVICE(0x1af4, PCI_ANY_ID) },
> 	{ 0 }
> };
>
> Can we match on that range?

We can, but the problem remains: how do we differentiate an existing device that does bypass vs. a newer one that needs the IOMMU and thus doesn't have the new "bypass" property in the device-tree?

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU. Can we keep this specific to powerpc and
> sparc? On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
>
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

You don't like performance? :-)

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
> The problem here is that in some of the problematic cases the virtio
> driver may not even be loaded. If someone runs an L1 guest with an
> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
> *boom* L1 crashes. (Same if, say, DPDK gets used, I think.)
>
> > The only way out of this while keeping the "platform" stuff would be to
> > also bump some kind of version in the virtio config (or PCI header). I
> > have no other way to differentiate between "this is an old qemu that
> > doesn't do the 'bypass property' yet" and "this is a virtio device
> > that doesn't bypass".
> >
> > Any better idea ?
>
> I'd suggest that, in the absence of the new DT binding, we assume that
> any PCI device with the virtio vendor ID is passthrough on powerpc. I
> can do this in the virtio driver, but if it's in the platform code
> then vfio gets it right too (i.e. fails to load).

The problem is there isn't *a* virtio vendor ID. It's the Red Hat vendor ID, which will be used by more than just virtio, so we need to specifically list the devices.

Additionally, that still means that once we have a virtio device that actually uses the iommu, powerpc will not work, since the "workaround" above will kick in.

The "in absence of the new DT binding" doesn't make that much sense. Those platforms use device-trees defined since the dawn of ages by actual Open Firmware implementations; they either have no iommu representation in there (Macs, the platform code hooks it all up) or have various properties related to the iommu but no concept of "bypass" in there.

We can *add* a new property under some circumstances that indicates a bypass on a per-device basis, however that doesn't completely solve it:

 - As I said above, what does the absence of that property mean?
   An old qemu that does bypass on all virtio, or a new qemu trying to tell you that the virtio device actually does use the iommu (or some other environment that isn't qemu)?

 - On things like Macs, the device-tree is generated by OpenBIOS; it would have to have some added logic to try to figure that out, which means it needs to know *via different means* that some or all virtio devices bypass the iommu.

I thus go back to my original statement: it's a LOT easier to handle if the device itself is self-describing, indicating whether it is set to bypass a host iommu or not. For L1->L2, well, that wouldn't be the first time qemu/VFIO plays tricks with the passed-through device configuration space...

Note that the above can be solved via some kind of compromise: the device self-describes the ability to honor the iommu, along with the property (or ACPI table entry) that indicates whether or not it does. I.e., we could use the revision or ProgIf field of the config space, for example. Or something in the virtio config. If it's an "old" device, we know it always bypasses. If it's a new device, we know it only bypasses if the corresponding property is present.

I still would have to sort out the OpenBIOS case for Macs among others, but it's at least a workable direction.

BTW, don't you have a similar problem on x86, in that today qemu claims in ACPI that everything honors the iommu?

Unless somebody can come up with a better idea...

Cheers,
Ben.
Re: [PATCH v4 0/6] virtio core DMA API conversion
So... I've finally tried to sort that out for powerpc and I can't find a way to make that work that isn't a complete pile of stinking shit.

I'm very tempted to go back to my original idea: virtio itself should indicate its "bypassing ability" via the virtio config space or some other bit (like the ProgIf of the PCI header etc...). I don't see how I can make it work otherwise.

The problem with the statement "it's a platform matter" is that:

 - It's not entirely correct. It's actually a feature of a specific virtio implementation (qemu's) that it bypasses all the platform IOMMU facilities.

 - The platforms (on powerpc there are at least 3 or 4 that have qemu emulation and support some form of PCI) don't have an existing way to convey the information that a device bypasses the IOMMU (if any).

 - Even if we add such a mechanism (new property in the device-tree), we end up with a big turd: because we need to be compatible with older qemus, we essentially need a quirk that will make all virtio devices assume that property is present. That will of course break whenever we try to use another implementation of virtio on powerpc which doesn't bypass the iommu. We don't have a way to differentiate a virtio device that does the bypass from one that doesn't, and the backward compatibility requirement forces us to treat all existing virtio devices as doing bypass.

The only way out of this while keeping the "platform" stuff would be to also bump some kind of version in the virtio config (or PCI header). I have no other way to differentiate between "this is an old qemu that doesn't do the 'bypass property' yet" and "this is a virtio device that doesn't bypass".

Any better idea?

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Tue, 2015-11-03 at 14:11 +0100, Christoph Hellwig wrote:
> > xHCI for example, vs. something like 10G ethernet... but yes, I agree it
> > sucks. I don't like that sort of policy anywhere in drivers. On the
> > other hand the platform doesn't have much information to make that sort
> > of decision either.
>
> Maybe because it should simply use what's optimal? E.g. passthrough
> whenever possible, where arguments against possible are: dma_mask, vfio
> requirements, kernel command line option.

Right, this is what I do today on powerpc, with the exception of the command line option.

> This is what a lot of
> architectures already do. I remember the SGI Origin / Altix code has the
> same behavior as well. Those IOMMUs already had the 64-bit passthrough
> and 32-bit sliding window in addition to the real IOMMU 10 years ago.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> >
> > Choosing on a per-mapping basis *in the back end* might still make some
>
> In my case, choosing the mapping based on the hardware that will use the
> mappings makes more sense. Most hardware is not as performance
> sensitive as Infiniband hardware.
...
> The driver knows for what hardware it is mapping the memory, so it knows
> if the memory will be used by performance-sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional hint, "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.

> In your case, what will give the better performance - 1:1 mapping or
> IOMMU mapping? When you say 'relaxing the protection', do you refer to
> 1:1 mapping? Also, how does this 1:1 window address the security
> concerns raised by others here?

It will, partially only, but it's just an example of another way the backend could provide some improved performance without a bypass.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 22:48 +0000, David Woodhouse wrote:
> In the Intel case, the mapping setup is entirely per-device (except for
> crap devices and devices behind a PCIe-PCI bridge, etc.).
>
> So you can happily have a passthrough mapping for *one* device, without
> making that same mapping available to another device. You can make that
> trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device...". The problem remains that it doesn't buy you much to do it *per-mapping*, which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough
> for *all* devices (except those whose dma_mask can't reach all of
> physical memory). But I could live with something like Shamir is
> proposing, which lets us do the bypass only for performance-sensitive
> devices which actually *ask* for it (and of course we'd have a system-
> wide mode which declines that request and does normal mappings anyway).
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote:
> To summarize -
>
> 1. The whole point of the IOMMU pass-through was to get bigger address space
>    and faster map/unmap operations for performance-critical hardware
> 2. SPARC IOMMU in particular has the ability to do DVMA, which addresses
>    all the protection concerns raised above. Not sure what the performance
>    impact will be, though. This still needs a lot of work before we could
>    test it.
> 3. On x86 we use the IOMMU in pass-through mode, so all the above concerns
>    are valid
>
> The questions are -
>
> 1. Does partial use of the IOMMU while the pass-through window is enabled
>    add some protection?
> 2. Would we rather have the x86 way of doing this, which is to enable/disable
>    IOMMU translations at the kernel level?
>
> I think that I can live with option (2) till I have DVMA, if there is strong
> disagreement on the need for per-allocation IOMMU bypass.

Choosing on a per-mapping basis *in the back end* might still make some amount of sense. What I don't completely grasp is what it gives you to expose that choice to the *driver* all the way up the chain. Why does the driver know better whether something should use the bypass or not?

I can imagine some in-between setups. For example, on POWER (and probably x86), I could set up a window that is TCE-mapped (TCEs are our iommu PTEs) but used to create a 1:1 mapping, i.e. a given TCE always maps to the same physical page. I could then use map/unmap to adjust the protection, the idea being that only "relaxing" the protection requires flushing the IO TLB, i.e., we could delay most flushes. But that sort of optimization only makes sense in the back-end.

So what was your original idea where you thought the driver was the right one to decide whether to use the bypass or not for a given map operation? That's what I don't grasp... you might have a valid case that I just fail to see.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional
> > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > achieved via using a bypass or other means in the backend is not the
> > business of the driver.
>
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes, I agree it sucks. I don't like that sort of policy anywhere in drivers. On the other hand, the platform doesn't have much information to make that sort of decision either.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> Not sure this use case is possible for Infiniband, where the application
> holds the data buffers and there is no way to force the application to
> reuse the buffer as suggested.
>
> This is why I think there will be no easy way to bypass the DMA mapping cost
> for all use cases unless we allow applications to request bypass/pass-through
> DMA mapping (implicitly or explicitly).

But but but... What I don't understand is how that brings you any safety.

Basically, either your bridge has a bypass window, or it doesn't. (Or it has one and it's enabled or not, same thing.)

If you request the bypass on a per-mapping basis, you basically have to keep the window always enabled, unless you do some nasty refcounting of how many people have a bypass mapping in flight, but that doesn't seem useful. Thus you have already lost all protection from the device, since your entire memory is accessible via the bypass mapping. Which means what is the point of then having non-bypass mappings for other things? Just to make things slower?

I really don't see what this whole "bypass on a per-mapping basis" buys you.

Note that we implicitly already do that on powerpc, but not for those reasons; we do it based on the DMA mask, so that if your coherent mask is 32-bit but your dma mask is 64-bit (which is not an uncommon combination), we service the "coherent" requests (basically the long-lived dma allocs) from the remapped region and the "stream" requests (i.e., map_page/map_sg) from the bypass.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> >
> > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > -paranoid mode where it's not honoured).
> >
> > Possibly, though ideally that would be a user policy, but of course by
> > the time you get to userspace it's generally too late.
>
> IIRC, we have an 'iommu=force' command line switch for this, to ensure
> that no device can use a linear mapping and everything goes through
> the page tables. This is often useful for both debugging and as a
> security measure when dealing with unprivileged DMA access (virtual
> machines, vfio, ...).

That was used to force-enable the iommu on platforms like G5s, where we would otherwise only do so if the memory was larger than 32-bit, but we never implemented using it to prevent the bypass region.

> If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly document
> which of the two we expect to take priority in cases where we have a
> choice.
>
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

The interesting thing, if we can make it work, is the bypass attribute being per-mapping...

Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt" <b...@kernel.crashing.org> wrote:
> >
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > >
> > > > On Power, I generally have 2 IOMMU windows for a device; the one at
> > > > the bottom is remapped, and is generally used for 32-bit devices, and
> > > > the one at the top is set up as a bypass
> > >
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access to
> > > all of memory, all of the time? And you have no protection against
> > > driver and firmware bugs causing stray DMA?
> >
> > Correct, we chose to do that for performance reasons.
>
> Could this be mitigated using pools? I don't know if the net code
> would play along easily.

Possibly. The pools we have already limit the lock contention, but we still have the map/unmap overhead, which under a hypervisor can be quite high. I'm not necessarily against changing the way we do things, but it would have to be backed up with numbers.

Cheers,
Ben.
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote: > We have discussed that at kernel summit. I will try to implement a dummy > dma_ops for > s390 that does 1:1 mapping and Ben will look into doing some quirk to handle > "old" > code in addition to also make it possible to mark devices as iommu bypass > (IIRC, > via device tree, Ben?) Something like that yes. I'll look into it when I'm back home. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote: > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > > > On Power, I generally have 2 IOMMU windows for a device, one at the > > bottom is remapped, and is generally used for 32-bit devices and the > > one at the top is set up as a bypass > > So in the normal case of decent 64-bit devices (and not in a VM), > they'll *already* be using the bypass region and have full access to > all of memory, all of the time? And you have no protection against > driver and firmware bugs causing stray DMA? Correct, we chose to do that for performance reasons. > > I don't see how that attribute would work for us. > > Because you're already doing it anyway without being asked :) > > If SPARC and POWER are both doing that, perhaps we should change the > default for Intel too? > > Aside from the lack of security, the other disadvantage of that is that > you have to pin *all* pages of a guest in case DMA happens; you don't > get to pin *only* those pages which are referenced by that guest's > IOMMU page tables... Correct, the problem is that the cost of doing map/unmap from a guest is really a huge hit on things like network devices. Another problem is that the failure mode isn't great if you don't pin. IE. You have to pin pages as they get mapped into the iommu by the guest, but you don't know in advance how much and you may hit the process ulimit on pinned pages half way through. We tried to address that in various ways but it always ended up horrid. > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a > sane meaning in the paranoid mode (and perhaps we'd want an ultra > -paranoid mode where it's not honoured). Possibly, though ideally that would be a user policy but of course by the time you get to userspace it's generally too late. Cheers, Ben. 
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Wed, 2015-10-28 at 22:31 +0900, David Woodhouse wrote: > We have an option in the Intel IOMMU for pass-through mode too, which > basically *is* a total bypass. In practice, what's the difference > between that and a "simple translation that does not require any > [translation]"? We set up a full 1:1 mapping of all memory, and then > the map/unmap methods become no-ops. > > Currently we have no way to request that mode on a per-device basis; > we > only have 'iommu=pt' on the command line to set it for *all* devices. > But performance-sensitive devices might want it, while we keep doing > proper translation for others. On Power, I generally have 2 IOMMU windows for a device, one at the bottom is remapped, and is generally used for 32-bit devices and the one at the top is set up as a bypass (or in the case of KVM, as a linear mapping of guest memory which looks the same as a bypass to the guest). The DMA ops will automatically hit the appropriate one based on the DMA mask. I don't see how that attribute would work for us. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-ppc] KVM memory slots limit on powerpc
On Fri, 2015-09-04 at 12:28 +0200, Thomas Huth wrote: > > Maybe some rcu protected scheme that doubles the amount of memslots > > for > > each overrun? Yes, that would be good and even reduce the footprint > > for > > systems with only a small number of memslots. > > Seems like Alex Williamson already posted a patchset for growable > memslots a couple of years ago: > > http://www.spinics.net/lists/kvm/msg50491.html > > But I didn't quite spot the result in that thread why it never has > been > included upstream. Alex (W.), do you remember the outcome? Isn't the memslot array *already* protected by RCU anyway ? Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: ppc: Fix size of the PSPB register
On Wed, 2015-09-02 at 08:24 +1000, Paul Mackerras wrote: > On Tue, Sep 01, 2015 at 11:41:18PM +0200, Thomas Huth wrote: > > The size of the Problem State Priority Boost Register is only > > 32 bits, so let's change the type of the corresponding variable > > accordingly to avoid future trouble. > > Since we're already using lwz/stw in the assembly code in > book3s_hv_rmhandlers.S, this is actually a bug fix, isn't it? > How did you find it? Did you observe a failure of some kind, or did > you just find it by code inspection? Won't the fix break migration ? Unless qemu doesn't migrate it ... > Paul. > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: ppc: Fix size of the PSPB register
On Wed, 2015-09-02 at 08:45 +1000, Paul Mackerras wrote: > On Wed, Sep 02, 2015 at 08:25:05AM +1000, Benjamin Herrenschmidt > wrote: > > On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote: > > > The size of the Problem State Priority Boost Register is only > > > 32 bits, so let's change the type of the corresponding variable > > > accordingly to avoid future trouble. > > > > It's not future trouble, it's broken today for LE and this should > > fix > > it BUT > > No, it's broken today for BE hosts, which will always see 0 for the > PSPB register value. LE hosts are fine. 0 or PSPB << 32 ? > > The asm accesses it using lwz/stw and C accesses it as a ulong. On > > LE > > that will mean that userspace will see the value << 32 > > No, that will happen on BE, and since KVM_REG_PPC_PSPB says it's a > 32-bit register, we'll just pass 0 back to userspace when it reads > it. Ah ok, I missed that bit about KVM_REG_PPC_PSPB > > Now "fixing" it might break migration if that field is already > > stored/loaded in its "broken" form. We may have to keep the > > "broken" > > behaviour and document that qemu sees a value shifted by 32. > > It will be being set to 0 on BE hosts across migration today > (fortunately 0 is a benign value for PSPB). If we fix this on both > the source and destination host, then the value will get migrated > across correctly. Ok, I missed the part where KVM_REG_PPC_PSPB passed it down as a 32-bit. That means Thomas's patch should work indeed. > I think Thomas's patch is fine, it just needs a stronger patch > description saying that it fixes an actual bug. Right. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: ppc: Fix size of the PSPB register
On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote: > The size of the Problem State Priority Boost Register is only > 32 bits, so let's change the type of the corresponding variable > accordingly to avoid future trouble. It's not future trouble, it's broken today for LE and this should fix it BUT The asm accesses it using lwz/stw and C accesses it as a ulong. On LE that will mean that userspace will see the value << 32 Now "fixing" it might break migration if that field is already stored/loaded in its "broken" form. We may have to keep the "broken" behaviour and document that qemu sees a value shifted by 32. Cheers, Ben. > Signed-off-by: Thomas Huth> --- > arch/powerpc/include/asm/kvm_host.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/kvm_host.h > b/arch/powerpc/include/asm/kvm_host.h > index d91f65b..c825f3a 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -473,7 +473,7 @@ struct kvm_vcpu_arch { > ulong ciabr; > ulong cfar; > ulong ppr; > - ulong pspb; > + u32 pspb; > ulong fscr; > ulong shadow_fscr; > ulong ebbhr; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: sleeping function called from ras_epow_interrupt context
On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote: > Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use > mdelay() instead of msleep() in rtas_busy_delay()? Something more > fancy? A proper fix would be more fancy: the get_sensor should happen in a kernel thread instead. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/25] powerpc: Use bool function return values of true/false not 1/0
On Mon, 2015-03-30 at 16:46 -0700, Joe Perches wrote: Use the normal return values for bool functions Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org Should we merge it or will you ? Cheers, Ben. Signed-off-by: Joe Perches j...@perches.com --- arch/powerpc/include/asm/dcr-native.h| 2 +- arch/powerpc/include/asm/dma-mapping.h | 4 ++-- arch/powerpc/include/asm/kvm_book3s_64.h | 4 ++-- arch/powerpc/sysdev/dcr.c| 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/dcr-native.h b/arch/powerpc/include/asm/dcr-native.h index 7d2e623..4efc11d 100644 --- a/arch/powerpc/include/asm/dcr-native.h +++ b/arch/powerpc/include/asm/dcr-native.h @@ -31,7 +31,7 @@ typedef struct { static inline bool dcr_map_ok_native(dcr_host_native_t host) { - return 1; + return true; } #define dcr_map_native(dev, dcr_n, dcr_c) \ diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 894d538..9103687 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -191,11 +191,11 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) struct dev_archdata *sd = dev-archdata; if (sd-max_direct_dma_addr addr + size sd-max_direct_dma_addr) - return 0; + return false; #endif if (!dev-dma_mask) - return 0; + return false; return addr + size - 1 = *dev-dma_mask; } diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 2d81e20..2a244bf 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -335,7 +335,7 @@ static inline bool hpte_read_permission(unsigned long pp, unsigned long key) { if (key) return PP_RWRX = pp pp = PP_RXRX; - return 1; + return true; } static inline bool hpte_write_permission(unsigned long pp, unsigned long key) @@ -373,7 +373,7 @@ static inline bool slot_is_aligned(struct kvm_memory_slot *memslot, unsigned long mask = (pagesize 
PAGE_SHIFT) - 1; if (pagesize = PAGE_SIZE) - return 1; + return true; return !(memslot-base_gfn mask) !(memslot-npages mask); } diff --git a/arch/powerpc/sysdev/dcr.c b/arch/powerpc/sysdev/dcr.c index 2d8a101..121e26f 100644 --- a/arch/powerpc/sysdev/dcr.c +++ b/arch/powerpc/sysdev/dcr.c @@ -54,7 +54,7 @@ bool dcr_map_ok_generic(dcr_host_t host) else if (host.type == DCR_HOST_MMIO) return dcr_map_ok_mmio(host.host.mmio); else - return 0; + return false; } EXPORT_SYMBOL_GPL(dcr_map_ok_generic); -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: PPC: Remove page table walk helpers
On Mon, 2015-03-30 at 10:39 +0530, Aneesh Kumar K.V wrote: This patch remove helpers which we had used only once in the code. Limiting page table walk variants help in ensuring that we won't end up with code walking page table with wrong assumptions. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Alex, it would be preferable to have this (and the previous one) in the powerpc tree due to dependencies with further fixes to our page table walking vs. THP, so if you're happy, just give us an ack. Cheers, Ben. --- arch/powerpc/include/asm/pgtable.h | 21 - arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 - arch/powerpc/kvm/e500_mmu_host.c| 2 +- 3 files changed, 28 insertions(+), 57 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 9835ac4173b7..92fe01c355a9 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -249,27 +249,6 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, #endif pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift); - -static inline pte_t *lookup_linux_ptep(pgd_t *pgdir, unsigned long hva, - unsigned long *pte_sizep) -{ - pte_t *ptep; - unsigned long ps = *pte_sizep; - unsigned int shift; - - ptep = find_linux_pte_or_hugepte(pgdir, hva, shift); - if (!ptep) - return NULL; - if (shift) - *pte_sizep = 1ul shift; - else - *pte_sizep = PAGE_SIZE; - - if (ps *pte_sizep) - return NULL; - - return ptep; -} #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 625407e4d3b0..73e083cb9f7e 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -131,25 +131,6 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index, unlock_rmap(rmap); } -static pte_t lookup_linux_pte_and_update(pgd_t *pgdir, unsigned long hva, - int writing, unsigned long *pte_sizep) -{ - pte_t *ptep; - 
unsigned long ps = *pte_sizep; - unsigned int hugepage_shift; - - ptep = find_linux_pte_or_hugepte(pgdir, hva, hugepage_shift); - if (!ptep) - return __pte(0); - if (hugepage_shift) - *pte_sizep = 1ul hugepage_shift; - else - *pte_sizep = PAGE_SIZE; - if (ps *pte_sizep) - return __pte(0); - return kvmppc_read_update_linux_pte(ptep, writing, hugepage_shift); -} - static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v) { asm volatile(PPC_RELEASE_BARRIER : : : memory); @@ -166,10 +147,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, struct revmap_entry *rev; unsigned long g_ptel; struct kvm_memory_slot *memslot; - unsigned long pte_size; + unsigned hpage_shift; unsigned long is_io; unsigned long *rmap; - pte_t pte; + pte_t *ptep; unsigned int writing; unsigned long mmu_seq; unsigned long rcbits; @@ -208,22 +189,33 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Translate to host virtual address */ hva = __gfn_to_hva_memslot(memslot, gfn); + ptep = find_linux_pte_or_hugepte(pgdir, hva, hpage_shift); + if (ptep) { + pte_t pte; + unsigned int host_pte_size; - /* Look up the Linux PTE for the backing page */ - pte_size = psize; - pte = lookup_linux_pte_and_update(pgdir, hva, writing, pte_size); - if (pte_present(pte) !pte_protnone(pte)) { - if (writing !pte_write(pte)) - /* make the actual HPTE be read-only */ - ptel = hpte_make_readonly(ptel); - is_io = hpte_cache_bits(pte_val(pte)); - pa = pte_pfn(pte) PAGE_SHIFT; - pa |= hva (pte_size - 1); - pa |= gpa ~PAGE_MASK; - } + if (hpage_shift) + host_pte_size = 1ul hpage_shift; + else + host_pte_size = PAGE_SIZE; + /* + * We should always find the guest page size + * to = host page size, if host is using hugepage + */ + if (host_pte_size psize) + return H_PARAMETER; - if (pte_size psize) - return H_PARAMETER; + pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift); + if (pte_present(pte) !pte_protnone(pte)) { + if (writing !pte_write(pte)) + /* make the actual HPTE be 
read-only */ +
Re: [PATCH v5 25/29] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
On Wed, 2015-03-11 at 19:54 +1100, Alexey Kardashevskiy wrote:

 +/* Page size flags for ibm,query-pe-dma-window */
 +#define DDW_PGSIZE_4K   0x01
 +#define DDW_PGSIZE_64K  0x02
 +#define DDW_PGSIZE_16M  0x04
 +#define DDW_PGSIZE_32M  0x08
 +#define DDW_PGSIZE_64M  0x10
 +#define DDW_PGSIZE_128M 0x20
 +#define DDW_PGSIZE_256M 0x40
 +#define DDW_PGSIZE_16G  0x80
 +#define DDW_PGSIZE_MASK 0xFF
 +
  struct iommu_table_group {
  #ifdef CONFIG_IOMMU_API
 	struct iommu_group *group;
  #endif
 +	/* Some key properties of IOMMU */
 +	__u32 tce32_start;
 +	__u32 tce32_size;
 +	__u32 max_dynamic_windows_supported;
 +	__u32 max_levels;
 +	__u32 flags;

 Just realized that due to their static nature, they are better to be in
 iommu_table_group_ops, will fix it in v6.

Ugh ? I dislike mixing function pointers and other fields, even if static.

If you *really* want to separate them, make them a struct iommu_table_info
and declare a const member. Otherwise don't bother and leave them where
they are.

 +	struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct iommu_table_group_ops *ops;
  };

 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
 index ed60b38..07857c4 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -48,6 +48,7 @@
  #include "pci.h"

  #define POWERNV_IOMMU_DEFAULT_LEVELS	1
 +#define POWERNV_IOMMU_MAX_LEVELS	5

  extern void ioda_eeh_tvt_print(struct pnv_phb *phb);

 @@ -1155,11 +1156,14 @@ static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
 	pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
  }

 +static void pnv_pci_free_table(struct iommu_table *tbl);
 +
  struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build_vm,
 	.exchange = pnv_ioda1_tce_xchg_vm,
 	.clear = pnv_ioda1_tce_free_vm,
 	.get = pnv_tce_get,
 +	.free = pnv_pci_free_table
  };

  static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 @@ -1317,6 +1321,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				TCE_PCI_SWINV_PAIR);
 	}
 	tbl->it_ops = &pnv_ioda1_iommu_ops;
 +	pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
 +	pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
 +	pe->table_group.max_dynamic_windows_supported = 0;
 +	pe->table_group.max_levels = 0;
 +	pe->table_group.flags = 0;
 	iommu_init_table(tbl, phb->hose->node);
 	iommu_register_group(&pe->table_group, phb->hose->global_number,
 			pe->pe_number);

 @@ -1401,7 +1410,7 @@ static __be64 *pnv_alloc_tce_table(int nid,
  }

  static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 -		__u32 page_shift, __u64 window_size, __u32 levels,
 +		int num, __u32 page_shift, __u64 window_size, __u32 levels,
 		struct iommu_table *tbl)
  {
 	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
 @@ -1428,8 +1437,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 	shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels;
 	shift += 3;
 	shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
 -	pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
 -			window_size, levels, 1UL << shift);
 +	pr_info("Creating TCE table #%d %08llx, %d levels, TCE table size = %lx\n",
 +			num, window_size, levels, 1UL << shift);

 	tbl->it_level_size = 1ULL << (shift - 3);
 	left = tce_table_size;
 @@ -1440,11 +1449,10 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 	tbl->it_indirect_levels = levels - 1;

 	/* Setup linux iommu table */
 -	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 -			page_shift);
 +	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size,
 +			num ? pe->tce_bypass_base : 0, page_shift);
 	tbl->it_ops = &pnv_ioda2_iommu_ops;
 -	iommu_init_table(tbl, nid);

 	return 0;
  }
 @@ -1461,8 +1469,21 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
 	iommu_reset_table(tbl, "ioda2");
  }

 +static inline void pnv_pci_ioda2_tvt_invalidate(unsigned int pe_number,
 +		unsigned long it_index)
 +{
 +	__be64 __iomem *invalidate = (__be64 __iomem *)it_index;
 +	/* 01xb - invalidate TCEs that match the specified PE# */
 +	unsigned long addr = (0x4ull << 60) | (pe_number & 0xFF);
 +
 +	if (!it_index)
 +		return;
 +
 +	__raw_writeq(cpu_to_be64(addr), invalidate);
 +}
 +
  static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 -		struct iommu_table
Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size
On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:

  return (PAGE_SHIFT + compound_order(compound_head(page)) >= page_shift);

 This won't be bool though.

Yes, it will. Don't you have your parenthesis in the wrong place, Alex ? :-)

This will (I'll do this):

 shift = PAGE_SHIFT + compound_order(compound_head(page));
 return (shift >= page_shift);
Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support
On Wed, 2014-10-22 at 16:17 +0200, Jan Kiszka wrote: I thought about this again, and I'm not sure anymore if we can use ACPI to black-list the incompatible virtio devices. Reason: hotplug. To my understanding, the ACPI DRHD tables won't change during runtime when a device shows up or disappears. We would have to isolate virtio devices from the rest of the system by using separate buses for it (and avoid listing those in any DRHD table) and enforce that they only get plugged into those buses. I suppose that is not desirable. Maybe it's better to fix virtio /wrt IOMMUs. I always go back to my initial proposal which is to define that current virtio always bypass any iommu (which is what it does really) and have it expose via a new capability if that isn't the case. That means fixing that Xen thingy to allow qemu to know what to expose I assume but that seems to be the less bad approach. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] powerpc/kvm: support to handle sw breakpoint
On Mon, 2014-08-11 at 09:26 +0200, Alexander Graf wrote: diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c index da86d9b..d95014e 100644 --- a/arch/powerpc/kvm/emulate.c +++ b/arch/powerpc/kvm/emulate.c This should be book3s_emulate.c. Any reason we can't make that 0000 opcode as breakpoint common to all powerpc variants ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/5] powerpc/eeh: Export eeh_iommu_group_to_pe()
On Thu, 2014-08-07 at 12:47 +1000, Gavin Shan wrote:

 The function is used by VFIO driver, which might be built as a dynamic module. So it should be exported.

 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

Alex, are you taking this or should I ?

 ---
 arch/powerpc/kernel/eeh.c | 1 +
 1 file changed, 1 insertion(+)

 diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
 index 6043879..59a64f8 100644
 --- a/arch/powerpc/kernel/eeh.c
 +++ b/arch/powerpc/kernel/eeh.c
 @@ -1254,6 +1254,7 @@ struct eeh_pe *eeh_iommu_group_to_pe(struct iommu_group *group)

 	return edev->pe;
  }
 +EXPORT_SYMBOL_GPL(eeh_iommu_group_to_pe);

  #endif /* CONFIG_IOMMU_API */
Re: [PATCH v4 2/5] powerpc/eeh: Add warning message in eeh_dev_open()
On Thu, 2014-08-07 at 12:47 +1000, Gavin Shan wrote:

 The patch adds one warning message in eeh_dev_open() in case the PCI device can't be marked as passed through.

 Suggested-by: Alexey Kardashevskiy a...@ozlabs.ru
 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

Alex, are you taking this or should I ?

 arch/powerpc/kernel/eeh.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
 index 59a64f8..5d73a49 100644
 --- a/arch/powerpc/kernel/eeh.c
 +++ b/arch/powerpc/kernel/eeh.c
 @@ -1162,8 +1162,11 @@ int eeh_dev_open(struct pci_dev *pdev)

 	/* No EEH device or PE ? */
 	edev = pci_dev_to_eeh_dev(pdev);
 -	if (!edev || !edev->pe)
 +	if (!edev || !edev->pe) {
 +		pr_warn_once("%s: PCI device %s not supported\n",
 +			     __func__, pci_name(pdev));
 		goto out;
 +	}

 	/* Increase PE's pass through count */
 	atomic_inc(&edev->pe->pass_dev_cnt);
Re: [PATCH v2 4/4] vfio_pci: spapr: Enable VFIO if EEH is not supported
On Tue, 2014-08-05 at 21:44 -0600, Alex Williamson wrote:

 		ret = vfio_spapr_pci_eeh_open(vdev->pdev);
 -		if (ret) {
 -			vfio_pci_disable(vdev);
 -			goto error;
 -		}
 +		if (ret)
 +			pr_warn_once("EEH is not supported\n");
 	}

 	return 0;

 Now the next question, what's the point of vfio_spapr_pci_eeh_open() returning a value? Couldn't it return void now and this warning can go into eeh specific code? Thanks,

In order to call vfio_pci_disable() when that happens ?

Cheers, Ben.
Re: [PULL 16/63] PPC: Add asm helpers for BE 32bit load/store
On Fri, 2014-08-01 at 11:17 +0200, Alexander Graf wrote:

 From assembly code we might not only have to explicitly BE access 64bit values, but sometimes also 32bit ones. Add helpers that allow for easy use of lwzx/stwx in their respective byte-reverse or native form.

 Signed-off-by: Alexander Graf ag...@suse.de

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

 ---
 arch/powerpc/include/asm/asm-compat.h | 4 ++++
 1 file changed, 4 insertions(+)

 diff --git a/arch/powerpc/include/asm/asm-compat.h b/arch/powerpc/include/asm/asm-compat.h
 index 4b237aa..21be8ae 100644
 --- a/arch/powerpc/include/asm/asm-compat.h
 +++ b/arch/powerpc/include/asm/asm-compat.h
 @@ -34,10 +34,14 @@
  #define PPC_MIN_STKFRM 112

  #ifdef __BIG_ENDIAN__
 +#define LWZX_BE	stringify_in_c(lwzx)
  #define LDX_BE	stringify_in_c(ldx)
 +#define STWX_BE	stringify_in_c(stwx)
  #define STDX_BE	stringify_in_c(stdx)
  #else
 +#define LWZX_BE	stringify_in_c(lwbrx)
  #define LDX_BE	stringify_in_c(ldbrx)
 +#define STWX_BE	stringify_in_c(stwbrx)
  #define STDX_BE	stringify_in_c(stdbrx)
  #endif
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote:

 Hope I am right. Take the following seq as an example:

 	if (hptep[0] & HPTE_V_VALID) {
 		/* HPTE was previously valid, so we need to invalidate it */
 		unlock_rmap(rmap);
 		hptep[0] |= HPTE_V_ABSENT;
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 	} else {
 		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	}
 	<- if we try_to_unmap on pfn at here, then @r contains an invalid pfn
 	hptep[1] = r;
 	eieio();
 	hptep[0] = hpte[0];
 	asm volatile("ptesync" : : : "memory");

If that was the case we would have the same race in kvmppc_do_h_enter(). I think the fact that the HPTE is locked will prevent the race, ie, HPTE_V_HVLOCK is set until hptep[0] is written to.

If I look at the unmap case, my understanding is that it uses kvm_unmap_rmapp() which will also lock the HPTE (try_lock_hpte) and so shouldn't have a race vs the above code. Or do you see a race I don't ?

Cheers, Ben.

 Thx. Fan

 On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote:
  On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
   In current code, the setup of hpte is under the risk of race with mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn. Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.
  Please describe the race you think you see. I'm quite sure both Paul and I went over that code and somewhat convinced ourselves that it was ok but it's possible that we were both wrong :-)
  Cheers, Ben.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..e6dcff4 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,

 	if (hptep[0] & HPTE_V_VALID) {
 		/* HPTE was previously valid, so we need to invalidate it */
 -		unlock_rmap(rmap);
 		hptep[0] |= HPTE_V_ABSENT;
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 +
 +		hptep[1] = r;
 +		eieio();
 +		hptep[0] = hpte[0];
 +		asm volatile("ptesync" : : : "memory");
 +		unlock_rmap(rmap);
 	} else {
 +		hptep[1] = r;
 +		eieio();
 +		hptep[0] = hpte[0];
 +		asm volatile("ptesync" : : : "memory");
 		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	}

 -	hptep[1] = r;
 -	eieio();
 -	hptep[0] = hpte[0];
 -	asm volatile("ptesync" : : : "memory");
 	preempt_enable();
 	if (page && hpte_is_writable(r))
 		SetPageDirty(page);
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:

 In current code, the setup of hpte is under the risk of race with mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn. Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.

Please describe the race you think you see. I'm quite sure both Paul and I went over that code and somewhat convinced ourselves that it was ok but it's possible that we were both wrong :-)

Cheers, Ben.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..e6dcff4 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,

 	if (hptep[0] & HPTE_V_VALID) {
 		/* HPTE was previously valid, so we need to invalidate it */
 -		unlock_rmap(rmap);
 		hptep[0] |= HPTE_V_ABSENT;
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 +
 +		hptep[1] = r;
 +		eieio();
 +		hptep[0] = hpte[0];
 +		asm volatile("ptesync" : : : "memory");
 +		unlock_rmap(rmap);
 	} else {
 +		hptep[1] = r;
 +		eieio();
 +		hptep[0] = hpte[0];
 +		asm volatile("ptesync" : : : "memory");
 		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	}

 -	hptep[1] = r;
 -	eieio();
 -	hptep[0] = hpte[0];
 -	asm volatile("ptesync" : : : "memory");
 	preempt_enable();
 	if (page && hpte_is_writable(r))
 		SetPageDirty(page);
Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection
On Mon, 2014-07-21 at 16:06 +0800, Mike Qiu wrote:

  I don't like this. I much prefer have dedicated error injection files in their respective locations, something for PCI under the corresponding PCI bridge etc...

 So PowerNV error injection will be designed to rely on debugfs being configured, right?

Not necessarily. If we create a better debugfs layout for our PHBs, then yes. It might be useful to provide more info in there, for example access to some of the counters ...

But on the other hand, for error injection in general, I wonder if we should be under sysfs instead... something to study a bit.

Cheers, Ben.
Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection
On Tue, 2014-07-22 at 11:10 +0800, Mike Qiu wrote:

 On 07/22/2014 06:49 AM, Benjamin Herrenschmidt wrote:
  On Mon, 2014-07-21 at 16:06 +0800, Mike Qiu wrote:
   I don't like this. I much prefer have dedicated error injection files in their respective locations, something for PCI under the corresponding PCI bridge etc...
   So PowerNV error injection will be designed to rely on debugfs being configured, right?
  Not necessarily. If we create a better debugfs layout for our PHBs, then yes. It might be useful to provide more info in there, for example access to some of the counters ... But on the other hand, for error injection in general, I wonder if we should be under sysfs instead... something to study a bit.

 In pHyp, general error injection uses a syscall:

 	#define __NR_rtas 255

 I don't know if it is a good idea to reuse this syscall for PowerNV. At least, it is another choice that doesn't rely on sysfs.

No, we certainly don't want that RTAS stuff. I thought Linux had some kind of error injection infrastructure nowadays... somebody needs to have a look.

Cheers, Ben.
Re: [PATCH 0/6] Use virtual page class key protection mechanism for speeding up guest page fault
On Sun, 2014-06-29 at 16:47 +0530, Aneesh Kumar K.V wrote:

 To achieve the above we use the virtual page class protection mechanism for covering (2) and (3). For both the above cases we mark the hpte valid, but associate the page with virtual page class index 30 and 31. The authority mask register is configured such that class index 30 and 31 will have read/write denied. The above change results in a key fault for (2) and (3). This allows us to forward a NO_HPTE fault directly to the guest without doing the expensive hash pagetable lookup.

So we have a measurable performance benefit (about half a second out of 8) but you didn't explain the drawback here, which is to essentially make it impossible for guests to exploit virtual page class keys, or did you find a way to still make that possible ?

As it is, it's not a huge issue for Linux but we might have to care with other OSes that do care... Do we have a way in PAPR to signify to the guest that the keys are not available ?

Cheers, Ben.
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Tue, 2014-06-24 at 12:41 +0200, Alexander Graf wrote: Is there actually any difference in generated code with this patch applied and without? I would hope that iowrite..() is inlined and cancels out the cpu_to_le..() calls that are also inlined? No, the former uses byteswapping asm, the compiler can't cancel it out, but the overhead of the additional byteswap might not be measurable. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote: I do not understand why @val is considered LE here and need to be converted to CPU. Really. I truly believe it should be cpu_to_le32(). No. Both are slightly wrong semantically but le32_to_cpu() is less wrong :-) iowrite32 supposedly takes a cpu value as argument and writes an le value. So if anything, you need something that converts to a cpu value before you call iowrite32. Now it's still slightly wrong because the input to le32_to_cpu() is supposed to be an LE value but of course here it's not, it's a cpu value too :-) But yes, I agree with aw here, either do nothing or stick a set of iowriteXX_native or something equivalent in the generic iomap header, define them in term of iowrite32be/iowrite32 based on the compile time endian of the arch. Hitting asm-generic/iomap.h I think will cover all archs except ARM. For ARM, just hack arch/arm/asm/io.h Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection
Is it reasonable to do error injection with CONFIG_IOMMU_API ? That means if use default config(CONFIG_IOMMU_API = n), we can not do error injection to pci devices? Well we can't pass them through either so ... In any case, this is not a priority. First we need to implement a solid error injection facility for the *host*. The guest one is really really low on the list. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection
On Tue, 2014-06-24 at 14:57 +0800, Mike Qiu wrote: Is that mean *host* side error injection should base on CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through), can't we do error inject? Maybe I misunderstand :) Ah no, make different patches, we don't want to use IOMMU group ID, just PE numbers. Maybe we should expose in sysfs the PEs from the platform code with the error injection files underneath ... Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection
On Wed, 2014-06-25 at 11:05 +0800, Mike Qiu wrote: Here maybe /sys/kernel/debug/powerpc/errinjct is better, because it will supply PCI_domain_nr in parameters, so no need supply errinjct for each PCI domain. Another reason is error inject not only for PCI(in future), so better not in PCI domain entry. Also it simple for userland tools to has a fixed path. I don't like this. I much prefer have dedicated error injection files in their respective locations, something for PCI under the corresponding PCI bridge etc... Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/3] powerpc/powernv: Sync header with firmware
On Mon, 2014-06-23 at 12:14 +1000, Gavin Shan wrote:

 The patch synchronizes firmware header file (opal.h) for PCI error injection.

The FW API you expose is not PCI specific. I haven't seen the corresponding FW patches yet but I'm not a fan of that single call that collates unrelated things.

I'd much prefer to see an opal_pci_err_inject that is specific to IO(D)A errors, which takes a PHB ID and goes via the normal dispatch to PHB ops inside OPAL. For the rest, especially core specific injections, we can provide a separate dedicated call.

Cheers, Ben.

 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
 arch/powerpc/include/asm/opal.h                | 65 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 2 files changed, 66 insertions(+)

 diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
 index 66ad7a7..d982bb8 100644
 --- a/arch/powerpc/include/asm/opal.h
 +++ b/arch/powerpc/include/asm/opal.h
 @@ -175,6 +175,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
  #define OPAL_SET_PARAM		90
  #define OPAL_DUMP_RESEND	91
  #define OPAL_DUMP_INFO2		94
 +#define OPAL_ERR_INJECT		96

  #ifndef __ASSEMBLY__

 @@ -219,6 +220,69 @@ enum OpalPciErrorSeverity {
 	OPAL_EEH_SEV_INF = 5
  };

 +enum OpalErrinjctType {
 +	OpalErrinjctTypeFirst			= 0,
 +	OpalErrinjctTypeFatal			= 1,
 +	OpalErrinjctTypeRecoverRandomEvent	= 2,
 +	OpalErrinjctTypeRecoverSpecialEvent	= 3,
 +	OpalErrinjctTypeCorruptedPage		= 4,
 +	OpalErrinjctTypeCorruptedSlb		= 5,
 +	OpalErrinjctTypeTranslatorFailure	= 6,
 +	OpalErrinjctTypeIoaBusError		= 7,
 +	OpalErrinjctTypeIoaBusError64		= 8,
 +	OpalErrinjctTypePlatformSpecific	= 9,
 +	OpalErrinjctTypeDcacheStart		= 10,
 +	OpalErrinjctTypeDcacheEnd		= 11,
 +	OpalErrinjctTypeIcacheStart		= 12,
 +	OpalErrinjctTypeIcacheEnd		= 13,
 +	OpalErrinjctTypeTlbStart		= 14,
 +	OpalErrinjctTypeTlbEnd			= 15,
 +	OpalErrinjctTypeUpstreamIoError		= 16,
 +	OpalErrinjctTypeLast			= 17,
 +
 +	/* IoaBusError & IoaBusError64 */
 +	OpalEitIoaLoadMemAddr		= 0,
 +	OpalEitIoaLoadMemData		= 1,
 +	OpalEitIoaLoadIoAddr		= 2,
 +	OpalEitIoaLoadIoData		= 3,
 +	OpalEitIoaLoadConfigAddr	= 4,
 +	OpalEitIoaLoadConfigData	= 5,
 +	OpalEitIoaStoreMemAddr		= 6,
 +	OpalEitIoaStoreMemData		= 7,
 +	OpalEitIoaStoreIoAddr		= 8,
 +	OpalEitIoaStoreIoData		= 9,
 +	OpalEitIoaStoreConfigAddr	= 10,
 +	OpalEitIoaStoreConfigData	= 11,
 +	OpalEitIoaDmaReadMemAddr	= 12,
 +	OpalEitIoaDmaReadMemData	= 13,
 +	OpalEitIoaDmaReadMemMaster	= 14,
 +	OpalEitIoaDmaReadMemTarget	= 15,
 +	OpalEitIoaDmaWriteMemAddr	= 16,
 +	OpalEitIoaDmaWriteMemData	= 17,
 +	OpalEitIoaDmaWriteMemMaster	= 18,
 +	OpalEitIoaDmaWriteMemTarget	= 19,
 +};
 +
 +struct OpalErrinjct {
 +	int32_t type;
 +	union {
 +		struct {
 +			uint32_t addr;
 +			uint32_t mask;
 +			uint64_t phb_id;
 +			uint32_t pe;
 +			uint32_t function;
 +		} ioa;
 +		struct {
 +			uint64_t addr;
 +			uint64_t mask;
 +			uint64_t phb_id;
 +			uint32_t pe;
 +			uint32_t function;
 +		} ioa64;
 +	};
 +};
 +
  enum OpalShpcAction {
 	OPAL_SHPC_GET_LINK_STATE = 0,
 	OPAL_SHPC_GET_SLOT_STATE = 1
 @@ -870,6 +934,7 @@ int64_t opal_update_flash(uint64_t blk_list);
  int64_t opal_dump_init(uint8_t dump_type);
  int64_t opal_dump_info(__be32 *dump_id, __be32 *dump_size);
  int64_t opal_dump_info2(__be32 *dump_id, __be32 *dump_size, __be32 *dump_type);
 +int64_t opal_err_injct(struct OpalErrinjct *ei);
  int64_t opal_dump_read(uint32_t dump_id, uint64_t buffer);
  int64_t opal_dump_ack(uint32_t dump_id);
  int64_t opal_dump_resend_notification(void);

 diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
 index f531ffe..44b3d81 100644
 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
 +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
 @@ -136,6 +136,7 @@ OPAL_CALL(opal_resync_timebase, OPAL_RESYNC_TIMEBASE);
 OPAL_CALL(opal_dump_init, OPAL_DUMP_INIT);
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:

 Working on big endian being an accident may be a matter of perspective :-) The comment remains that this patch doesn't actually fix anything except the overhead on big endian systems doing redundant byte swapping and maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and thus shouldn't perform any endian swapping of any sort, which remains the responsibility of the end driver which is the only one to know whether a given BAR location is a register or some streaming data and in the former case whether it's LE or BE (some PCI devices are BE even ! :-)

But yes, in the end, it works with the dual cancelling swaps and the overhead of those swaps is probably drowned in the noise of the syscall overhead.

 I'm still not a fan of iowrite vs iowritebe, there must be something we can use that doesn't have an implicit swap.

Sadly there isn't ... In the old day we didn't even have the be variant and readl/writel style accessors still don't have them either for all archs. There is __raw_readl/writel but here the semantics are much more than just don't swap, they also don't have memory barriers (which means they are essentially useless to most drivers unless those are platform specific drivers which know exactly what they are doing, or in the rare cases such as accessing a framebuffer which we know never have side effects).

 Calling it iowrite*_native is also an abuse of the namespace. Next thing we know some common code will legitimately use that name.

It might make sense to move those definitions into a common header. There have been a handful of cases in the past that wanted that sort of native byte order MMIOs iirc (though don't ask me for examples, I can't really remember).

 If we do need to define an alias (which I'd like to avoid) it should be something like vfio_iowrite32. Thanks,

Cheers, Ben.

 Alex

 === any better?

 Suggested-by: Benjamin Herrenschmidt b...@kernel.crashing.org
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 drivers/vfio/pci/vfio_pci_rdwr.c | 20
 1 file changed, 16 insertions(+), 4 deletions(-)

 diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
 index 210db24..f363b5a 100644
 --- a/drivers/vfio/pci/vfio_pci_rdwr.c
 +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
 @@ -21,6 +21,18 @@

  #include "vfio_pci_private.h"

 +#ifdef __BIG_ENDIAN__
 +#define ioread16_native		ioread16be
 +#define ioread32_native		ioread32be
 +#define iowrite16_native	iowrite16be
 +#define iowrite32_native	iowrite32be
 +#else
 +#define ioread16_native		ioread16
 +#define ioread32_native		ioread32
 +#define iowrite16_native	iowrite16
 +#define iowrite32_native	iowrite32
 +#endif
 +
  /*
   * Read or write from an __iomem region (MMIO or I/O port) with an excluded
   * range which is inaccessible. The excluded range drops writes and fills
 @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
 			if (copy_from_user(&val, buf, 4))
 				return -EFAULT;

 -			iowrite32(le32_to_cpu(val), io + off);
 +			iowrite32_native(val, io + off);
 		} else {
 -			val = cpu_to_le32(ioread32(io + off));
 +			val = ioread32_native(io + off);

 			if (copy_to_user(buf, &val, 4))
 				return -EFAULT;
 @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
 			if (copy_from_user(&val, buf, 2))
 				return -EFAULT;

 -			iowrite16(le16_to_cpu(val), io + off);
 +			iowrite16_native(val, io + off);
 		} else {
 -			val = cpu_to_le16(ioread16(io + off));
 +			val = ioread16_native(io + off);

 			if (copy_to_user(buf, &val, 2))
 				return -EFAULT;
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Sat, 2014-06-21 at 00:14 +1000, Alexey Kardashevskiy wrote:

 We can still use __raw_writel() & co, would that be ok?

No, unless you understand precisely what kind of memory barriers each platform requires for these.

Cheers, Ben.
Re: [PATCH] powerpc/kvm: support to handle sw breakpoint
On Tue, 2014-06-17 at 10:54 +0200, Alexander Graf wrote: Also, why don't we use twi always or something else that actually is defined as illegal instruction? I would like to see this shared with book3s_32 PR. twi will be directed to the guest on HV no ? We want a real illegal because those go to the host (for potential emulation by the HV). I'm trying to see if I can get the architect to set one in stone in a future proof way. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc/kvm: support to handle sw breakpoint
On Tue, 2014-06-17 at 11:25 +0200, Alexander Graf wrote:

 On 17.06.14 11:22, Benjamin Herrenschmidt wrote:
  On Tue, 2014-06-17 at 10:54 +0200, Alexander Graf wrote:
   Also, why don't we use twi always or something else that actually is defined as illegal instruction? I would like to see this shared with book3s_32 PR.
  twi will be directed to the guest on HV no ? We want a real illegal because those go to the host (for potential emulation by the HV).

 Ah, good point. I guess we need different one for PR and HV then to ensure compatibility with older ISAs on PR.

Well, we also need to be careful with what happens if a PR guest puts that instruction in, does that stop its HV guest/host ? What if it's done in userspace ? Does that stop the kernel ? :-)

Maddy, I haven't checked, does your patch ensure that we only ever stop if the instruction is at a recorded bkpt address ? It still means that a userspace process can practically DoS its kernel by issuing a lot of these, causing a crapload of exits.

Cheers, Ben.

 Alex

  I'm trying to see if I can get the architect to set one in stone in a future proof way.
  Cheers, Ben.
Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows
On Thu, 2014-06-05 at 17:25 +1000, Alexey Kardashevskiy wrote: +This creates a virtual TCE (translation control entry) table, which +is an IOMMU for PAPR-style virtual I/O. It is used to translate +logical addresses used in virtual I/O into guest physical addresses, +and provides a scatter/gather capability for PAPR virtual I/O. + +/* for KVM_CAP_SPAPR_TCE_64 */ +struct kvm_create_spapr_tce_64 { + __u64 liobn; + __u64 window_size; + __u64 bus_offset; + __u32 page_shift; + __u32 flags; +}; + +The liobn field gives the logical IO bus number for which to create a +TCE table. The window_size field specifies the size of the DMA window +which this TCE table will translate - the table will contain one 64 +bit TCE entry for every IOMMU page. The bus_offset field tells where +this window is mapped on the IO bus. Hrm, the bus_offset cannot be set arbitrarily, it has some pretty strong HW limits depending on the type of bridge architecture version... Do you plan to have that knowledge in qemu ? Or do you have some other mechanism to query it ? (I might be missing a piece of the puzzle here). Also one thing I've been pondering ... We'll end up wasting a ton of memory with those TCE tables. If you have 3 PEs mapped into a guest, it will try to create 3 DDW's mapping the entire guest memory and so 3 TCE tables large enough for that ... and which will contain exactly the same entries ! We really want to look into extending PAPR to allow the creation of table aliases so that the guest can essentially create one table and associate it with multiple PEs. We might still decide to do multiple copies for NUMA reasons but no more than one per node for example... at least we can have the policy in qemu/kvm. Also, do you currently require allocating a single physically contiguous table or do you support TCE trees in your implementation ? Cheers, Ben. 
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows
On Thu, 2014-06-05 at 19:26 +1000, Alexey Kardashevskiy wrote: No trees yet. For 64GB window we need (64 << 30)/(16 << 20)*8 = 32K TCE table. Do we really need trees? The above is assuming hugetlbfs backed guests. These are the least of my worry indeed. But we need to deal with 4k and 64k guests. Cheers, Ben -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows
On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote: What if we ask user space to give us a pointer to user space allocated memory along with the TCE registration? We would still ask user space to only use the returned fd for TCE modifications, but would have some nicely swappable memory we can store the TCE entries in. That isn't going to work terribly well for VFIO :-) But yes, for emulated devices, we could improve things a bit, including for the 32-bit TCE tables. For emulated, the real mode path could walk the page tables and fallback to virtual mode get_user if the page isn't present, thus operating directly on qemu memory TCE tables instead of the current pinned stuff. However that has a cost in performance, but since that's really only used for emulated devices and PAPR VIOs, it might not be a huge issue. But for VFIO we don't have much choice, we need to create something the HW can access. In fact, the code as is today can allocate an arbitrary amount of pinned kernel memory from within user space without any checks. Right. We should at least account it in the locked limit. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 2/3] powerpc/eeh: EEH support for VFIO PCI device
On Thu, 2014-06-05 at 16:36 +1000, Gavin Shan wrote: +#define EEH_OPT_GET_PE_ADDR 0 /* Get PE addr */ +#define EEH_OPT_GET_PE_MODE 1 /* Get PE mode */ I assume that's just some leftover from the previous patches :-) Don't respin just yet, let's see what other comments come in. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE
On Tue, 2014-06-03 at 09:45 +0200, Alexander Graf wrote: For EEH it could as well be a dumb eventfd - really just a side channel that can tell user space that something happened asynchronously :). Which the host kernel may have no way to detect without actively poking at the device (fences in powernv or anything in PAPR host) and the only user of this for now has no use for. I insist don't bother. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: powerpc/pseries: Use new defines when calling h_set_mode
On Thu, 2014-05-29 at 23:27 +0200, Alexander Graf wrote: On 29.05.14 09:45, Michael Neuling wrote: +/* Values for 2nd argument to H_SET_MODE */ +#define H_SET_MODE_RESOURCE_SET_CIABR 1 +#define H_SET_MODE_RESOURCE_SET_DAWR 2 +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE 3 +#define H_SET_MODE_RESOURCE_LE 4 Much better, but I think you want to make use of these in non-kvm code too, no? At least the LE one is definitely already implemented as a call :) Sure but that's a different patch below. Ben, how would you like to handle these 2 patches? If you give me an ack I can just put this patch into my kvm queue. Alternatively we could both carry a patch that adds the H_SET_MODE header bits only and whoever hits Linus' tree first wins ;). No biggie. Worst case it's a trivial conflict. Cheers, Ben. Alex Mikey powerpc/pseries: Use new defines when calling h_set_mode Now that we define these in the KVM code, use these defines when we call h_set_mode. No functional change. Signed-off-by: Michael Neuling mi...@neuling.org -- This depends on the KVM h_set_mode patches. diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 12c32c5..67859ed 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, unsigned long resource, static inline long enable_reloc_on_exceptions(void) { /* mflags = 3: Exceptions at 0xC000_0000_0000_4000 */ - return plpar_set_mode(3, 3, 0, 0); + return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0); } /* @@ -284,7 +284,7 @@ static inline long enable_reloc_on_exceptions(void) * returns H_SUCCESS.
*/ static inline long disable_reloc_on_exceptions(void) { - return plpar_set_mode(0, 3, 0, 0); + return plpar_set_mode(0, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0); } /* @@ -297,7 +297,7 @@ static inline long disable_reloc_on_exceptions(void) { static inline long enable_big_endian_exceptions(void) { /* mflags = 0: big endian exceptions */ - return plpar_set_mode(0, 4, 0, 0); + return plpar_set_mode(0, H_SET_MODE_RESOURCE_LE, 0, 0); } /* @@ -310,17 +310,17 @@ static inline long enable_big_endian_exceptions(void) static inline long enable_little_endian_exceptions(void) { /* mflags = 1: little endian exceptions */ - return plpar_set_mode(1, 4, 0, 0); + return plpar_set_mode(1, H_SET_MODE_RESOURCE_LE, 0, 0); } static inline long plapr_set_ciabr(unsigned long ciabr) { - return plpar_set_mode(0, 1, ciabr, 0); + return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_CIABR, ciabr, 0); } static inline long plapr_set_watchpoint0(unsigned long dawr0, unsigned long dawrx0) { - return plpar_set_mode(0, 2, dawr0, dawrx0); + return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0); } #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote: I will remove those address related macros in next revision because it's user-level business, not related to host kernel any more. If the user is QEMU + guest, we need the address to identify the PE though PHB BUID could be used as same purpose to get PHB, which is one-to-one mapping with IOMMU group on sPAPR platform. However, once the PE address is built and returned to guest, guest will use the PE address as input parameter in subsequent RTAS calls. If the user is some kind of application who just uses the ioctl() without supporting RTAS calls, we don't need to care about PE address. I am a bit reluctant with that PE==PHB equation we seem to be introducing. This isn't the case in HW. It's possible that this is how we handle VFIO *today* in qemu but it doesn't have to be, does it? It also paints us into a corner if we want to start implementing some kind of emulated EEH for selected emulated devices and/or virtio. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
On Thu, 2014-05-29 at 10:05 +1000, Gavin Shan wrote: The log stuff is TBD and I'll figure it out later. As to what the errors are, there are a lot. Most of them are related to hardware level, for example unstable PCI link. Usually, those error bits defined in AER fatal error state register contribute to EEH errors. It could be software related, e.g. violating IOMMU protection (read/write permission etc), or even one PCI device isn't capable of DMAing. Hopefully, it's the explanation you're looking for? :-) Note to Alex('s) ... The log we get from FW at the moment in the host is: - In the case of pHyp / RTAS host, opaque. Basically it's a blob that we store and that can be sent to IBM service people :-) Not terribly practical. - On PowerNV, it's an IODA-specific data structure (basically a dump of a bunch of register state and tables). IODA is our IO architecture (sadly the document itself isn't public at this point) and we have two versions, IODA1 and IODA2. You can consider the structure as chipset specific basically. What I want to do in the long run is: - In the case of pHyp/RTAS host, there's not much we can do, so basically forward that log as-is. - In the case of PowerNV, forward the log *and* add a well-defined blob to it that does some basic interpretation of it. In fact I want to do the latter more generally in the host kernel for host kernel errors as well, but we can forward it to the guest via VFIO too. What I mean by interpretation is something along the lines of an error type DMA IOMMU fault, MMIO error, Link loss, PCIe UE, ... among a list of well known error types that represent the most common ones, with a little bit of added info when available (for most DMA errors we can provide the DMA address that faulted for example). So Gavin and I need to work a bit on that, both in the context of the host kernel to improve the reporting there, and in the context of what we pass to user space. However, no driver today cares about that information.
The PCIe error recovery system doesn't carry it and it has no impact on the EEH recovery procedures, so EEH in that sense is useful without that reporting. It's useful for the programmer (or user/admin) to identify what went wrong but it's not used by the automated recovery process. One last thing to look at is in the case of a VFIO device, we might want to silence the host kernel printf's once we support guest EEH since otherwise the guest has a path to flood the host kernel log by triggering a lot of EEH errors purposefully. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
On Tue, 2014-05-27 at 12:15 -0600, Alex Williamson wrote: +/* + * Reset is the major step to recover problematic PE. The following + * command helps on that. + */ +struct vfio_eeh_pe_reset { + __u32 argsz; + __u32 flags; + __u32 option; +#define VFIO_EEH_PE_RESET_DEACTIVATE 0 /* Deactivate reset */ +#define VFIO_EEH_PE_RESET_HOT 1 /* Hot reset */ +#define VFIO_EEH_PE_RESET_FUNDAMENTAL 3 /* Fundamental reset */ How does a user know which of these to use? The usual way is the driver asks for one or the other, this plumbs back into the guest EEH code which itself plumbs into the PCIe error recovery framework in Linux. However I do have a question for Gavin here: Why do we expose an explicit deactivate ? The reset functions should do the whole reset sequence (assertion, delay, deassertion). In fact the firmware doesn't really give you a choice for PERST right ? Or do we have a requirement to expose both phases for RTAS? (In that case I'm happy to ignore the deassertion there too). Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
On Tue, 2014-05-27 at 14:37 -0600, Alex Williamson wrote: The usual way is the driver asks for one or the other, this plumbs back into the guest EEH code which itself plumbs into the PCIe error recovery framework in Linux. So magic? Yes. The driver is expected to more or less know what kind of reset it wants for its device. Ideally hot reset is sufficient but some drivers know that the device they drive is crappy enough that it mostly ignores hot reset and really needs a PERST for example... Also we have other reasons to expose those interfaces outside of EEH. For example, some drivers might want to specifically trigger a PERST after a microcode update. IE. There are paths outside of EEH error recovery where drivers in the guest might want to trigger a reset to the device and they have control under some circumstances on which kind of reset they are doing (and the guest Linux does have different code paths to do a hot reset vs. a fundamental reset). So we need to expose that distinction to be able to honor the guest decision. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device
On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote: There's no notification, the user needs to observe the return value and poll? Should we be enabling an eventfd to notify the user of the state change? Yes. The user needs to monitor the return value. We should have one notification, but it's for later as we discussed :-) ../.. How does the guest learn about the error? Does it need to? When the guest detects 0xFF's from reading PCI config space or IO, it's going to check the device (PE) state. If the device (PE) has been put into frozen state, the recovery will be started. Quick recap for Alex W (we discussed that with Alex G). While a notification looks like a worthwhile addition in the long run, it is not sufficient and not used today and I prefer that we keep that as something to add later for those two main reasons: - First, the kernel itself isn't always notified. For example, if we implement on top of an RTAS backend (PR KVM under pHyp) or if we are on top of PowerNV but the error is a PHB fence (the entire PCI Host bridge gets fenced out in hardware due to an internal error), then we get no notification. Only polling of the hardware or firmware will tell us. Since we don't want to have a polling timer in the kernel, that means that the userspace client of VFIO (or alternatively the KVM guest) is the one that polls. - Second, this is how our primary user expects it: The primary (and only initial) user of this will be qemu/KVM for PAPR guests and they don't have a notification mechanism. Instead they query the EEH state after detecting an all 1's return from MMIO or config space. This is how PAPR specifies it so we are just implementing the spec here :-) Because of these, I think we shouldn't worry too much about notification at this stage. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO
On Wed, 2014-05-21 at 08:23 +0200, Alexander Graf wrote: Note to Alex: This definitely kills the notifier idea for now though, at least as a first class citizen of the design. We can add it as an optional optimization on top later. I don't think it does. The notifier would just get triggered on config space read failures for example :). It's really just an aid for the vfio user to have a common code path for error handling. I'll let Gavin make the final call on that one, if he thinks we can reliably trigger it and there isn't too much code churn as a consequence. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
On Wed, 2014-05-21 at 15:07 +0200, Alexander Graf wrote: +#ifdef CONFIG_VFIO_PCI_EEH +int eeh_vfio_open(struct pci_dev *pdev) Why vfio? Also that config option will not be set if vfio is compiled as a module. +{ + struct eeh_dev *edev; + + /* No PCI device ? */ + if (!pdev) + return -ENODEV; + + /* No EEH device ? */ + edev = pci_dev_to_eeh_dev(pdev); + if (!edev || !edev->pe) + return -ENODEV; + + eeh_dev_set_passed(edev, true); + eeh_pe_set_passed(edev->pe, true); + + return 0; +} +EXPORT_SYMBOL_GPL(eeh_vfio_open); Additionally, shouldn't we have some locking here ? (and in release too) I don't like relying on the caller locking (if it does it at all). + /* Device existing ? */ + ret = eeh_vfio_check_dev(pdev, edev, pe); + if (ret) { + pr_debug("%s: Cannot find device %s\n", + __func__, pdev ? pci_name(pdev) : NULL); + *retval = -7; What are these? Please use proper kernel internal return values for errors. I don't want to see anything even remotely tied to RTAS in any of these patches. Hint: -ENODEV Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE
On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote: .../... I think what you want is an irqfd that the in-kernel eeh code notifies when it sees a failure. When such an fd exists, the kernel skips its own error handling. Yeah, it's a good idea and something for me to improve in phase II. We can discuss for more later. For now, what I have in my head is something like this: However, this would be a deviation from (or extension of) PAPR. At the moment, the way things work in PAPR is that the guest is responsible for querying the EEH state when something looks like an error (ie, getting ff's back). This is also how it works in pHyp. We have an interrupt path in the host when doing native EEH, and it would be nice to extend PAPR to also be able to shoot an event to the guest possibly using RTAS events, but let's get the basics working and upstream first. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE
On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote: Instead of if (passed_flag) return; you would do if (trigger_irqfd) { trigger_irqfd(); return; } which would be a much nicer, generic interface. But that's not how PAPR works. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE
On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote: So how about we just implement this whole thing properly as irqfd? Whether QEMU can actually do anything with the interrupt is a different question - we can leave it be for now. But we could model all the code with the assumption that it should either handle the error itself or trigger an irqfd write. I don't object to the idea... however this smells of Deja Vu... You often tend to want to turn something submitted that fills a specific gap and implements a specific spec/function into some kind of idealized grand design :-) And that means nothing gets upstream for weeks or months as we churn and churn... Sometimes it's probably worth it. Here I would argue against it and would advocate for doing the basic functionality first, as it is used by guests, and later add the irqfd option. I don't see any emergency here and adding the irqfd will not cause fundamental design changes: The passed flag (though I'm not a fan of the name) is really something we want in the low level handlers to avoid triggering host side EEH in various places, regardless of whether we use irqfd or not. This is totally orthogonal from the mechanism used for notifications. Even in the host, the detection path doesn't always involve interrupts, and we can detect some things as a result of a host side config space access for example etc... So let's keep things nice and separate here. The interrupt notification is just an optimization which will speed up discovery of the error in *some* cases later on (but adds its own complexity since we have multiple discovery paths in the host, so we need to keep track whether we have notified yet or not etc...) so let's keep it for later. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO
On Tue, 2014-05-20 at 14:25 +0200, Alexander Graf wrote: - Move eeh-vfio.c to drivers/vfio/pci/ - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which is arch/powerpc/platforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call Hrm, I think it'd be nicer to just export individual functions that do the things you want to do from eeh.c. We already have an eeh_ops backend system with callbacks since we have different backends for RTAS and powernv, so we could do what you suggest but it would probably just boil down to wrappers around the EEH ops. No big opinion either way on my side though. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO
On Tue, 2014-05-20 at 22:39 +1000, Gavin Shan wrote: Yeah. How about this? :-) - Move eeh-vfio.c to drivers/vfio/pci/ - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which is arch/powerpc/platforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call Hrm, I think it'd be nicer to just export individual functions that do the things you want to do from eeh.c. Ok. Got it. Thanks for your comments :) The interesting thing with this approach is that VFIO per-se can work with the EEH RTAS backend too in the host. IE, with PR KVM for example or with non-KVM uses of VFIO, it would be possible to use a device in a user process and exploit EEH even when running under a PAPR hypervisor. That is, vfio-eeh uses generic exported EEH APIs from the EEH core that will work on both powernv and RTAS backends. Note to Alex: This definitely kills the notifier idea for now though, at least as a first class citizen of the design. We can add it as an optional optimization on top later. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] powerpc/eeh: Info to trace passed devices
On Mon, 2014-05-19 at 14:46 +0200, Alexander Graf wrote: I don't see the point of VFIO knowing about guest addresses. They are not unique across a system and the whole idea that a VFIO device has to be owned by a guest is also pretty dubious. I suppose what you really care about here is just a token for a specific device? But why do you need one where we don't have tokens yet? I think this is going to be needed when doing in-kernel acceleration of some of the RTAS calls, but yes, Gavin, why do we need that now ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] powerpc: Extend syscall ppc_rtas()
On Mon, 2014-05-19 at 14:55 +0200, Alexander Graf wrote:
> On 14.05.14 06:12, Gavin Shan wrote:
> > Originally, syscall ppc_rtas() can be used to invoke RTAS call from user
> > space. Utility errinjct is using it to inject various errors to the
> > system for testing purpose.
> >
> > The patch intends to extend the syscall to support both pSeries and
> > PowerNV platform. With that, RTAS and OPAL call can be invoked from user
> > space. In turn, utility errinjct can be supported on pSeries and PowerNV
> > platform at same time.
> >
> > The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
> > which calls ppc_call_rtas() or ppc_call_opal() depending on the running
> > platform. The data transported between userland and kerenl is
>
> Please fix your spelling of kernel.
>
> > by struct rtas_args. It's platform specific on how to use the data.
> >
> > Signed-off-by: Mike Qiu <qiud...@linux.vnet.ibm.com>
> > Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>
> I think the basic idea to maintain the same interface between PAPR and
> OPAL to user space is sound, but this is really Ben's call.

Yeah, that worries me a bit; RTAS and OPAL are completely different beasts. We can keep that error injection separate from the rest of the EEH enablement for now. I'll look at it when I get a chance.

Cheers,
Ben.
Re: [PATCH 8/8] powerpc/powernv: Error injection infrastructure
On Mon, 2014-05-19 at 15:04 +0200, Alexander Graf wrote:
> On 14.05.14 06:12, Gavin Shan wrote:
> > The patch intends to implement the error injection infrastructure for
> > PowerNV platform. The predetermined handlers will be called according to
> > the type of injected error (e.g. OpalErrinjctTypeIoaBusError). For now,
> > we just support PCI error injection. We need support injecting other
> > types of errors in future.
>
> Your token to a VFIO device is the VFIO fd. If you want to inject an error
> into that device, you should do it via that token. That gets you all
> permission problems solved for free.
>
> But I still didn't quite grasp why you need to do this. Why do we need to
> inject an error into a device via OPAL when we want to do EEH inside of a
> guest? Are you trying to emulate guest side error injection?

Yes, that's what he's trying to do, but let's keep that separate from the core EEH. I'd like the latter to be reviewed/fixed and upstream first, then we can look at guest side injection.

Cheers,
Ben.
Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO
On Mon, 2014-05-19 at 16:33 -0600, Alex Williamson wrote:
> On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:
> > The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container to
> > support EEH functionality for PCI devices, which have been passed from
> > host to guest via VFIO.
>
> Some comments throughout, but overall this seems to forgo every bit of the
> device ownership and protection model used by VFIO and lets the user pick
> arbitrary host devices and do various operations, mostly unchecked.
> That's not acceptable.

Right, I don't understand why we need that mapping for base EEH. The VFIO fd owns the PE; it can do operations on that PE, and it doesn't need to know the guest token/address.

Only when we start doing in-kernel acceleration of RTAS do we need that sort of mapping established, which we'll be able to do via something akin to what we did with TCEs to ensure we validate that we have ownership of the physical device.

But let's proceed step by step. First, have something that is strictly EEH from qemu RTAS to VFIO via ioctl, and that requires no mapping of any sort.

Cheers,
Ben.
> > Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/platforms/powernv/Makefile   |   1 +
> >  arch/powerpc/platforms/powernv/eeh-vfio.c | 593 ++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c       |  12 +
> >  include/uapi/linux/vfio.h                 |  57 +++
> >  4 files changed, 663 insertions(+)
> >  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
> >
> > diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> > index 63cebb9..2b15a03 100644
> > --- a/arch/powerpc/platforms/powernv/Makefile
> > +++ b/arch/powerpc/platforms/powernv/Makefile
> > @@ -6,5 +6,6 @@ obj-y += opal-msglog.o
> >  obj-$(CONFIG_SMP) += smp.o
> >  obj-$(CONFIG_PCI) += pci.o pci-p5ioc2.o pci-ioda.o
> >  obj-$(CONFIG_EEH) += eeh-ioda.o eeh-powernv.o
> > +obj-$(CONFIG_VFIO_EEH) += eeh-vfio.o
> >  obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
> >  obj-$(CONFIG_MEMORY_FAILURE) += opal-memory-errors.o
> > diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c b/arch/powerpc/platforms/powernv/eeh-vfio.c
> > new file mode 100644
> > index 000..69d5f2d
> > --- /dev/null
> > +++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
> > @@ -0,0 +1,593 @@
> > +/*
> > + * The file intends to support EEH functionality for those PCI devices,
> > + * which have been passed through from host to guest via VFIO. So this
> > + * file is naturally part of VFIO implementation on PowerNV platform.
> > + *
> > + * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/io.h>
> > +#include <linux/irq.h>
> > +#include <linux/kernel.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/msi.h>
> > +#include <linux/pci.h>
> > +#include <linux/string.h>
> > +#include <linux/vfio.h>
> > +
> > +#include <asm/eeh.h>
> > +#include <asm/eeh_event.h>
> > +#include <asm/io.h>
> > +#include <asm/iommu.h>
> > +#include <asm/opal.h>
> > +#include <asm/msi_bitmap.h>
> > +#include <asm/pci-bridge.h>
> > +#include <asm/ppc-pci.h>
> > +#include <asm/tce.h>
> > +#include <asm/uaccess.h>
> > +
> > +#include "powernv.h"
> > +#include "pci.h"
> > +
> > +static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
> > +{
> > +	struct pci_bus *bus, *pe_bus;
> > +	struct pci_dev *pdev;
> > +	struct eeh_dev *edev;
> > +	struct eeh_pe *pe;
> > +	int domain, bus_no, devfn;
> > +
> > +	/* Host address */
> > +	domain = info->map.host_domain;
> > +	bus_no = (info->map.host_cfg_addr >> 8) & 0xff;
> > +	devfn = info->map.host_cfg_addr & 0xff;
>
> Where are we validating that the user has any legitimate claim to be
> touching this device?
>
> > +	/* Find PCI bus */
> > +	bus = pci_find_bus(domain, bus_no);
> > +	if (!bus) {
> > +		pr_warn("%s: PCI bus %04x:%02x not found\n",
> > +			__func__, domain, bus_no);
> > +		return -ENODEV;
> > +	}
> > +
> > +	/* Find PCI device */
> > +	pdev = pci_get_slot(bus, devfn);
> > +	if (!pdev) {
> > +		pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
> > +			__func__, domain, bus_no,
> > +			PCI_SLOT(devfn), PCI_FUNC(devfn));
> > +		return -ENODEV;
> > +	}
> > +
> > +	/* No EEH device - almost impossible */
> > +	edev = pci_dev_to_eeh_dev(pdev);
> > +	if (unlikely(!edev)) {
> > +		pci_dev_put(pdev);
> > +		pr_warn("%s: No EEH dev for PCI device %s\n",
> > +			__func__, pci_name(pdev));
> > +		return -ENODEV;
> > +	}
> > +
> > +	/* Doesn't support PE migration between different PHBs */
> > +	pe = edev->pe;
> > +	if (!eeh_pe_passed(pe)) {
> > +		pe_bus = eeh_pe_bus_get(pe);
> > +		BUG_ON(!pe_bus);
>
> Can a user trigger …
Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest
On Tue, 2014-05-06 at 08:56 +0200, Alexander Graf wrote:
> > For the error injection, I guess I have to put the logic token management
> > into QEMU and error injection request will be handled by QEMU and then
> > routed to host kernel via additional syscall as we did for pSeries.
>
> Yes, start off without in-kernel XICS so everything simply lives in QEMU.
> Then add callbacks into the in-kernel XICS to inject these interrupts if
> we don't have wide enough interfaces already.

It's got nothing to do with XICS ... :-)

But yes, we can route everything via qemu for now. Then we'll need at least one of the calls to have a direct path, and we should probably strive to even make it real mode if that's possible; it's the one that Linux will call whenever an MMIO returns all f's to check if the underlying PE is frozen. But we can do that as a second stage. In fact, going via VFIO ioctls does make the whole security and translation model much simpler initially.

Cheers,
Ben.
Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> > > Isn't this a greater problem? We should start swapping before we hit
> > > the point where non movable kernel allocation fails, no?
> >
> > Possibly, but the fact remains: this can be avoided by making sure that
> > if we create a CMA reserve for KVM, then it uses it rather than using
> > the rest of main memory for hash tables.
>
> So why were we preferring non-CMA memory before? Considering that Aneesh
> introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

> > The fact that KVM uses a good number of normal kernel pages is maybe
> > suboptimal, but shouldn't be a critical problem. The point is that we
> > explicitly reserve those pages in CMA for use by KVM for that specific
> > purpose, but the current code tries first to get them out of the normal
> > pool. This is not an optimal behaviour and is what Aneesh's patches are
> > trying to fix.
>
> I agree, and I agree that it's worth it to make better use of our
> resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy story :-)

> However, reading through this thread I think I've slowly grasped what the
> problem is. The hugetlbfs size calculation.

Not really. I guess something in your stack overreserves huge pages because it doesn't account for the fact that some part of system memory is already reserved for CMA. Either that, or simply Linux runs out because we dirty too fast... Really, Linux has never been good at dealing with OOM situations, especially when things like network drivers and filesystems try to do ATOMIC or NOIO allocs...

> So the underlying problem is something completely orthogonal. The patch
> body as is is fine, but the patch description should simply say that we
> should prefer the CMA region because it's already reserved for us for this
> purpose and we make better use of our available resources that way.

No. We give a chunk of memory to hugetlbfs; it's all good and fine. Whatever remains is split between CMA and the normal page allocator.

Without Aneesh's latest patch, when creating guests, KVM starts allocating its hash tables from the latter instead of CMA (we never allocate from the hugetlb pool afaik; only guest pages do that, not hash tables).

So we exhaust the page allocator and get Linux into OOM conditions while there's plenty of space in CMA. But the kernel cannot use CMA for its own allocations, only to back user pages, which we don't care about because our guest pages are covered by our hugetlb reserve :-)

All the bits about pinning, numa, libvirt and whatnot don't really matter and are just details that led Aneesh to find this non-optimal allocation.

Cheers,
Ben.
Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
> So if I understand this patch correctly, it simply introduces logic to
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page
> size field in the HPTE. Mind to explain why exactly that enables us to
> use THP? What exactly is the flow if the pages are not backed by huge
> pages? What is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly decode the page size requested by the guest, if anything, to issue the right form of tlbie instruction. The encoding in the HPTE for a 16M page inside a 64K segment is different from the encoding for a 16M page in a 16M segment; this is done so that the encoding carries both pieces of information, which allows broadcast tlbie to properly find the right set in the TLB for invalidations, among others.

So from a KVM perspective, we don't know whether the guest is doing THP or something else (Linux calls it THP, but all we care about here is that this is MPSS; another guest than Linux might exploit that differently). What we do know is that if we advertise MPSS, we need to decode the page sizes encoded in the HPTE so that we know what we are dealing with in H_ENTER and can do the appropriate TLB invalidations in H_REMOVE and evictions.

> > +		if (a_size != -1)
> > +			return 1ul << mmu_psize_defs[a_size].shift;
> > +		}
> > +	}
> > +	return 0;
> > }
> >
> > static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8227dba5af0f..a38d3289320a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
> > 	 * support pte_enc here
> > 	 */
> > 	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
> > +	/*
> > +	 * Add 16MB MPSS support
> > +	 */
> > +	if (linux_psize != MMU_PAGE_16M) {
> > +		(*sps)->enc[1].page_shift = 24;
> > +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> > +	}
>
> So this basically indicates that every segment (except for the 16MB one)
> can also handle 16MB MPSS page sizes? I suppose you want to remove the
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS
> here.

I haven't reviewed the code there; make sure it will indeed do a different encoding for every combination of segment/actual page size.

> Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However, 970 doesn't do MPSS, so let's make sure we get that right.

Cheers,
Ben.
Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
On Tue, 2014-05-06 at 21:38 +0530, Aneesh Kumar K.V wrote:
> I updated the commit message as below. Let me know if this is ok.
>
> KVM: PPC: BOOK3S: HV: THP support for guest

This has nothing to do with THP.

> THP support in guest depends on KVM advertising the MPSS feature. We
> already have the rest of the changes needed to support transparent huge
> pages upstream. (We do support THP with PowerVM LPAR already.) The primary
> motivation of this patch is to enable THP in powerkvm guest.

I would argue (nit picking, I know ... :-) that the subject should be "Enable MPSS support for guests", and the description can then explain that this allows Linux guests to use THP.

Cheers,
Ben.

> On recent IBM Power CPUs, while the hashed page table is looked up using
> the page size from the segmentation hardware (i.e. the SLB), it is
> possible to have the HPT entry indicate a larger page size. Thus for
> example it is possible to put a 16MB page in a 64kB segment, but since
> the hash lookup is done using a 64kB page size, it may be necessary to
> put multiple entries in the HPT for a single 16MB page. This capability
> is called mixed page-size segment (MPSS).
>
> With MPSS, there are two relevant page sizes: the base page size, which
> is the size used in searching the HPT, and the actual page size, which is
> the size indicated in the HPT entry. [Note that the actual page size is
> always >= base page size.]
>
> We advertise the MPSS feature to the guest only if the host CPU supports
> the same. We use the "ibm,segment-page-sizes" device tree node to
> advertise the MPSS support. The penc encoding indicates whether we
> support a specific combination of base page size and actual page size in
> the same segment. It is also the value used in the L|LP encoding of the
> HPTE entry.
>
> > In order to support MPSS in the guest, KVM needs to handle the below
> > details:
> >  * advertise MPSS via "ibm,segment-page-sizes"
> >  * decode the base and actual page size correctly from the HPTE entry
> >    so that we know what we are dealing with in H_ENTER and can do
>
> Which code path exactly changes for H_ENTER?

There is no real code path change. Any code path that uses hpte_page_size() is impacted. We return the actual page size there.

> >    the appropriate TLB invalidation in H_REMOVE and evictions.
>
> Apart from the grammar (which is pretty broken for the part that is not
> copied from Paul) and the subject line this sounds quite reasonable.

Will try to fix.

-aneesh
Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
On Mon, 2014-05-05 at 19:56 +0530, Aneesh Kumar K.V wrote:
> Paul mentioned that BOOK3S always had the DAR value set on alignment
> interrupt. And the patch is to enable/collect the correct DAR value when
> running with a Little Endian PR guest. Now to limit the impact and to
> enable Little Endian PR guest, I ended up doing the conditional code only
> for book3s 64, for which we know for sure that we set the DAR value.

Only Book3S? AFAIK, the kernel align.c unconditionally uses DAR on every processor type. It's DSISR that may or may not be populated, but AFAIK DAR always is.

Cheers,
Ben.
Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> Isn't this a greater problem? We should start swapping before we hit the
> point where non movable kernel allocation fails, no?

Possibly, but the fact remains: this can be avoided by making sure that if we create a CMA reserve for KVM, then it uses it rather than using the rest of main memory for hash tables.

The fact that KVM uses a good number of normal kernel pages is maybe suboptimal, but shouldn't be a critical problem. The point is that we explicitly reserve those pages in CMA for use by KVM for that specific purpose, but the current code tries first to get them out of the normal pool. This is not optimal behaviour and is what Aneesh's patches are trying to fix.

Cheers,
Ben.