Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra jav...@guerrag.com wrote:

> > I think that the major difference between sheepdog and cluster file
> > systems such as Google File System, pNFS, etc. is the interface
> > between clients and the storage system.
>
> note that GFS is Global File System (written by Sistina (the same folks
> from LVM) and bought by RedHat). Google File System is a different
> thing, and ironically its client/storage interface is a little more
> like sheepdog's and unlike a regular cluster filesystem's.

Hmm, did Avi refer to Global File System? I wasn't sure; 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system, which is a
completely different architecture from Sheepdog.

> > Sheepdog uses consistent hashing to decide where objects are stored;
> > I/O load is balanced across the nodes. When a new node is added or an
> > existing node is removed, the hash table changes and data is
> > automatically and transparently moved between nodes. We plan to
> > implement a mechanism to distribute data not randomly but
> > intelligently; we could use machine load, the locations of VMs, etc.
>
> i don't have much hands-on experience on consistent hashing; but it
> sounds reasonable to make each node's ring segment proportional to its
> storage capacity.

Yeah, that's one of the techniques, I think.

> dynamic load balancing seems a tougher nut to crack, especially while
> keeping all clients' mapping consistent.

There are some techniques to do that. We think that some existing
techniques can distribute data intelligently; we just have not analyzed
the options yet.

> i'd just want to add my '+1 votes' on both getting rid of the JVM
> dependency and using block devices (usually LVM) instead of ext3/btrfs

LVM doesn't fit our requirements nicely. What we need is a way to update
objects atomically. We could implement that ourselves, but we prefer to
keep our code simple by using an existing mechanism.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] add frontend implementation for the IOMMU API
On Fri, 28 Nov 2008 12:31:29 +0100
Joerg Roedel [EMAIL PROTECTED] wrote:

> On Fri, Nov 28, 2008 at 06:40:41PM +0900, FUJITA Tomonori wrote:
> > On Thu, 27 Nov 2008 16:40:48 +0100
> > Joerg Roedel [EMAIL PROTECTED] wrote:
> > > Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
> > > ---
> > >  drivers/base/iommu.c |   94 ++++++++++++++++++++++++++++++++++++++
> > >  1 files changed, 94 insertions(+), 0 deletions(-)
> > >  create mode 100644 drivers/base/iommu.c
> > >
> > > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> > > new file mode 100644
> > > index 000..7250b9c
> > > --- /dev/null
> > > +++ b/drivers/base/iommu.c
> >
> > Hmm, why is this at drivers/base/? Can anyone except for KVM use this?
> > If not, is somewhere under virt/ more appropriate?
>
> I don't see a reason why this should be KVM specific. KVM is the only
> user for now, but it could be used for e.g. UIO too, or in drivers to
> speed up devices that have bad performance when they do scatter-gather
> I/O.

If there are users other than KVM, then it should be fine, I guess. Can
you add such information (e.g. who could use this) to the patch
description? It should be in the git log if the patch is merged.

> > Don't the names (include/linux/iommu.h, iommu.c, iommu_ops, etc.)
> > look too generic? We already have lots of similar things (e.g.
> > arch/{x86,ia64}/asm/iommu.h and several archs' iommu.c), and such
> > names are expected to be usable by all the IOMMUs.
>
> The API is already useful for more than KVM. I also plan to extend it
> to support more types of IOMMUs than VT-d and AMD IOMMU in the future.
> But those changes are more intrusive than this patchset and need more
> discussion. I prefer to take small steps in this direction.

Can you be more specific? What IOMMU could use this? For example, how
could GART use this? I think people expect the name 'struct iommu_ops'
to be an abstraction for all the IOMMUs (or at least the majority). If
it works like that, the name is a good choice, I think.
Re: [PATCH 3/9] add frontend implementation for the IOMMU API
On Mon, 1 Dec 2008 15:02:09 +0200
Muli Ben-Yehuda [EMAIL PROTECTED] wrote:

> On Mon, Dec 01, 2008 at 01:00:26PM +0100, Joerg Roedel wrote:
> > > > > Don't the names (include/linux/iommu.h, iommu.c, iommu_ops,
> > > > > etc.) look too generic? We already have lots of similar things
> > > > > (e.g. arch/{x86,ia64}/asm/iommu.h and several archs' iommu.c),
> > > > > and such names are expected to be usable by all the IOMMUs.
> > > >
> > > > The API is already useful for more than KVM. I also plan to
> > > > extend it to support more types of IOMMUs than VT-d and AMD IOMMU
> > > > in the future. But those changes are more intrusive than this
> > > > patchset and need more discussion. I prefer to take small steps
> > > > in this direction.
> > >
> > > Can you be more specific? What IOMMU could use this? For example,
> > > how could GART use this? I think people expect the name 'struct
> > > iommu_ops' to be an abstraction for all the IOMMUs (or at least the
> > > majority). If it works like that, the name is a good choice, I
> > > think.
> >
> > GART can't use exactly this. But with some extensions we can make it
> > useful for GART and GART-like IOMMUs too. For example, we can emulate
> > domains in GART by partitioning the GART aperture space.
>
> That would only work with a pvdma API, since GART doesn't support
> multiple address spaces, and you don't get the isolation properties of
> a real IOMMU, so... why would you want to do that?

If this works only for IOMMUs that support some kind of domain concept,
then I think a name like iommu_domain_ops is more appropriate.
Re: [PATCH 3/9] add frontend implementation for the IOMMU API
On Mon, 01 Dec 2008 16:33:11 +0200
Avi Kivity [EMAIL PROTECTED] wrote:

> Joerg Roedel wrote:
> > Hmm, is there any hardware IOMMU with which we can't emulate domains
> > by partitioning the I/O address space? This concept works for GART
> > and Calgary.
>
> Is partitioning secure? Domain X's user could program its hardware to
> DMA to domain Y's addresses, zapping away domain Y's user's memory.

It can't be secure. So what's the point of emulating domain partitioning
on the many traditional hardware IOMMUs that don't support it? Emulated
domain support combined with the DMA mapping debugging feature might be
useful for debugging drivers, but that doesn't mean we need to add
emulated domain support to every hardware IOMMU. If you add it to
swiotlb, everyone can enjoy the debugging.
Re: [PATCH 3/9] add frontend implementation for the IOMMU API
On Thu, 27 Nov 2008 16:40:48 +0100
Joerg Roedel [EMAIL PROTECTED] wrote:

> Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
> ---
>  drivers/base/iommu.c |   94 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 94 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/base/iommu.c
>
> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> new file mode 100644
> index 000..7250b9c
> --- /dev/null
> +++ b/drivers/base/iommu.c

Hmm, why is this at drivers/base/? Can anyone except for KVM use this?
If not, is somewhere under virt/ more appropriate?

Don't the names (include/linux/iommu.h, iommu.c, iommu_ops, etc.) look
too generic? We already have lots of similar things (e.g.
arch/{x86,ia64}/asm/iommu.h and several archs' iommu.c), and such names
are expected to be usable by all the IOMMUs.
Re: [PATCH 0/9][RFC] stackable dma_ops for x86
On Sun, 28 Sep 2008 20:49:26 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

> On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> > On Mon, 22 Sep 2008 20:21:12 +0200
> > Joerg Roedel [EMAIL PROTECTED] wrote:
> > > Hi,
> > >
> > > this patch series implements stackable dma_ops on x86. This is
> > > useful to be able to fall back to a different dma_ops
> > > implementation if one can not handle a particular device (as
> > > necessary for example with paravirtualized device passthrough or if
> > > a hardware IOMMU only handles a subset of available devices).
> >
> > We already handle the latter. This patchset is more flexible but
> > seems to incur more overhead. Will this feature be used only for
> > paravirtualized device passthrough? If so, I feel that there are
> > simpler (and more specific) solutions for it.
>
> It's not only for device passthrough. It also handles the cases where
> a hardware IOMMU does not handle all devices in the system (as in some
> Calgary systems, but also possible with AMD IOMMU). With this patchset
> we can handle these cases in a generic way without hacking it into the
> hardware drivers (these hacks are also in the AMD IOMMU code and I
> plan to remove them if this patchset is accepted).

I know that. As I wrote in the previous mail, we already solved that
problem with per-device dma_ops. My question is: what unsolved problems
can this patchset fix?

This patchset is named stackable dma_ops, but it's different from what
we discussed as stackable dma_ops. It provides IOMMUs a generic
mechanism to set up per-device dma_ops, but it doesn't solve the problem
of a hardware IOMMU that does not handle all devices (that was already
solved with per-device dma_ops). And if paravirtualized device
passthrough still needs to call multiple dma_ops, then this patchset
doesn't solve that issue either.
Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
On Mon, 29 Sep 2008 11:36:52 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

> On Mon, Sep 29, 2008 at 12:30:44PM +0300, Muli Ben-Yehuda wrote:
> > On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:
> > > I think we should try to build a paravirtualized IOMMU for KVM
> > > guests. It should work this way: we reserve a configurable amount
> > > of contiguous guest physical memory and map it DMA-contiguous using
> > > some kind of hardware IOMMU. This is possible with all hardware
> > > IOMMUs we have in the field by now, including Calgary and GART. The
> > > guest does dma_alloc_coherent allocations from this memory directly
> > > and is done. For map_single and map_sg the guest can do bounce
> > > buffering. We avoid nearly all pvdma hypercalls with this approach,
> > > keep guest swapping working, and also solve the problems with
> > > device dma_masks and guest memory that is not contiguous on the
> > > host side.
> >
> > I'm not sure I follow, but if I understand correctly, with this
> > approach the guest could only DMA into buffers that fall within the
> > range you allocated for DMA and mapped. Isn't that a pretty nasty
> > limitation? The guest would need to bounce-buffer every frame that
> > happened to not fall inside that range, with the resulting loss of
> > performance.
>
> The bounce buffering is needed for map_single/map_sg allocations. For
> dma_alloc_coherent we can allocate directly from that range. The
> performance loss of the bounce buffering may be lower than that of the
> hypercalls we need as the alternative (we need hypercalls for map,
> unmap, and sync).

Nobody cares about the performance of dma_alloc_coherent; only the
performance of map_single/map_sg matters. I'm not sure how expensive the
hypercalls are, but are they really more expensive than bounce buffering
copying lots of data on every I/O?
Re: [PATCH 0/9][RFC] stackable dma_ops for x86
On Mon, 29 Sep 2008 15:26:47 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

> On Mon, Sep 29, 2008 at 10:16:39PM +0900, FUJITA Tomonori wrote:
> > I know that. As I wrote in the previous mail, we already solved that
> > problem with per-device dma_ops. My question is: what unsolved
> > problems can this patchset fix?
> >
> > This patchset is named stackable dma_ops, but it's different from
> > what we discussed as stackable dma_ops. It provides IOMMUs a generic
> > mechanism to set up per-device dma_ops, but it doesn't solve the
> > problem of a hardware IOMMU that does not handle all devices (that
> > was already solved with per-device dma_ops). And if paravirtualized
> > device passthrough still needs to call multiple dma_ops, then this
> > patchset doesn't solve that issue either.
>
> Ok, the name stackable is misleading and was a bad choice. I will
> rename it to multiplexing. This should make it clearer what it is.
>
> Like you pointed out, the problems are solved with per-device dma_ops,
> but in the current implementation it needs special hacks in the IOMMU
> drivers to use these per-device dma_ops. I see this patchset as a
> continuation of the per-device dma_ops idea. It moves the per-device
> handling out of the specific drivers to a common place, so we can
> avoid or remove special hacks in the IOMMU drivers.

Basically, I'm not against this patchset. It simplifies the Calgary and
AMD IOMMU code that sets up per-device dma_ops (though it makes dma_ops
a bit complicated). But it doesn't solve any new problems, including
paravirtualized device passthrough. When I wrote per-device dma_ops, I
expected that KVM people would want more changes to dma_ops (such as
truly stackable dma_ops) for paravirtualized device passthrough. I'd
like to hear what they want first.
Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
On Mon, 22 Sep 2008 20:21:21 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

> This patch enables stackable dma_ops on x86. To do this, it also
> enables the per-device dma_ops on i386.
>
> Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
> ---
>  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
>  include/asm-x86/device.h      |    6 +++---
>  include/asm-x86/dma-mapping.h |   14 +++++++++++---
>  3 files changed, 36 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index b990fb6..2e517c2 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
>  	write_unlock_irqrestore(&dma_ops_list_lock, flags);
>  }
>
> +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> +{
> +	int i;
> +	unsigned long flags;
> +	struct dma_mapping_ops *entry, *ops = NULL;
> +
> +	read_lock_irqsave(&dma_ops_list_lock, flags);
> +
> +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> +			if (!entry->device_supported)
> +				continue;
> +			if (entry->device_supported(dev)) {
> +				ops = entry;
> +				goto out;
> +			}
> +		}
> +out:
> +	read_unlock_irqrestore(&dma_ops_list_lock, flags);

Hmm, every time we call map_sg/map_single, we call
read_lock_irqsave(&dma_ops_list_lock, flags). Isn't it likely that we'll
see a notable performance drop?
Re: [PATCH 0/9][RFC] stackable dma_ops for x86
On Mon, 22 Sep 2008 20:21:12 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

> Hi,
>
> this patch series implements stackable dma_ops on x86. This is useful
> to be able to fall back to a different dma_ops implementation if one
> can not handle a particular device (as necessary for example with
> paravirtualized device passthrough or if a hardware IOMMU only handles
> a subset of available devices).

We already handle the latter. This patchset is more flexible but seems
to incur more overhead. Will this feature be used only for
paravirtualized device passthrough? If so, I feel that there are simpler
(and more specific) solutions for it.
Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
On Wed, 30 Jul 2008 15:58:46 +0200
Andrea Arcangeli [EMAIL PROTECTED] wrote:

> On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
> > * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
> > > > I'm not so interested to go there right now, because while this
> > > > code is useful right now because the majority of systems out
> > > > there lack VT-d/IOMMU, I suspect this code could be nuked in the
> > > > long run when all systems ship with it, which is why I kept it
> > > > all
> > >
> > > Actually, at least on Intel platforms, and if you exclude the
> > > lowest end, VT-d has been shipping universally for quite some time
> > > now. If you buy an Intel box today or bought one in the last year,
> > > the chances are pretty high that it has VT-d support.
> >
> > I think you mean VT-x, which is virtualization extensions for the
> > x86 architecture. VT-d is virtualization extensions for devices
> > (IOMMU).
>
> I think Andi understood VT-d right. But even if he were right that
> every reader of this email who buys a new VT-x system today is almost
> guaranteed to get a VT-d motherboard (which I disagree with, unless
> you buy some really expensive toy), there are currently large
> installations of VT-x systems that lack VT-d; with recent dual- and
> quad-core CPUs they are very fast, they will be used for the next
> couple of years, and nobody will upgrade just the motherboard to use
> PCI passthrough.

Today, even very inexpensive desktops (for example, the Dell OptiPlex
755) have VT-d support.