Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread FUJITA Tomonori
On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra jav...@guerrag.com wrote:

  I think that the major difference between Sheepdog and cluster file
  systems such as the Google File System, pNFS, etc. is the interface
  between clients and the storage system.
 
 note that GFS is the Global File System (written by Sistina, the same
 folks behind LVM, and bought by Red Hat). The Google File System is a
 different thing, and ironically its client/storage interface is a
 little more like Sheepdog's and unlike a regular cluster filesystem's.

Hmm, was Avi referring to the Global File System? I wasn't sure; 'GFS'
is ambiguous. Anyway, the Global File System is a SAN file system, which
is a completely different architecture from Sheepdog.


  Sheepdog uses consistent hashing to decide where objects are stored;
  I/O load is balanced across the nodes. When a new node is added or an
  existing node is removed, the hash table changes and the data is
  automatically and transparently moved between nodes.
 
  We plan to implement a mechanism to distribute the data not randomly
  but intelligently; we could use machine load, the locations of VMs, etc.
 
 I don't have much hands-on experience with consistent hashing, but it
 sounds reasonable to make each node's ring segment proportional to its
 storage capacity.

Yeah, that's one of the techniques, I think.
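
As an illustration of that technique, here is a minimal user-space sketch
(assuming a toy FNV-1a hash and an in-memory node table; this is not
Sheepdog's actual code): each node gets a number of virtual points on the
ring proportional to its capacity, and an object goes to the first node
clockwise from its hash.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy 64-bit FNV-1a hash; Sheepdog itself uses a different function. */
static uint64_t hash64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

struct vnode {
    uint64_t point;     /* position on the ring */
    int node_id;        /* physical node owning this point */
};

static int vnode_cmp(const void *a, const void *b)
{
    const struct vnode *x = a, *y = b;
    return (x->point > y->point) - (x->point < y->point);
}

/* Give each node one virtual point per GB of capacity, then sort the ring. */
static size_t build_ring(struct vnode *ring, const int *capacity_gb, int nr_nodes)
{
    size_t n = 0;
    int id, v;

    for (id = 0; id < nr_nodes; id++)
        for (v = 0; v < capacity_gb[id]; v++) {
            uint64_t key[2] = { id, v };
            ring[n].point = hash64(key, sizeof(key));
            ring[n].node_id = id;
            n++;
        }
    qsort(ring, n, sizeof(*ring), vnode_cmp);
    return n;
}

/* An object goes to the first virtual point at or after its hash (wrapping). */
static int lookup(const struct vnode *ring, size_t n, const char *oid)
{
    uint64_t h = hash64(oid, strlen(oid));
    size_t i;

    for (i = 0; i < n; i++)
        if (ring[i].point >= h)
            return ring[i].node_id;
    return ring[0].node_id;
}

int main(void)
{
    int capacity_gb[3] = { 100, 200, 400 };  /* node 2 should get ~4x node 0 */
    struct vnode ring[700];
    size_t n = build_ring(ring, capacity_gb, 3);
    int counts[3] = { 0, 0, 0 };
    char oid[32];
    int i;

    for (i = 0; i < 10000; i++) {
        snprintf(oid, sizeof(oid), "object-%d", i);
        counts[lookup(ring, n, oid)]++;
    }
    printf("placement: %d %d %d\n", counts[0], counts[1], counts[2]);
    return 0;
}

With capacities of 100/200/400 the object counts come out roughly 1:2:4.
A real implementation would use a binary search over the sorted ring and,
when membership changes, move only the objects whose ranges changed hands.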


 dynamic load balancing seems a tougher nut to crack, especially while
 keeping all clients' mappings consistent.

There are some techniques to do that.

We think that there are existing techniques to distribute the data
intelligently; we just have not analyzed the options yet.


 I'd just like to add my '+1 votes' on both getting rid of the JVM
 dependency and using block devices (usually LVM) instead of ext3/btrfs.

LVM doesn't fit our requirements nicely. What we need is to update some
objects atomically. We could implement that ourselves, but we prefer to
keep our code simple by using the existing mechanisms a file system
already provides.
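
For illustration, here is a minimal sketch of the kind of atomic object
update a file system gives us almost for free: write the new contents to a
temporary file, fsync it, then rename it over the old object. The path
names are made up for the example, not Sheepdog's actual on-disk layout.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomically replace the contents of an object file: readers see either
 * the old contents or the new ones, never a partial write, because
 * rename(2) is atomic within a file system.
 */
static int update_object(const char *path, const void *buf, size_t len)
{
    char tmp[256];
    ssize_t ret;
    int fd;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ret = write(fd, buf, len);
    if (ret != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    if (rename(tmp, path) < 0) {    /* the atomic commit point */
        unlink(tmp);
        return -1;
    }
    return 0;
}

int main(void)
{
    const char data[] = "new object contents\n";

    return update_object("/tmp/obj-00000001", data, sizeof(data) - 1);
}

Getting the same guarantee directly on top of an LVM logical volume would
mean implementing our own journaling or copy-on-write scheme, which is
exactly the complexity we want to avoid.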


Re: [PATCH 3/9] add frontend implementation for the IOMMU API

2008-12-01 Thread FUJITA Tomonori
On Fri, 28 Nov 2008 12:31:29 +0100
Joerg Roedel [EMAIL PROTECTED] wrote:

 On Fri, Nov 28, 2008 at 06:40:41PM +0900, FUJITA Tomonori wrote:
  On Thu, 27 Nov 2008 16:40:48 +0100
  Joerg Roedel [EMAIL PROTECTED] wrote:
  
   Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
   ---
 drivers/base/iommu.c |   94 ++
 1 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 drivers/base/iommu.c
   
   diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
   new file mode 100644
   index 000..7250b9c
   --- /dev/null
   +++ b/drivers/base/iommu.c
  
  Hmm, why is this at drivers/base/? Could anyone except KVM use this?
  If it's only for KVM, wouldn't virt/ be more appropriate?
 
 I don't see a reason why this should be KVM specific. KVM is the only
 user for now, but it can be used for e.g. UIO too, or in drivers to
 speed up devices which have bad performance when they do scatter-gather
 IO.

If there are users other than KVM that could use this, it should be
fine, I guess.

Can you add that information (e.g. who else could use this) to the patch
description? It should be in the git log when the patch is merged.


  The names (include/linux/iommu.h, iommu.c, iommu_ops, etc.) look too
  generic, don't they? We already have lots of similar things
  (e.g. arch/{x86,ia64}/asm/iommu.h, several archs' iommu.c, etc.), and
  such names are expected to cover all the IOMMUs.
 
  The API is already useful for more than KVM. I also plan to extend it to
  support more types of IOMMUs than VT-d and AMD IOMMU in the future. But
  these changes are more intrusive than this patchset and need more
  discussion. I prefer to take small steps in this direction.

Can you be more specific? Which IOMMUs could use this? For example, how
could GART use this? I think people expect the name 'struct iommu_ops' to
be an abstraction over all the IOMMUs (or at least the majority). If it
works like that, the name is a good choice, I think.
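
To make the naming question concrete, here is a rough sketch of the kind
of domain-based ops table being discussed (my reading of the proposal, not
the patch itself; the member names are illustrative):

#include <stddef.h>
#include <stdint.h>

struct device;          /* opaque here; the real one lives in the driver core */

/* One isolated IO address space; devices attached to it share its mappings. */
struct iommu_domain {
    void *priv;         /* implementation-specific state */
};

/* Hypothetical domain-based ops table, sketched for discussion only. */
struct iommu_ops {
    int  (*domain_init)(struct iommu_domain *domain);
    void (*domain_destroy)(struct iommu_domain *domain);
    int  (*attach_dev)(struct iommu_domain *domain, struct device *dev);
    void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
    int  (*map)(struct iommu_domain *domain, unsigned long iova,
                uint64_t paddr, size_t size, int prot);
    void (*unmap)(struct iommu_domain *domain, unsigned long iova,
                  size_t size);
};

Every callback is phrased in terms of a domain, which is exactly why the
interface only fits IOMMUs that have, or can emulate, multiple address
spaces; a GART-style aperture with one flat address space can only offer
it by faking domains, which is what prompts the question about the name.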


Re: [PATCH 3/9] add frontend implementation for the IOMMU API

2008-12-01 Thread FUJITA Tomonori
On Mon, 1 Dec 2008 15:02:09 +0200
Muli Ben-Yehuda [EMAIL PROTECTED] wrote:

 On Mon, Dec 01, 2008 at 01:00:26PM +0100, Joerg Roedel wrote:
 
 The names (include/linux/iommu.h, iommu.c, iommu_ops, etc.) look too
 generic, don't they? We already have lots of similar things
 (e.g. arch/{x86,ia64}/asm/iommu.h, several archs' iommu.c, etc.), and
 such names are expected to cover all the IOMMUs.

The API is already useful for more than KVM. I also plan to
extend it to support more types of IOMMUs than VT-d and AMD
IOMMU in the future. But these changes are more intrusive than
this patchset and need more discussion. I prefer to take small
steps in this direction.
   
    Can you be more specific? Which IOMMUs could use this? For example,
    how could GART use this? I think people expect the name 'struct
    iommu_ops' to be an abstraction over all the IOMMUs (or at least the
    majority). If it works like that, the name is a good choice, I
    think.
  
  GART can't use exactly this. But with some extensions we can make it
  useful for GART and GART-like IOMMUs too. For example we can emulate
  domains in GART by partitioning the GART aperture space.
 
 That would only work with a pvdma API, since GART doesn't support
 multiple address spaces, and you don't get the isolation properties of
 a real IOMMU, so... why would you want to do that?

If this only works for IOMMUs that support some kind of domain concept,
then I think a name like iommu_domain_ops would be more appropriate.
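
For reference, a toy user-space model of the aperture-partitioning idea
mentioned above (purely illustrative: the aperture size, slice count and
table are made up, and this is not GART code). The single aperture is
split into fixed slices, each slice stands in for one emulated 'domain',
and mappings for a domain are only allocated inside its slice.

#include <stdint.h>
#include <stdio.h>

#define APERTURE_PAGES  1024    /* toy aperture: 1024 page slots */
#define NR_DOMAINS      4       /* emulated domains = equal slices */
#define SLICE_PAGES     (APERTURE_PAGES / NR_DOMAINS)

static uint64_t slot_to_phys[APERTURE_PAGES];   /* stand-in for the GART table */
static unsigned int next_free[NR_DOMAINS];      /* bump allocator per slice */

/* Map one page into emulated domain 'dom'; returns the aperture page number. */
static int emu_domain_map(int dom, uint64_t phys_page)
{
    unsigned int base = dom * SLICE_PAGES;

    if (next_free[dom] >= SLICE_PAGES)
        return -1;                      /* this slice is exhausted */

    slot_to_phys[base + next_free[dom]] = phys_page;
    return base + next_free[dom]++;
}

int main(void)
{
    int io_page = emu_domain_map(2, 0x12345);

    printf("domain 2 mapped phys page 0x12345 at aperture page %d\n", io_page);
    return 0;
}

As discussed elsewhere in this thread, the hardware still decodes the
whole aperture for every attached device, so a device 'in' one slice can
happily DMA to another slice; the partitioning gives you the API shape
but not isolation.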


Re: [PATCH 3/9] add frontend implementation for the IOMMU API

2008-12-01 Thread FUJITA Tomonori
On Mon, 01 Dec 2008 16:33:11 +0200
Avi Kivity [EMAIL PROTECTED] wrote:

 Joerg Roedel wrote:
  Hmm, is there any hardware IOMMU with which we can't emulate domains by
  partitioning the IO address space? This concept works for GART and
  Calgary.
 

 
 Is partitioning secure? Domain X's user could program its hardware to
 DMA to domain Y's addresses, zapping away domain Y's user's memory.

It can't be secure. So what's the point of emulating domain partitioning
on the many traditional hardware IOMMUs that don't support it?

Emulated domain support combined with the DMA mapping debugging feature
might be useful for debugging drivers, but that doesn't mean we need to
add emulated domain support to every hardware IOMMU. If you add it to
swiotlb, everyone can enjoy the debugging.


Re: [PATCH 3/9] add frontend implementation for the IOMMU API

2008-11-28 Thread FUJITA Tomonori
On Thu, 27 Nov 2008 16:40:48 +0100
Joerg Roedel [EMAIL PROTECTED] wrote:

 Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
 ---
  drivers/base/iommu.c |   94 ++
  1 files changed, 94 insertions(+), 0 deletions(-)
  create mode 100644 drivers/base/iommu.c
 
 diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
 new file mode 100644
 index 000..7250b9c
 --- /dev/null
 +++ b/drivers/base/iommu.c

Hmm, why is this at drivers/base/? Could anyone except KVM use this?
If it's only for KVM, wouldn't virt/ be more appropriate?

The names (include/linux/iommu.h, iommu.c, iommu_ops, etc.) look too
generic, don't they? We already have lots of similar things
(e.g. arch/{x86,ia64}/asm/iommu.h, several archs' iommu.c, etc.), and such
names are expected to cover all the IOMMUs.


Re: [PATCH 0/9][RFC] stackable dma_ops for x86

2008-09-29 Thread FUJITA Tomonori
On Sun, 28 Sep 2008 20:49:26 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
  On Mon, 22 Sep 2008 20:21:12 +0200
  Joerg Roedel [EMAIL PROTECTED] wrote:
  
   Hi,
   
   this patch series implements stackable dma_ops on x86. This is useful to
   be able to fall back to a different dma_ops implementation if one can
   not handle a particular device (as necessary for example with
   paravirtualized device passthrough or if a hardware IOMMU only handles a
   subset of available devices).
  
  We already handle the latter. This patchset is more flexible but
  seems to incur more overhead.
  
  Will this feature be used only for paravirtualized device passthrough?
  If so, I feel that there are simpler (and more specific) solutions for
  it.
 
 It's not only for device passthrough. It also handles the cases where a
 hardware IOMMU does not handle all devices in the system (like on some
 Calgary systems, but also possible with AMD IOMMU). With this patchset we

I know that. As I wrote in the previous mail, we already solved that
problem with per-device dma_ops.

My question is: what unsolved problems does this patchset fix?


This patchset is named 'stackable dma_ops', but it's different from what
we discussed as stackable dma_ops. It gives IOMMUs a generic mechanism to
set up stackable dma_ops, but it doesn't solve the problem that a hardware
IOMMU does not handle all devices (that was already solved with per-device
dma_ops).

If paravirtualized device passthrough still needs to call multiple dma_ops
for one device, then this patchset doesn't solve that issue either.
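
To illustrate what 'already solved with per-device dma_ops' means, here is
a stripped-down model of the dispatch (a sketch, not the actual arch/x86
code): each device can carry its own ops pointer, and the generic lookup
falls back to the global ops when it is unset.

#include <stddef.h>
#include <stdio.h>

struct device;

/* Reduced stand-in for struct dma_mapping_ops: just one callback. */
struct dma_mapping_ops {
    const char *name;
    unsigned long (*map_single)(struct device *dev, void *ptr, size_t size);
};

/* Per-device data, analogous to dev->archdata.dma_ops on x86. */
struct device {
    struct dma_mapping_ops *dma_ops;
};

static struct dma_mapping_ops nommu_ops   = { .name = "nommu" };
static struct dma_mapping_ops calgary_ops = { .name = "calgary" };

static struct dma_mapping_ops *global_dma_ops = &nommu_ops;

/*
 * The per-device lookup: an IOMMU driver only has to set dev->dma_ops for
 * the devices it actually translates; every other device falls back to the
 * global implementation.
 */
static struct dma_mapping_ops *get_dma_ops(struct device *dev)
{
    if (dev && dev->dma_ops)
        return dev->dma_ops;
    return global_dma_ops;
}

int main(void)
{
    struct device behind_iommu = { .dma_ops = &calgary_ops };
    struct device plain_dev    = { .dma_ops = NULL };

    printf("%s\n", get_dma_ops(&behind_iommu)->name);  /* calgary */
    printf("%s\n", get_dma_ops(&plain_dev)->name);     /* nommu */
    return 0;
}

The open question in this sub-thread is whether paravirtualized
passthrough needs something beyond that single per-device pointer, e.g.
genuinely chaining two implementations for the same device.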


 can handle these cases in a generic way without hacking it into the
 hardware drivers (these hacks are also in the AMD IOMMU code and I plan
 to remove them in the case this patchset will be accepted).


Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops

2008-09-29 Thread FUJITA Tomonori
On Mon, 29 Sep 2008 11:36:52 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 On Mon, Sep 29, 2008 at 12:30:44PM +0300, Muli Ben-Yehuda wrote:
  On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:
  
   I think we should try to build a paravirtualized IOMMU for KVM
   guests. It should work this way: we reserve a configurable amount
   of contiguous guest physical memory and map it DMA-contiguous using
   some kind of hardware IOMMU. This is possible with all hardware
   IOMMUs we have in the field by now, including Calgary and GART. The
   guest does dma_coherent allocations from this memory directly and is
   done. For map_single and map_sg the guest can do bounce buffering.
   We avoid nearly all pvdma hypercalls with this approach, keep guest
   swapping working, and also solve the problems with device dma_masks
   and guest memory that is not contiguous on the host side.
  
  I'm not sure I follow, but if I understand correctly, with this
  approach the guest could only DMA into buffers that fall within the
  range you allocated for DMA and mapped. Isn't that a pretty nasty
  limitation? The guest would need to bounce-buffer every frame that
  happened not to fall inside that range, with the resulting loss of
  performance.
 
 The bounce buffering is needed for map_single/map_sg mappings. For
 dma_alloc_coherent we can allocate directly from that range. The
 performance loss of the bounce buffering may be lower than the cost of
 the hypercalls we would need as the alternative (hypercalls for map,
 unmap and sync).

Nobody cares about the performance of dma_alloc_coherent; only the
performance of map_single/map_sg matters.

I'm not sure how expensive the hypercalls are, but are they really more
expensive than bounce buffering, which copies lots of data for every I/O?
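
To make the trade-off concrete, here is a toy model of the bounce-buffering
path being proposed (the names, sizes and addresses are assumptions for the
example, not a real pv-dma implementation): the guest owns a region the
host has already mapped DMA-contiguously, so map_single becomes a slot
allocation plus a memcpy whose cost scales with the I/O size.

#include <stdint.h>
#include <string.h>

#define POOL_SIZE   (4 << 20)   /* pre-mapped, DMA-contiguous region */

/* Filled in once at boot from the host/hypervisor; the values here are fake. */
static uint8_t pool[POOL_SIZE];                 /* guest-virtual view of the region */
static uint64_t pool_bus_base = 0x80000000ULL;  /* bus address of pool[0] */
static size_t pool_next;                        /* trivial bump allocator */

enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE };

/* "map": copy the buffer into the pre-mapped pool; no hypercall needed. */
static uint64_t pv_map_single(void *ptr, size_t size, enum dma_dir dir)
{
    size_t off = pool_next;

    if (off + size > POOL_SIZE)
        return 0;               /* out of bounce space (sketch only) */
    pool_next += size;

    if (dir == DMA_TO_DEVICE)
        memcpy(&pool[off], ptr, size);  /* the per-I/O copy cost */
    return pool_bus_base + off;
}

/* "unmap": copy device-written data back out for FROM_DEVICE transfers. */
static void pv_unmap_single(uint64_t bus, void *ptr, size_t size, enum dma_dir dir)
{
    size_t off = bus - pool_bus_base;

    if (dir == DMA_FROM_DEVICE)
        memcpy(ptr, &pool[off], size);
    /* a real implementation would also free the slot here */
}

int main(void)
{
    char req[4096] = "some disk write payload";
    uint64_t bus = pv_map_single(req, sizeof(req), DMA_TO_DEVICE);

    /* ... device DMAs from 'bus' ... */
    pv_unmap_single(bus, req, sizeof(req), DMA_TO_DEVICE);
    return 0;
}

Whether this beats hypercall-based mapping then comes down to per-I/O copy
cost versus per-I/O exit cost, which is exactly the question above.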


Re: [PATCH 0/9][RFC] stackable dma_ops for x86

2008-09-29 Thread FUJITA Tomonori
On Mon, 29 Sep 2008 15:26:47 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 On Mon, Sep 29, 2008 at 10:16:39PM +0900, FUJITA Tomonori wrote:
  On Sun, 28 Sep 2008 20:49:26 +0200
  Joerg Roedel [EMAIL PROTECTED] wrote:
  
   On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
On Mon, 22 Sep 2008 20:21:12 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 Hi,
 
 this patch series implements stackable dma_ops on x86. This is useful to
 be able to fall back to a different dma_ops implementation if one can
 not handle a particular device (as necessary for example with
 paravirtualized device passthrough or if a hardware IOMMU only handles a
 subset of available devices).

We already handle the latter. This patchset is more flexible but
seems to incur more overhead.

Will this feature be used only for paravirtualized device passthrough?
If so, I feel that there are simpler (and more specific) solutions for
it.
   
   It's not only for device passthrough. It also handles the cases where a
   hardware IOMMU does not handle all devices in the system (like on some
   Calgary systems, but also possible with AMD IOMMU). With this patchset we
  
  I know that. As I wrote in the previous mail, we already solved that
  problem with per-device dma_ops.
  
  My question is: what unsolved problems does this patchset fix?
  
  
  This patchset is named 'stackable dma_ops', but it's different from what
  we discussed as stackable dma_ops. It gives IOMMUs a generic mechanism to
  set up stackable dma_ops, but it doesn't solve the problem that a hardware
  IOMMU does not handle all devices (that was already solved with
  per-device dma_ops).
  
  If paravirtualized device passthrough still needs to call multiple
  dma_ops for one device, then this patchset doesn't solve that issue either.
 
 Ok, the name 'stackable' is misleading and was a bad choice. I will
 rename it to 'multiplexing'; this should make it clearer what it is.
 Like you pointed out, the problems are solved with per-device dma_ops,
 but in the current implementation it needs special hacks in the IOMMU
 drivers to use these per-device dma_ops.
 I see this patchset as a continuation of the per-device dma_ops idea. It
 moves the per-device handling out of the specific drivers to a common
 place, so we can avoid or remove special hacks in the IOMMU drivers.

Basically, I'm not against this patchset. It simplifies the Calgary and
AMD IOMMU code that sets up per-device dma_ops (though it makes dma_ops a
bit more complicated).

But it doesn't solve any new problems, including paravirtualized device
passthrough. When I wrote per-device dma_ops, I expected that the KVM
people would want further changes to dma_ops (such as truly stackable
dma_ops) for paravirtualized device passthrough. I'd like to hear what
they want first.


Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops

2008-09-28 Thread FUJITA Tomonori
On Mon, 22 Sep 2008 20:21:21 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 This patch enables stackable dma_ops on x86. To do this, it also enables
 the per-device dma_ops on i386.
 
 Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
 ---
  arch/x86/kernel/pci-dma.c |   26 ++
  include/asm-x86/device.h  |6 +++---
  include/asm-x86/dma-mapping.h |   14 +++---
  3 files changed, 36 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
 index b990fb6..2e517c2 100644
 --- a/arch/x86/kernel/pci-dma.c
 +++ b/arch/x86/kernel/pci-dma.c
 @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
   write_unlock_irqrestore(dma_ops_list_lock, flags);
  }
  
 +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
 +{
 +        int i;
 +        unsigned long flags;
 +        struct dma_mapping_ops *entry, *ops = NULL;
 +
 +        read_lock_irqsave(&dma_ops_list_lock, flags);
 +
 +        for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
 +                list_for_each_entry(entry, &dma_ops_list[i], list) {
 +                        if (!entry->device_supported)
 +                                continue;
 +                        if (entry->device_supported(dev)) {
 +                                ops = entry;
 +                                goto out;
 +                        }
 +                }
 +out:
 +        read_unlock_irqrestore(&dma_ops_list_lock, flags);

Hmm, every time we call map_sg/map_single, we call
read_lock_irqsave(&dma_ops_list_lock, flags). Isn't it likely that we'll
see a notable performance drop?
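
Just to illustrate one way the fast-path cost could be avoided (my own
sketch, not something in the patchset): do the list walk once per device
and cache the result in a per-device pointer, so map_sg/map_single never
take the list lock.

#include <stddef.h>
#include <pthread.h>

struct device;

struct dma_mapping_ops {
    int (*device_supported)(struct device *dev);
};

/* Per-device cache, analogous to dev->archdata.dma_ops on x86. */
struct device {
    struct dma_mapping_ops *cached_ops;
};

static pthread_rwlock_t dma_ops_list_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct dma_mapping_ops *registered_ops[8];
static int nr_registered_ops;

/* Slow path: walk the registered implementations under the lock. */
static struct dma_mapping_ops *find_dma_ops_slow(struct device *dev)
{
    struct dma_mapping_ops *ops = NULL;
    int i;

    pthread_rwlock_rdlock(&dma_ops_list_lock);
    for (i = 0; i < nr_registered_ops; i++)
        if (registered_ops[i]->device_supported &&
            registered_ops[i]->device_supported(dev)) {
            ops = registered_ops[i];
            break;
        }
    pthread_rwlock_unlock(&dma_ops_list_lock);
    return ops;
}

/*
 * Fast path: after the first lookup, callers just read the cached pointer
 * and never touch dma_ops_list_lock again. (A real kernel version would
 * have to think about ordering when filling the cache.)
 */
static struct dma_mapping_ops *get_dma_ops_cached(struct device *dev)
{
    if (!dev->cached_ops)
        dev->cached_ops = find_dma_ops_slow(dev);
    return dev->cached_ops;
}

static int always_yes(struct device *dev) { (void)dev; return 1; }

int main(void)
{
    static struct dma_mapping_ops generic = { .device_supported = always_yes };
    struct device dev = { .cached_ops = NULL };

    registered_ops[nr_registered_ops++] = &generic;
    return get_dma_ops_cached(&dev) == &generic ? 0 : 1;
}

Registering or unregistering an implementation would then have to
invalidate the cached pointers, but in the common case the device-to-ops
mapping is fixed after boot, so the per-I/O cost disappears.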


Re: [PATCH 0/9][RFC] stackable dma_ops for x86

2008-09-28 Thread FUJITA Tomonori
On Mon, 22 Sep 2008 20:21:12 +0200
Joerg Roedel [EMAIL PROTECTED] wrote:

 Hi,
 
 this patch series implements stackable dma_ops on x86. This is useful to
 be able to fall back to a different dma_ops implementation if one can
 not handle a particular device (as necessary for example with
 paravirtualized device passthrough or if a hardware IOMMU only handles a
 subset of available devices).

We already handle the latter. This patchset is more flexible but
seems to incur more overhead.

Will this feature be used only for paravirtualized device passthrough?
If so, I feel that there are simpler (and more specific) solutions for
it.


Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware

2008-07-30 Thread FUJITA Tomonori
On Wed, 30 Jul 2008 15:58:46 +0200
Andrea Arcangeli [EMAIL PROTECTED] wrote:

 On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
  * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
  I'm not so interested in going there right now, because while this code
  is useful now, since the majority of systems out there lack
  VT-d/IOMMU, I suspect this code could be nuked in the long
  run when all systems ship with that, which is why I kept it all
  
   Actually, at least on Intel platforms, and if you exclude the lowest end,
   VT-d has been shipping universally for quite some time now. If you
   buy an Intel box today, or bought one in the last year, the chances are
   pretty high that it has VT-d support.
  
  I think you mean VT-x, which is the virtualization extension for the x86
  architecture. VT-d is the virtualization extension for devices (the IOMMU).
 
 I think Andi understood VT-d right. But even if he were right that every
 reader of this email buying a new VT-x system today is almost guaranteed
 to also get a VT-d motherboard (which I disagree with, unless you buy
 some really expensive toy), there are large existing installations of
 VT-x systems that lack VT-d. With recent dual/quad-core CPUs they are
 very fast, will be used for the next couple of years, and nobody will
 upgrade just the motherboard to use pci-passthrough.

Today, very inexpensive desktops (for example, Dell OptiPlex 755) have
VT-d support.