Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Michael Ellerman
On Tue, 2012-10-16 at 14:13 +1100, Paul Mackerras wrote:
 Michael,
 
 On Tue, Oct 16, 2012 at 11:15:50AM +1100, Michael Ellerman wrote:
  In the Book3s HV code, kvmppc_run_core() has logic to grab the secondary
  threads of the physical core.
  
  If for some reason a thread is stuck, kvmppc_grab_hwthread() can fail,
  but currently we ignore the failure and continue into the guest. If the
  stuck thread is in the kernel, badness ensues.
  
  Instead we should check for failure and bail out.
  
  I've moved the grabbing prior to the startup of runnable threads, to
  simplify the error case. AFAICS this is harmless, but I could be missing
  something subtle.
 
 Thanks for looking at this - but in fact this is fixed by my patch
 entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary
 threads", submitted back on August 28.

OK thanks. It seems that patch didn't make 3.7 ?

I don't see it in kvm-ppc-next either.

cheers



Re: [PATCH 0/8] Various Book3s HV fixes that haven't been picked up yet

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 05:08, Paul Mackerras <pau...@samba.org> wrote:

 On Mon, Oct 15, 2012 at 02:00:54PM +0200, Alexander Graf wrote:
 
 Sorry, I can't accept patches that haven't shown up on kvm@vger. Please send 
 this patch set again with CC to kvm@vger.
 
 Done; I didn't cc kvm-ppc this time since the patches haven't changed.

Thanks :)

 
 By the way, what is the purpose of kvm-ppc@vger.kernel.org?

It's basically a tag so people can filter based on it. For me, mails to 
kvm-ppc@vger land straight in my inbox, while mails to kvm@vger go to a folder 
I only read occasionally.

Alex

 
 Paul.


Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Avi Kivity
On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.
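
Concretely, the split described above amounts to something like the sketch
below (field names assumed from the kvm_host.h of that era; the actual patch
may differ in detail):

static inline int kvm_mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
{
        if (unlikely(kvm->mmu_notifier_count))
                return 1;
        /* Ensure mmu_notifier_count is read before mmu_notifier_seq. */
        smp_rmb();
        if (kvm->mmu_notifier_seq != mmu_seq)
                return 1;
        return 0;
}

static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
{
        /* Existing callers keep passing a vcpu; only vcpu->kvm is used. */
        return kvm_mmu_notifier_retry(vcpu->kvm, mmu_seq);
}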

Why not change mmu_notifier_retry() and all its callers?


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 11:44, Avi Kivity wrote:

 On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.
 
 Why not change mmu_notifier_retry() and all its callers?

Why not use Christoffer's patch? :)

Alex



Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity
On 10/16/2012 06:01 AM, Paul Mackerras wrote:
 A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
 this fd return the contents of the HPT (hashed page table), writes
 create and/or remove entries in the HPT.  There is a new capability,
 KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
 takes an argument structure with the index of the first HPT entry to
 read out and a set of flags.  The flags indicate whether the user is
 intending to read or write the HPT, and whether to return all entries
 or only the bolted entries (those with the bolted bit, 0x10, set in
 the first doubleword).
 
 This is intended for use in implementing qemu's savevm/loadvm and for
 live migration.  Therefore, on reads, the first pass returns information
 about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
 end of the HPT, it returns from the read.  Subsequent reads only return
 information about HPTEs that have changed since they were last read.
 A read that finds no changed HPTEs in the HPT following where the last
 read finished will return 0 bytes.

Copying people with interest in migration.

 +4.78 KVM_PPC_GET_HTAB_FD
 +
 +Capability: KVM_CAP_PPC_HTAB_FD
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: Pointer to struct kvm_get_htab_fd (in)
 +Returns: file descriptor number (>= 0) on success, -1 on error
 +
 +This returns a file descriptor that can be used either to read out the
 +entries in the guest's hashed page table (HPT), or to write entries to
 +initialize the HPT.  The returned fd can only be written to if the
 +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
 +can only be read if that bit is clear.  The argument struct looks like
 +this:
 +
 +/* For KVM_PPC_GET_HTAB_FD */
 +struct kvm_get_htab_fd {
 + __u64   flags;
 + __u64   start_index;
 +};
 +
 +/* Values for kvm_get_htab_fd.flags */
 +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
 +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
 +
 +The `start_index' field gives the index in the HPT of the entry at
 +which to start reading.  It is ignored when writing.
 +
 +Reads on the fd will initially supply information about all
 +interesting HPT entries.  Interesting entries are those with the
 +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
 +all entries.  When the end of the HPT is reached, the read() will
 +return.  

What happens if the read buffer is smaller than the HPT size?

What happens if the read buffer size is not a multiple of entry size?

Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
to change).

 If read() is called again on the fd, it will start again from
 +the beginning of the HPT, but will only return HPT entries that have
 +changed since they were last read.
 +
 +Data read or written is structured as a header (8 bytes) followed by a
 +series of valid HPT entries (16 bytes) each.  The header indicates how
 +many valid HPT entries there are and how many invalid entries follow
 +the valid entries.  The invalid entries are not represented explicitly
 +in the stream.  The header format is:
 +
 +struct kvm_get_htab_header {
 + __u32   index;
 + __u16   n_valid;
 + __u16   n_invalid;
 +};

This structure forces the kernel to return entries sequentially.  Will
this block changing the data structure in the future?  Or is the
hardware spec sufficiently strict that such changes are not realistic?

 +
 +Writes to the fd create HPT entries starting at the index given in the
 +header; first `n_valid' valid entries with contents from the data
 +written, then `n_invalid' invalid entries, invalidating any previously
 +valid entries found.

This scheme is a clever, original, and very interesting approach to live
migration.  That doesn't necessarily mean a NAK, we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).

What is the typical number of entries in the HPT?  Do you have estimates
of the change rate?

Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual-physical translations on x86).  How
will we deal with that?  (But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway.)


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 12:57, Avi Kivity wrote:

 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 The current eventfd code assumes that when we have eventfd, we also have
 irqfd for in-kernel interrupt delivery. This is not necessarily true. On
 PPC we don't have an in-kernel irqchip yet, but we can still easily
 support eventfd.
 
 
 Don't you need, in addition, to reject PIO space for ioeventfd?

Yeah, we could tell user space that it's doing something stupid. Not sure how 
critical it is - right now it would just register it and never get triggered, 
because the bus doesn't exist. But sanity checks are a good thing usually.


Alex



Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 13:01, Avi Kivity wrote:

 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


Alex



Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras
On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (>= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +   __u64   flags;
  +   __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY   ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

 What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer.  The read() call
returns the number of bytes that were filled in, of course.  In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.
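
To make that concrete, a userspace dump loop over this interface could look
roughly like the sketch below (illustrative only: error handling is minimal,
and the ioctl and struct definitions are the ones proposed in this series,
not something already present in linux/kvm.h):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>          /* with this series applied */

static int dump_htab(int vm_fd)
{
        struct kvm_get_htab_fd ghf = { .flags = 0, .start_index = 0 };
        struct kvm_get_htab_header hdr;
        char buf[65536];
        ssize_t n;
        int htab_fd;

        htab_fd = ioctl(vm_fd, KVM_PPC_GET_HTAB_FD, &ghf);
        if (htab_fd < 0)
                return -1;

        /*
         * Each read() returns whole records only: an 8-byte header plus
         * n_valid 16-byte HPTEs, so the byte count need not be a multiple
         * of 16.  A return of 0 means no changed entries were found past
         * the point where the previous read stopped.
         */
        while ((n = read(htab_fd, buf, sizeof(buf))) > 0) {
                size_t off = 0;

                while (off + sizeof(hdr) <= (size_t)n) {
                        memcpy(&hdr, buf + off, sizeof(hdr));
                        printf("slots %u..%u: %u valid, %u invalid\n",
                               hdr.index,
                               hdr.index + hdr.n_valid + hdr.n_invalid - 1,
                               hdr.n_valid, hdr.n_invalid);
                        off += sizeof(hdr) + (size_t)hdr.n_valid * 16;
                }
        }
        close(htab_fd);
        return 0;
}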

 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).

No.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +   __u32   index;
  +   __u16   n_valid;
  +   __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?

By data structure, do you mean the stream format on the file
descriptor, or the HPT structure?  If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean "I want the new stream format".  The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.
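
For instance, a userspace that wanted some future stream format could probe
for it and fall back along these lines (KVM_GET_HTAB_NEW_FORMAT is a made-up
flag, purely for illustration):

        struct kvm_get_htab_fd ghf = {
                .flags = KVM_GET_HTAB_NEW_FORMAT,       /* hypothetical */
                .start_index = 0,
        };
        int htab_fd = ioctl(vm_fd, KVM_PPC_GET_HTAB_FD, &ghf);
        if (htab_fd < 0) {
                /* Old kernel: unknown flag bits make the ioctl fail. */
                ghf.flags = 0;
                htab_fd = ioctl(vm_fd, KVM_PPC_GET_HTAB_FD, &ghf);
        }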

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).

  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size.  The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.
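
(As a worked example: at 16 bytes per entry, a million-entry HPT is 16MiB,
which under the 1/64 guideline corresponds to a guest with a 1GiB RAM
maximum, or 4GiB at 1/256.)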

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

 Suppose new hardware arrives that supports nesting HPTs, so that kvm is
 no longer synchronously aware of the guest HPT (similar to how NPT/EPT
 made 

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity
On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (>= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +  __u64   flags;
  +  __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY  ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

This forces userspace to dedicate a thread for the HPT.

 
  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +  __u32   index;
  +  __u16   n_valid;
  +  __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean "I want the new stream format".  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

I meant the internal data structure that holds HPT entries.

I guess I don't understand the index.  Do we expect changes to be in
contiguous ranges?  And invalid entries to be contiguous as well?  That
doesn't fit with how hash tables work.  Does the index represent the
position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?
 
 Typically the HPT would have about a million entries, i.e. it would be
 16MiB in size.  The usual guideline is to make it about 1/64 of the
 maximum amount of RAM the guest could ever have, rounded up to a power
 of two, although we often run with less, say 1/128 or even 1/256.

16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
it warrant a live migration protocol?

 Because it is 

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/16/2012 01:06 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 13:01, Avi Kivity wrote:
 
 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an 
 in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support 
 without in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.
 
 I'm not sure I fully grasp what you're trying to say :).
 
 We have a single interrupt line on the core. So whenever any external 
 interrupt gets injected (same thing for MSI), we need to go to the MPIC / 
 XICS and ask it which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.

 
 So yes, we could create a direct fd channel between vhost and the user space 
 MPIC, but it wouldn't buy us anything. The interrupt injection path would be 
 as long as it is with the current mechanism.
 

If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf

On 10/16/2012 03:47 PM, Avi Kivity wrote:

On 10/16/2012 01:06 PM, Alexander Graf wrote:

On 16.10.2012, at 13:01, Avi Kivity wrote:


On 10/16/2012 12:59 PM, Alexander Graf wrote:

On 16.10.2012, at 12:56, Avi Kivity wrote:


On 10/15/2012 02:02 PM, Alexander Graf wrote:

In order to support vhost, we need to be able to support ioeventfd.

This patch set adds support for ioeventfd to PPC and makes it possible to
do so without implementing irqfd along the way, as it requires an in-kernel
irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.

The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this 
patch set is cleaning up: enable you to build ioeventfd support without 
in-kernel irqchip irqfd support.

Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) 
in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.


So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?


Nope. Well, yes, in a certain special case where the MPIC pushes the 
interrupt vector on interrupt delivery into a special register. But not 
for the normal case.



Alex



Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Benjamin Herrenschmidt
On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote:
  Thanks for looking at this - but in fact this is fixed by my patch
  entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary
  threads", submitted back on August 28.
 
 OK thanks. It seems that patch didn't make 3.7 ?
 
 I don't see it in kvm-ppc-next either.

Alex, WTF ?

Ben.




Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Anthony Liguori
Avi Kivity <a...@redhat.com> writes:

 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (>= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  + __u64   flags;
  + __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

 That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

 This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size > 0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether to return !0 read
sizes when there are no changes.

In any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll the HPT based on how fast we can send data over the wire.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  + __u32   index;
  + __u16   n_valid;
  + __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean "I want the new stream format".  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

 I meant the internal data structure that holds HPT entries.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the 

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras
On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
  On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
  Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
  to change).
  
  No.
 
 This forces userspace to dedicate a thread for the HPT.

Why? Reads never block in any case.

   If read() is called again on the fd, it will start again from
   +the beginning of the HPT, but will only return HPT entries that have
   +changed since they were last read.
   +
   +Data read or written is structured as a header (8 bytes) followed by a
   +series of valid HPT entries (16 bytes) each.  The header indicates how
   +many valid HPT entries there are and how many invalid entries follow
   +the valid entries.  The invalid entries are not represented explicitly
   +in the stream.  The header format is:
   +
   +struct kvm_get_htab_header {
   +__u32   index;
   +__u16   n_valid;
   +__u16   n_invalid;
   +};
  
  This structure forces the kernel to return entries sequentially.  Will
  this block changing the data structure in the future?  Or is the
  hardware spec sufficiently strict that such changes are not realistic?
  
  By data structure, do you mean the stream format on the file
  descriptor, or the HPT structure?  If we want a new stream format,
  then we would define a bit in the flags field of struct
  kvm_get_htab_fd to mean "I want the new stream format".  The code
  fails the ioctl if any unknown flag bits are set, so a new userspace
  that wants to use the new format could then detect that it is running
  on an old kernel and fall back to the old format.
  
  The HPT entry format is very unlikely to change in size or basic
  layout (though the architects do redefine some of the bits
  occasionally).
 
 I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.
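
For example, a header of { index = 24, n_valid = 5, n_invalid = 3 } followed
by 5 x 16 bytes of HPTE data describes the whole group occupying slots
24..31: slots 24..28 are valid, and slots 29..31 are invalid and not
represented explicitly in the stream.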

   +
   +Writes to the fd create HPT entries starting at the index given in the
   +header; first `n_valid' valid entries with contents from the data
   +written, then `n_invalid' invalid entries, invalidating any previously
   +valid entries found.
  
  This scheme is a clever, original, and very interesting approach to live
  migration.  That doesn't necessarily mean a NAK, we should see if it
  makes sense for other migration APIs as well (we currently have
  difficulties migrating very large/wide guests).
  
  What is the typical number of entries in the HPT?  Do you have estimates
  of the change rate?
  
  Typically the HPT would have about a million entries, i.e. it would be
  16MiB in size.  The usual guideline is to make it about 1/64 of the
  maximum amount of RAM the guest could ever have, rounded up to a power
  of two, although we often run with less, say 1/128 or even 1/256.
 
 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
 it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

  Because it is a hash table, updates tend to be scattered throughout
  the whole table, which is another reason why per-page dirty tracking
  and updates would be pretty inefficient.
 
 This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from
the current header format.  That would be less efficient for the
initial pass (assuming we achieve an average n_valid of at least 2 on
the initial pass), and probably less efficient for the incremental
updates, since a newly-invalidated entry would have to be represented
as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
n_invalid=1.  I'm assuming here that the initial pass would omit
invalid entries.
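
(In byte terms, assuming an 8-byte per-entry index in the alternative
format: a run of n valid entries would cost 24n bytes there versus
8 + 16n bytes with the current header, so the current format is never
worse and wins outright once n >= 2, while an invalidation costs 8
bytes instead of 24.)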

  
  As for the change rate, it depends on the application of course, but
  basically every time the guest changes a PTE in its Linux page tables
  we do the corresponding change to the corresponding HPT entry, so the
  rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
  exec, etc. have a high rate of HPT updates.
 
 If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.