Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 21:33, Benjamin Herrenschmidt wrote:

> On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote:
>>> Thanks for looking at this - but in fact this is fixed by my patch
>>> entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary
>>> threads" submitted back on August 28.
>> 
>> OK thanks. It seems that patch didn't make 3.7?
>> 
>> I don't see it in kvm-ppc-next either.
> 
> Alex, WTF?

Hrm. Not sure what happened there. I think I waited for your ack, but never 
actually applied things when it came. My bad :)


Alex



Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras
On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> >> to change).
> > 
> > No.
> 
> This forces userspace to dedicate a thread for the HPT.

Why? Reads never block in any case.

> >> > If read() is called again on the fd, it will start again from
> >> > +the beginning of the HPT, but will only return HPT entries that have
> >> > +changed since they were last read.
> >> > +
> >> > +Data read or written is structured as a header (8 bytes) followed by a
> >> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> >> > +many valid HPT entries there are and how many invalid entries follow
> >> > +the valid entries.  The invalid entries are not represented explicitly
> >> > +in the stream.  The header format is:
> >> > +
> >> > +struct kvm_get_htab_header {
> >> > +__u32   index;
> >> > +__u16   n_valid;
> >> > +__u16   n_invalid;
> >> > +};
> >> 
> >> This structure forces the kernel to return entries sequentially.  Will
> >> this block changing the data structure in the future?  Or is the
> >> hardware spec sufficiently strict that such changes are not realistic?
> > 
> > By "data structure", do you mean the stream format on the file
> > descriptor, or the HPT structure?  If we want a new stream format,
> > then we would define a bit in the flags field of struct
> > kvm_get_htab_fd to mean "I want the new stream format".  The code
> > fails the ioctl if any unknown flag bits are set, so a new userspace
> > that wants to use the new format could then detect that it is running
> > on an old kernel and fall back to the old format.
> > 
> > The HPT entry format is very unlikely to change in size or basic
> > layout (though the architects do redefine some of the bits
> > occasionally).
> 
> I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.
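
To make the framing concrete, here is a minimal sketch of a userspace
reader for this stream (save_hpte() and mark_invalid() are hypothetical
callbacks, not part of the API; it also assumes, as implied above, that a
record -- header plus its n_valid entries -- is never split across read()
calls):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Header format from the patch: */
struct kvm_get_htab_header {
        uint32_t index;
        uint16_t n_valid;
        uint16_t n_invalid;
};

void save_hpte(uint32_t index, const void *hpte);      /* hypothetical */
void mark_invalid(uint32_t index, unsigned count);     /* hypothetical */

/* Drain one pass of the stream; htab_fd comes from KVM_PPC_GET_HTAB_FD.
 * A read() returning 0 means the pass has reached the end of the HPT. */
void drain_htab(int htab_fd)
{
        char buf[65536];
        ssize_t n;

        while ((n = read(htab_fd, buf, sizeof(buf))) > 0) {
                char *p = buf;
                while (p + sizeof(struct kvm_get_htab_header) <= buf + n) {
                        struct kvm_get_htab_header hdr;
                        memcpy(&hdr, p, sizeof(hdr));
                        p += sizeof(hdr);
                        for (uint32_t i = 0; i < hdr.n_valid; i++, p += 16)
                                save_hpte(hdr.index + i, p);  /* 16-byte HPTE */
                        /* the n_invalid slots carry no payload in the stream */
                        mark_invalid(hdr.index + hdr.n_valid, hdr.n_invalid);
                }
        }
}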

> >> > +
> >> > +Writes to the fd create HPT entries starting at the index given in the
> >> > +header; first `n_valid' valid entries with contents from the data
> >> > +written, then `n_invalid' invalid entries, invalidating any previously
> >> > +valid entries found.
> >> 
> >> This scheme is a clever, original, and very interesting approach to live
> >> migration.  That doesn't necessarily mean a NAK, we should see if it
> >> makes sense for other migration APIs as well (we currently have
> >> difficulties migrating very large/wide guests).
> >> 
> >> What is the typical number of entries in the HPT?  Do you have estimates
> >> of the change rate?
> > 
> > Typically the HPT would have about a million entries, i.e. it would be
> > 16MiB in size.  The usual guideline is to make it about 1/64 of the
> > maximum amount of RAM the guest could ever have, rounded up to a power
> > of two, although we often run with less, say 1/128 or even 1/256.
> 
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

> > Because it is a hash table, updates tend to be scattered throughout
> > the whole table, which is another reason why per-page dirty tracking
> > and updates would be pretty inefficient.
> 
> This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from
the current header format.  That would be less efficient for the
initial pass (assuming we achieve an average n_valid of at least 2 on
the initial pass), and probably less efficient for the incremental
updates, since a newly-invalidated entry would have to be represented
as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
n_invalid=1.  I'm assuming here that the initial pass would omit
invalid entries.
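
To put rough numbers on that: assuming (purely for comparison) a per-entry
record of an 8-byte index plus the 16-byte HPTE, i.e. 24 bytes per entry,
a header-format record with n_valid=2 costs 8 + 2*16 = 40 bytes against
2*24 = 48 bytes; the two formats tie at n_valid=1 and the header format
wins for anything denser, consistent with the "average n_valid of at
least 2" threshold above.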

> > 
> > As for the change rate, it depends on the application of course, but
> > basically every time the guest changes a PTE in its Linux page tables
> > we do the corresponding change to the corresponding HPT entry, so the
> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
> > exec, etc. have a high rate of HPT updates.
> 
> If the rate is high enough, then there's no point in

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Anthony Liguori
Avi Kivity writes:

> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
>> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>>> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
>>> > +4.78 KVM_PPC_GET_HTAB_FD
>>> > +
>>> > +Capability: KVM_CAP_PPC_HTAB_FD
>>> > +Architectures: powerpc
>>> > +Type: vm ioctl
>>> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
>>> > +Returns: file descriptor number (>= 0) on success, -1 on error
>>> > +
>>> > +This returns a file descriptor that can be used either to read out the
>>> > +entries in the guest's hashed page table (HPT), or to write entries to
>>> > +initialize the HPT.  The returned fd can only be written to if the
>>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
>>> > +can only be read if that bit is clear.  The argument struct looks like
>>> > +this:
>>> > +
>>> > +/* For KVM_PPC_GET_HTAB_FD */
>>> > +struct kvm_get_htab_fd {
>>> > + __u64   flags;
>>> > + __u64   start_index;
>>> > +};
>>> > +
>>> > +/* Values for kvm_get_htab_fd.flags */
>>> > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
>>> > +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
>>> > +
>>> > +The `start_index' field gives the index in the HPT of the entry at
>>> > +which to start reading.  It is ignored when writing.
>>> > +
>>> > +Reads on the fd will initially supply information about all
>>> > +"interesting" HPT entries.  Interesting entries are those with the
>>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
>>> > +all entries.  When the end of the HPT is reached, the read() will
>>> > +return.  
>>> 
>>> What happens if the read buffer is smaller than the HPT size?
>> 
>> That's fine; the read stops when it has filled the buffer and a
>> subsequent read will continue from where the previous one finished.
>> 
>>> What happens if the read buffer size is not a multiple of entry size?
>> 
>> Then we don't use the last few bytes of the buffer.  The read() call
>> returns the number of bytes that were filled in, of course.  In any
>> case, the header size is 8 bytes and the HPT entry size is 16 bytes,
>> so the number of bytes filled in won't necessarily be a multiple of 16
>> bytes.
>
> That's sane and expected, but it should be documented.
>
>> 
>>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>>> to change).
>> 
>> No.
>
> This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size > 0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether to return a
non-zero size when there are no changes.

In any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll the HPT based on how fast we can send data over the wire.

>>> > If read() is called again on the fd, it will start again from
>>> > +the beginning of the HPT, but will only return HPT entries that have
>>> > +changed since they were last read.
>>> > +
>>> > +Data read or written is structured as a header (8 bytes) followed by a
>>> > +series of valid HPT entries (16 bytes) each.  The header indicates how
>>> > +many valid HPT entries there are and how many invalid entries follow
>>> > +the valid entries.  The invalid entries are not represented explicitly
>>> > +in the stream.  The header format is:
>>> > +
>>> > +struct kvm_get_htab_header {
>>> > + __u32   index;
>>> > + __u16   n_valid;
>>> > + __u16   n_invalid;
>>> > +};
>>> 
>>> This structure forces the kernel to return entries sequentially.  Will
>>> this block changing the data structure in the future?  Or is the
>>> hardware spec sufficiently strict that such changes are not realistic?
>> 
>> By "data structure", do you mean the stream format on the file
>> descriptor, or the HPT structure?  If we want a new stream format,
>> then we would define a bit in the flags field of struct
>> kvm_get_htab_fd to mean "I want the new stream format".  The code
>> fails the ioctl if any unknown flag bits are set, so a new userspace
>> that wants to use the new format could then detect that it is running
>> on an old kernel and fall back to the old format.
>> 
>> The HPT entry format is very unlikely to change in size or basic
>> layout (though the architects do redefine some of the bits
>> occasionally).
>
> I meant the internal data structure that holds HPT entries.
>
> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?
>
>
>> 
>>> > +
>>> > +Writes to the fd create HPT entries starting at the index given in the
>>> > +header; first `n_valid' valid entries with contents from the data
>>> > +written, then `n_invalid' invalid entries, invalidating any previously
>>> > +valid entries found.
>>> 
>>> This scheme is a clever, original, and very interesting approach to live
>>> migration.  That doesn't necessarily mean a NAK, we should see if it
>>> makes sense for other migration APIs as well (we currently have
>>> difficulties migrating very large/wide guests).

Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Benjamin Herrenschmidt
On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote:
> > Thanks for looking at this - but in fact this is fixed by my patch
> > entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary
> > threads" submitted back on August 28.
> 
> OK thanks. It seems that patch didn't make 3.7?
> 
> I don't see it in kvm-ppc-next either.

Alex, WTF?

Ben.




Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf

On 10/16/2012 03:47 PM, Avi Kivity wrote:
> On 10/16/2012 01:06 PM, Alexander Graf wrote:
>> On 16.10.2012, at 13:01, Avi Kivity wrote:
>>> On 10/16/2012 12:59 PM, Alexander Graf wrote:
>>>> On 16.10.2012, at 12:56, Avi Kivity wrote:
>>>>> On 10/15/2012 02:02 PM, Alexander Graf wrote:
>>>>>> In order to support vhost, we need to be able to support ioeventfd.
>>>>>>
>>>>>> This patch set adds support for ioeventfd to PPC and makes it possible to
>>>>>> do so without implementing irqfd along the way, as it requires an in-kernel
>>>>>> irqchip which we don't have yet.
>>>>>
>>>>> It's not strictly required.  You have an interrupt line leading to the
>>>>> core, no?  You could have your irqfd trigger that.
>>>>
>>>> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what
>>>> this patch set is cleaning up: enable you to build ioeventfd support without
>>>> in-kernel irqchip irqfd support.
>>>>
>>>> Vhost can work without in-kernel irqchip too, by listening to an eventfd
>>>> (iiuc) in user space. That path works just fine with these patches applied.
>>>
>>> That's all true but it wasn't my point.  I was asking whether you can
>>> enable irqfd without in-kernel irqchip.  If your irq input is edge
>>> triggered then it's trivial.  If it's level triggered then you can use
>>> the new resampling irqfd.
>>
>> I'm not sure I fully grasp what you're trying to say :).
>>
>> We have a single interrupt line on the core. So whenever any external
>> interrupt gets injected (same thing for MSI), we need to go to the MPIC /
>> XICS and ask it which line is active.
>
> Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
> associated with a gsi, and a gsi with extra information, including all
> that is needed to queue an MSI.
>
>> So yes, we could create a direct fd channel between vhost and the user
>> space MPIC, but it wouldn't buy us anything. The interrupt injection path
>> would be as long as it is with the current mechanism.
>
> If there is a lot of prioritization and/or queuing logic, then yes.  But
> what about MSI?  Doesn't that have a direct path?

Nope. Well, yes, in a certain special case where the MPIC pushes the
interrupt vector on interrupt delivery into a special register. But not
for the "normal" case.



Alex



Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/16/2012 01:06 PM, Alexander Graf wrote:
> 
> On 16.10.2012, at 13:01, Avi Kivity wrote:
> 
>> On 10/16/2012 12:59 PM, Alexander Graf wrote:
>>> 
>>> On 16.10.2012, at 12:56, Avi Kivity wrote:
>>> 
>>>> On 10/15/2012 02:02 PM, Alexander Graf wrote:
>>>>> In order to support vhost, we need to be able to support ioeventfd.
>>>>>
>>>>> This patch set adds support for ioeventfd to PPC and makes it possible to
>>>>> do so without implementing irqfd along the way, as it requires an in-kernel
>>>>> irqchip which we don't have yet.
>>>>
>>>> It's not strictly required.  You have an interrupt line leading to the
>>>> core, no?  You could have your irqfd trigger that.
>>> 
>>> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
>>> this patch set is cleaning up: enable you to build ioeventfd support 
>>> without in-kernel irqchip irqfd support.
>>> 
>>> Vhost can work without in-kernel irqchip too, by listening to an eventfd 
>>> (iiuc) in user space. That path works just fine with these patches applied.
>> 
>> That's all true but it wasn't my point.  I was asking whether you can
>> enable irqfd without in-kernel irqchip.  If your irq input is edge
>> triggered then it's trivial.  If it's level triggered then you can use
>> the new resampling irqfd.
> 
> I'm not sure I fully grasp what you're trying to say :).
> 
> We have a single interrupt line on the core. So whenever any external 
> interrupt gets injected (same thing for MSI), we need to go to the MPIC / 
> XICS and ask it which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.

> 
> So yes, we could create a direct fd channel between vhost and the user space 
> MPIC, but it wouldn't buy us anything. The interrupt injection path would be 
> as long as it is with the current mechanism.
> 

If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity
On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
>> > +4.78 KVM_PPC_GET_HTAB_FD
>> > +
>> > +Capability: KVM_CAP_PPC_HTAB_FD
>> > +Architectures: powerpc
>> > +Type: vm ioctl
>> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
>> > +Returns: file descriptor number (>= 0) on success, -1 on error
>> > +
>> > +This returns a file descriptor that can be used either to read out the
>> > +entries in the guest's hashed page table (HPT), or to write entries to
>> > +initialize the HPT.  The returned fd can only be written to if the
>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
>> > +can only be read if that bit is clear.  The argument struct looks like
>> > +this:
>> > +
>> > +/* For KVM_PPC_GET_HTAB_FD */
>> > +struct kvm_get_htab_fd {
>> > +  __u64   flags;
>> > +  __u64   start_index;
>> > +};
>> > +
>> > +/* Values for kvm_get_htab_fd.flags */
>> > +#define KVM_GET_HTAB_BOLTED_ONLY  ((__u64)0x1)
>> > +#define KVM_GET_HTAB_WRITE((__u64)0x2)
>> > +
>> > +The `start_index' field gives the index in the HPT of the entry at
>> > +which to start reading.  It is ignored when writing.
>> > +
>> > +Reads on the fd will initially supply information about all
>> > +"interesting" HPT entries.  Interesting entries are those with the
>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
>> > +all entries.  When the end of the HPT is reached, the read() will
>> > +return.  
>> 
>> What happens if the read buffer is smaller than the HPT size?
> 
> That's fine; the read stops when it has filled the buffer and a
> subsequent read will continue from where the previous one finished.
> 
>> What happens if the read buffer size is not a multiple of entry size?
> 
> Then we don't use the last few bytes of the buffer.  The read() call
> returns the number of bytes that were filled in, of course.  In any
> case, the header size is 8 bytes and the HPT entry size is 16 bytes,
> so the number of bytes filled in won't necessarily be a multiple of 16
> bytes.

That's sane and expected, but it should be documented.

> 
>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>> to change).
> 
> No.

This forces userspace to dedicate a thread for the HPT.

> 
>> > If read() is called again on the fd, it will start again from
>> > +the beginning of the HPT, but will only return HPT entries that have
>> > +changed since they were last read.
>> > +
>> > +Data read or written is structured as a header (8 bytes) followed by a
>> > +series of valid HPT entries (16 bytes) each.  The header indicates how
>> > +many valid HPT entries there are and how many invalid entries follow
>> > +the valid entries.  The invalid entries are not represented explicitly
>> > +in the stream.  The header format is:
>> > +
>> > +struct kvm_get_htab_header {
>> > +  __u32   index;
>> > +  __u16   n_valid;
>> > +  __u16   n_invalid;
>> > +};
>> 
>> This structure forces the kernel to return entries sequentially.  Will
>> this block changing the data structure in the future?  Or is the
>> hardware spec sufficiently strict that such changes are not realistic?
> 
> By "data structure", do you mean the stream format on the file
> descriptor, or the HPT structure?  If we want a new stream format,
> then we would define a bit in the flags field of struct
> kvm_get_htab_fd to mean "I want the new stream format".  The code
> fails the ioctl if any unknown flag bits are set, so a new userspace
> that wants to use the new format could then detect that it is running
> on an old kernel and fall back to the old format.
> 
> The HPT entry format is very unlikely to change in size or basic
> layout (though the architects do redefine some of the bits
> occasionally).

I meant the internal data structure that holds HPT entries.

I guess I don't understand the index.  Do we expect changes to be in
contiguous ranges?  And invalid entries to be contiguous as well?  That
doesn't fit with how hash tables work.  Does the index represent the
position of the entry within the table, or something else?


> 
>> > +
>> > +Writes to the fd create HPT entries starting at the index given in the
>> > +header; first `n_valid' valid entries with contents from the data
>> > +written, then `n_invalid' invalid entries, invalidating any previously
>> > +valid entries found.
>> 
>> This scheme is a clever, original, and very interesting approach to live
>> migration.  That doesn't necessarily mean a NAK, we should see if it
>> makes sense for other migration APIs as well (we currently have
>> difficulties migrating very large/wide guests).
>> 
>> What is the typical number of entries in the HPT?  Do you have estimates
>> of the change rate?
> 
> Typically the HPT would have about a million entries, i.e. it would be
> 16MiB in size.  The usual guideline is to make it about 1/64 of the
> maximum amount of RAM the guest could ever have, rounded up to a power
> of two, although we often run with less, say 1/128 or even 1/256.

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras
On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> > +4.78 KVM_PPC_GET_HTAB_FD
> > +
> > +Capability: KVM_CAP_PPC_HTAB_FD
> > +Architectures: powerpc
> > +Type: vm ioctl
> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
> > +Returns: file descriptor number (>= 0) on success, -1 on error
> > +
> > +This returns a file descriptor that can be used either to read out the
> > +entries in the guest's hashed page table (HPT), or to write entries to
> > +initialize the HPT.  The returned fd can only be written to if the
> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> > +can only be read if that bit is clear.  The argument struct looks like
> > +this:
> > +
> > +/* For KVM_PPC_GET_HTAB_FD */
> > +struct kvm_get_htab_fd {
> > +   __u64   flags;
> > +   __u64   start_index;
> > +};
> > +
> > +/* Values for kvm_get_htab_fd.flags */
> > +#define KVM_GET_HTAB_BOLTED_ONLY   ((__u64)0x1)
> > +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
> > +
> > +The `start_index' field gives the index in the HPT of the entry at
> > +which to start reading.  It is ignored when writing.
> > +
> > +Reads on the fd will initially supply information about all
> > +"interesting" HPT entries.  Interesting entries are those with the
> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> > +all entries.  When the end of the HPT is reached, the read() will
> > +return.  
> 
> What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

> What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer.  The read() call
returns the number of bytes that were filled in, of course.  In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.

> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> to change).

No.

> > If read() is called again on the fd, it will start again from
> > +the beginning of the HPT, but will only return HPT entries that have
> > +changed since they were last read.
> > +
> > +Data read or written is structured as a header (8 bytes) followed by a
> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> > +many valid HPT entries there are and how many invalid entries follow
> > +the valid entries.  The invalid entries are not represented explicitly
> > +in the stream.  The header format is:
> > +
> > +struct kvm_get_htab_header {
> > +   __u32   index;
> > +   __u16   n_valid;
> > +   __u16   n_invalid;
> > +};
> 
> This structure forces the kernel to return entries sequentially.  Will
> this block changing the data structure in the future?  Or is the
> hardware spec sufficiently strict that such changes are not realistic?

By "data structure", do you mean the stream format on the file
descriptor, or the HPT structure?  If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean "I want the new stream format".  The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).

> > +
> > +Writes to the fd create HPT entries starting at the index given in the
> > +header; first `n_valid' valid entries with contents from the data
> > +written, then `n_invalid' invalid entries, invalidating any previously
> > +valid entries found.
> 
> This scheme is a clever, original, and very interesting approach to live
> migration.  That doesn't necessarily mean a NAK, we should see if it
> makes sense for other migration APIs as well (we currently have
> difficulties migrating very large/wide guests).
> 
> What is the typical number of entries in the HPT?  Do you have estimates
> of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size.  The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
> made kvm unaware of guest virtual->physical translations on x86).  How
> will we deal with that?  But I guess this will be a
> non-guest-transparent and non-userspace-transparent change, unlike
> NPT/EPT, so a userspace ABI addition will be needed anyway.

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 13:01, Avi Kivity wrote:

> On 10/16/2012 12:59 PM, Alexander Graf wrote:
>> 
>> On 16.10.2012, at 12:56, Avi Kivity wrote:
>> 
>>> On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
>>> 
>>> It's not strictly required.  You have an interrupt line leading to the
>>> core, no?  You could have your irqfd trigger that.
>> 
>> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
>> this patch set is cleaning up: enable you to build ioeventfd support without 
>> in-kernel irqchip irqfd support.
>> 
>> Vhost can work without in-kernel irqchip too, by listening to an eventfd 
>> (iiuc) in user space. That path works just fine with these patches applied.
> 
> That's all true but it wasn't my point.  I was asking whether you can
> enable irqfd without in-kernel irqchip.  If your irq input is edge
> triggered then it's trivial.  If it's level triggered then you can use
> the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


Alex



Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 12:57, Avi Kivity wrote:

> On 10/15/2012 02:02 PM, Alexander Graf wrote:
>> The current eventfd code assumes that when we have eventfd, we also have
>> irqfd for in-kernel interrupt delivery. This is not necessarily true. On
>> PPC we don't have an in-kernel irqchip yet, but we can still easily
>> support eventfd.
>> 
> 
> Don't you need, in addition, to reject PIO space for ioeventfd?

Yeah, we could tell user space that it's doing something stupid. Not sure how 
critical it is - right now it would just register it and never get triggered, 
because the bus doesn't exist. But sanity checks are a good thing usually.
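
Something like the following in the generic assign path, presumably (a
sketch: KVM_IOEVENTFD_FLAG_PIO and the args parameter do exist in
virt/kvm/eventfd.c, but kvm_arch_has_pio_bus() is a hypothetical helper,
not a real function):

	/* Sketch only -- reject PIO-space ioeventfds on architectures
	 * that have no PIO bus to dispatch them on: */
	if ((args->flags & KVM_IOEVENTFD_FLAG_PIO) &&
	    !kvm_arch_has_pio_bus())		/* hypothetical predicate */
		return -EINVAL;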


Alex



Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/16/2012 12:59 PM, Alexander Graf wrote:
> 
> On 16.10.2012, at 12:56, Avi Kivity wrote:
> 
>> On 10/15/2012 02:02 PM, Alexander Graf wrote:
>>> In order to support vhost, we need to be able to support ioeventfd.
>>> 
>>> This patch set adds support for ioeventfd to PPC and makes it possible to
>>> do so without implementing irqfd along the way, as it requires an in-kernel
>>> irqchip which we don't have yet.
>> 
>> It's not strictly required.  You have an interrupt line leading to the
>> core, no?  You could have your irqfd trigger that.
> 
> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
> this patch set is cleaning up: enable you to build ioeventfd support without 
> in-kernel irqchip irqfd support.
> 
> Vhost can work without in-kernel irqchip too, by listening to an eventfd 
> (iiuc) in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.
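
(For reference, the userspace side of a resampling irqfd looks roughly
like this -- struct kvm_irqfd, KVM_IRQFD_FLAG_RESAMPLE and the resamplefd
field are the pieces added for level-triggered sources; the wrapper itself
is just a sketch with caller-supplied values:)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int attach_resample_irqfd(int vm_fd, int trigger_fd,
                                 int resample_fd, unsigned int gsi)
{
        struct kvm_irqfd irqfd = {
                .fd         = trigger_fd,    /* signalled to assert the line */
                .gsi        = gsi,
                .flags      = KVM_IRQFD_FLAG_RESAMPLE,
                .resamplefd = resample_fd,   /* signalled on EOI so the source
                                                can decide whether to re-assert */
        };

        return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}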


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 12:56, Avi Kivity wrote:

> On 10/15/2012 02:02 PM, Alexander Graf wrote:
>> In order to support vhost, we need to be able to support ioeventfd.
>> 
>> This patch set adds support for ioeventfd to PPC and makes it possible to
>> do so without implementing irqfd along the way, as it requires an in-kernel
>> irqchip which we don't have yet.
> 
> It's not strictly required.  You have an interrupt line leading to the
> core, no?  You could have your irqfd trigger that.

The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this 
patch set is cleaning up: enable you to build ioeventfd support without 
in-kernel irqchip irqfd support.

Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) 
in user space. That path works just fine with these patches applied.
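
(For concreteness, what these patches enable is plain KVM_IOEVENTFD
registration on the MMIO bus; a sketch with placeholder values:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: register notify_fd so that a guest write to mmio_addr kicks
 * the eventfd instead of forcing a heavyweight exit to userspace. */
static int attach_mmio_ioeventfd(int vm_fd, int notify_fd, uint64_t mmio_addr)
{
        struct kvm_ioeventfd io = {
                .addr  = mmio_addr,   /* e.g. a virtio queue notify address */
                .len   = 4,
                .fd    = notify_fd,   /* vhost (or a userspace thread) waits here */
                .flags = 0,           /* MMIO bus, no datamatch, no PIO */
        };

        return ioctl(vm_fd, KVM_IOEVENTFD, &io);
}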


Alex



Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Avi Kivity
On 10/15/2012 02:02 PM, Alexander Graf wrote:
> The current eventfd code assumes that when we have eventfd, we also have
> irqfd for in-kernel interrupt delivery. This is not necessarily true. On
> PPC we don't have an in-kernel irqchip yet, but we can still easily
> support eventfd.
> 

Don't you need, in addition, to reject PIO space for ioeventfd?


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity
On 10/15/2012 02:02 PM, Alexander Graf wrote:
> In order to support vhost, we need to be able to support ioeventfd.
> 
> This patch set adds support for ioeventfd to PPC and makes it possible to
> do so without implementing irqfd along the way, as it requires an in-kernel
> irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity
On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
> this fd return the contents of the HPT (hashed page table), writes
> create and/or remove entries in the HPT.  There is a new capability,
> KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
> takes an argument structure with the index of the first HPT entry to
> read out and a set of flags.  The flags indicate whether the user is
> intending to read or write the HPT, and whether to return all entries
> or only the "bolted" entries (those with the bolted bit, 0x10, set in
> the first doubleword).
> 
> This is intended for use in implementing qemu's savevm/loadvm and for
> live migration.  Therefore, on reads, the first pass returns information
> about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
> end of the HPT, it returns from the read.  Subsequent reads only return
> information about HPTEs that have changed since they were last read.
> A read that finds no changed HPTEs in the HPT following where the last
> read finished will return 0 bytes.
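
For concreteness, the call sequence described above amounts to (a sketch;
vm_fd is a placeholder for the VM descriptor and error handling is
omitted):

	struct kvm_get_htab_fd ghf = {
		.flags       = 0,   /* or KVM_GET_HTAB_BOLTED_ONLY; set
				       KVM_GET_HTAB_WRITE for restore */
		.start_index = 0,
	};
	int htab_fd = ioctl(vm_fd, KVM_PPC_GET_HTAB_FD, &ghf);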

Copying people with interest in migration.

> +4.78 KVM_PPC_GET_HTAB_FD
> +
> +Capability: KVM_CAP_PPC_HTAB_FD
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: Pointer to struct kvm_get_htab_fd (in)
> +Returns: file descriptor number (>= 0) on success, -1 on error
> +
> +This returns a file descriptor that can be used either to read out the
> +entries in the guest's hashed page table (HPT), or to write entries to
> +initialize the HPT.  The returned fd can only be written to if the
> +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> +can only be read if that bit is clear.  The argument struct looks like
> +this:
> +
> +/* For KVM_PPC_GET_HTAB_FD */
> +struct kvm_get_htab_fd {
> + __u64   flags;
> + __u64   start_index;
> +};
> +
> +/* Values for kvm_get_htab_fd.flags */
> +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
> +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
> +
> +The `start_index' field gives the index in the HPT of the entry at
> +which to start reading.  It is ignored when writing.
> +
> +Reads on the fd will initially supply information about all
> +"interesting" HPT entries.  Interesting entries are those with the
> +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> +all entries.  When the end of the HPT is reached, the read() will
> +return.  

What happens if the read buffer is smaller than the HPT size?

What happens if the read buffer size is not a multiple of entry size?

Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
to change).

> If read() is called again on the fd, it will start again from
> +the beginning of the HPT, but will only return HPT entries that have
> +changed since they were last read.
> +
> +Data read or written is structured as a header (8 bytes) followed by a
> +series of valid HPT entries (16 bytes) each.  The header indicates how
> +many valid HPT entries there are and how many invalid entries follow
> +the valid entries.  The invalid entries are not represented explicitly
> +in the stream.  The header format is:
> +
> +struct kvm_get_htab_header {
> + __u32   index;
> + __u16   n_valid;
> + __u16   n_invalid;
> +};

This structure forces the kernel to return entries sequentially.  Will
this block changing the data structure in the future?  Or is the
hardware spec sufficiently strict that such changes are not realistic?

> +
> +Writes to the fd create HPT entries starting at the index given in the
> +header; first `n_valid' valid entries with contents from the data
> +written, then `n_invalid' invalid entries, invalidating any previously
> +valid entries found.

This scheme is a clever, original, and very interesting approach to live
migration.  That doesn't necessarily mean a NAK, we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).

What is the typical number of entries in the HPT?  Do you have estimates
of the change rate?

Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual->physical translations on x86).  How
will we deal with that?  But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Alexander Graf

On 16.10.2012, at 11:44, Avi Kivity wrote:

> On 10/16/2012 05:59 AM, Paul Mackerras wrote:
>> The mmu_notifier_retry() function, used to test whether any page
>> invalidations are in progress, currently takes a vcpu pointer, though
>> the code only needs the VM's struct kvm pointer.  Forthcoming patches
>> to the powerpc Book3S HV code will need to test for retry within a VM
>> ioctl, where a struct kvm pointer is available but a struct vcpu
>> pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
>> called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
>> implements mmu_notifier_retry in terms of it.
> 
> Why not change mmu_notifier_retry() and all its callers?

Why not use Christoffer's patch? :)

Alex



Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Avi Kivity
On 10/16/2012 05:59 AM, Paul Mackerras wrote:
> The mmu_notifier_retry() function, used to test whether any page
> invalidations are in progress, currently takes a vcpu pointer, though
> the code only needs the VM's struct kvm pointer.  Forthcoming patches
> to the powerpc Book3S HV code will need to test for retry within a VM
> ioctl, where a struct kvm pointer is available but a struct vcpu
> pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
> called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
> implements mmu_notifier_retry in terms of it.

Why not change mmu_notifier_retry() and all its callers?
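
For reference, the split Paul describes would look roughly like this
(reconstructed from the commit message, not copied from the patch):

/* Variant taking struct kvm, usable from a VM ioctl: */
static inline int kvm_mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	if (unlikely(kvm->mmu_notifier_count))
		return 1;
	smp_rmb();	/* order the count check against the seq check */
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;
	return 0;
}

/* Existing interface, now just a wrapper: */
static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
{
	return kvm_mmu_notifier_retry(vcpu->kvm, mmu_seq);
}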


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/8] Various Book3s HV fixes that haven't been picked up yet

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 05:08, Paul Mackerras wrote:

> On Mon, Oct 15, 2012 at 02:00:54PM +0200, Alexander Graf wrote:
>> 
>> Sorry, I can't accept patches that haven't shown up on kvm@vger. Please send 
>> this patch set again with CC to kvm@vger.
> 
> Done; I didn't cc kvm-ppc this time since the patches haven't changed.

Thanks :)

> 
> By the way, what is the purpose of kvm-ppc@vger.kernel.org?

It's basically a tag so people can filter based on it. For me, mails to 
kvm-ppc@vger land straight in my inbox, while mails to kvm@vger go to a folder 
I only read occasionally.

Alex

> 
> Paul.