Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing
On 16.10.2012, at 21:33, Benjamin Herrenschmidt wrote: > On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote: >>> Thanks for looking at this - but in fact this is fixed by my patch >>> entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary >>> threads" submitted back on August 28. >> >> OK thanks. It seems that patch didn't make 3.7 ? >> >> I don't see it in kvm-ppc-next either. > > Alex, WTF ? Hrm. Not sure what happened there. I think I waited for your ack, but never actually applied things when it came. My bad :) Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote: > On 10/16/2012 01:58 PM, Paul Mackerras wrote: > > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: > >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry > >> to change). > > > > No. > > This forces userspace to dedicate a thread for the HPT. Why? Reads never block in any case. > >> > If read() is called again on the fd, it will start again from > >> > +the beginning of the HPT, but will only return HPT entries that have > >> > +changed since they were last read. > >> > + > >> > +Data read or written is structured as a header (8 bytes) followed by a > >> > +series of valid HPT entries (16 bytes) each. The header indicates how > >> > +many valid HPT entries there are and how many invalid entries follow > >> > +the valid entries. The invalid entries are not represented explicitly > >> > +in the stream. The header format is: > >> > + > >> > +struct kvm_get_htab_header { > >> > +__u32 index; > >> > +__u16 n_valid; > >> > +__u16 n_invalid; > >> > +}; > >> > >> This structure forces the kernel to return entries sequentially. Will > >> this block changing the data structure in the future? Or is the > >> hardware spec sufficiently strict that such changes are not realistic? > > > > By "data structure", do you mean the stream format on the file > > descriptor, or the HPT structure? If we want a new stream format, > > then we would define a bit in the flags field of struct > > kvm_get_htab_fd to mean "I want the new stream format". The code > > fails the ioctl if any unknown flag bits are set, so a new userspace > > that wants to use the new format could then detect that it is running > > on an old kernel and fall back to the old format. > > > > The HPT entry format is very unlikely to change in size or basic > > layout (though the architects do redefine some of the bits > > occasionally). > > I meant the internal data structure that holds HPT entries. Oh, that's just an array, and userspace already knows how big it is. > I guess I don't understand the index. Do we expect changes to be in > contiguous ranges? And invalid entries to be contiguous as well? That > doesn't fit with how hash tables work. Does the index represent the > position of the entry within the table, or something else? The index is just the position in the array. Typically, in each group of 8 it will tend to be the low-numbered ones that are valid, since creating an entry usually uses the first empty slot. So I expect that on the first pass, most of the records will represent 8 HPTEs. On subsequent passes, probably most records will represent a single HPTE. > >> > + > >> > +Writes to the fd create HPT entries starting at the index given in the > >> > +header; first `n_valid' valid entries with contents from the data > >> > +written, then `n_invalid' invalid entries, invalidating any previously > >> > +valid entries found. > >> > >> This scheme is a clever, original, and very interesting approach to live > >> migration. That doesn't necessarily mean a NAK, we should see if it > >> makes sense for other migration APIs as well (we currently have > >> difficulties migrating very large/wide guests). > >> > >> What is the typical number of entries in the HPT? Do you have estimates > >> of the change rate? > > > > Typically the HPT would have about a million entries, i.e. it would be > > 16MiB in size. The usual guideline is to make it about 1/64 of the > > maximum amount of RAM the guest could ever have, rounded up to a power > > of two, although we often run with less, say 1/128 or even 1/256. > > 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE. Does > it warrant a live migration protocol? The qemu people I talked to seemed to think so. > > Because it is a hash table, updates tend to be scattered throughout > > the whole table, which is another reason why per-page dirty tracking > > and updates would be pretty inefficient. > > This suggests a stream format that includes the index in every entry. That would amount to dropping the n_valid and n_invalid fields from the current header format. That would be less efficient for the initial pass (assuming we achieve an average n_valid of at least 2 on the initial pass), and probably less efficient for the incremental updates, since a newly-invalidated entry would have to be represented as 16 zero bytes rather than just an 8-byte header with n_valid=0 and n_invalid=1. I'm assuming here that the initial pass would omit invalid entries. > > > > As for the change rate, it depends on the application of course, but > > basically every time the guest changes a PTE in its Linux page tables > > we do the corresponding change to the corresponding HPT entry, so the > > rate can be quite high. Workloads that do a lot of fork, exit, mmap, > > exec, etc. have a high rate of HPT updates. > > If the rate is high enough, then there's no point in
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
Avi Kivity writes: > On 10/16/2012 01:58 PM, Paul Mackerras wrote: >> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: >>> On 10/16/2012 06:01 AM, Paul Mackerras wrote: >>> > +4.78 KVM_PPC_GET_HTAB_FD >>> > + >>> > +Capability: KVM_CAP_PPC_HTAB_FD >>> > +Architectures: powerpc >>> > +Type: vm ioctl >>> > +Parameters: Pointer to struct kvm_get_htab_fd (in) >>> > +Returns: file descriptor number (>= 0) on success, -1 on error >>> > + >>> > +This returns a file descriptor that can be used either to read out the >>> > +entries in the guest's hashed page table (HPT), or to write entries to >>> > +initialize the HPT. The returned fd can only be written to if the >>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and >>> > +can only be read if that bit is clear. The argument struct looks like >>> > +this: >>> > + >>> > +/* For KVM_PPC_GET_HTAB_FD */ >>> > +struct kvm_get_htab_fd { >>> > + __u64 flags; >>> > + __u64 start_index; >>> > +}; >>> > + >>> > +/* Values for kvm_get_htab_fd.flags */ >>> > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) >>> > +#define KVM_GET_HTAB_WRITE ((__u64)0x2) >>> > + >>> > +The `start_index' field gives the index in the HPT of the entry at >>> > +which to start reading. It is ignored when writing. >>> > + >>> > +Reads on the fd will initially supply information about all >>> > +"interesting" HPT entries. Interesting entries are those with the >>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise >>> > +all entries. When the end of the HPT is reached, the read() will >>> > +return. >>> >>> What happens if the read buffer is smaller than the HPT size? >> >> That's fine; the read stops when it has filled the buffer and a >> subsequent read will continue from where the previous one finished. >> >>> What happens if the read buffer size is not a multiple of entry size? >> >> Then we don't use the last few bytes of the buffer. The read() call >> returns the number of bytes that were filled in, of course. In any >> case, the header size is 8 bytes and the HPT entry size is 16 bytes, >> so the number of bytes filled in won't necessarily be a multiple of 16 >> bytes. > > That's sane and expected, but it should be documented. > >> >>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry >>> to change). >> >> No. > > This forces userspace to dedicate a thread for the HPT. If no changes are available, does read return a size > 0? I don't think it's necessary to support polling. The kernel should always be able to respond to userspace here. The only catch is whether to return !0 read sizes when there are no changes. At any case, I can't see why a dedicated thread is needed. QEMU is going to poll HPT based on how fast we can send data over the wire. >>> > If read() is called again on the fd, it will start again from >>> > +the beginning of the HPT, but will only return HPT entries that have >>> > +changed since they were last read. >>> > + >>> > +Data read or written is structured as a header (8 bytes) followed by a >>> > +series of valid HPT entries (16 bytes) each. The header indicates how >>> > +many valid HPT entries there are and how many invalid entries follow >>> > +the valid entries. The invalid entries are not represented explicitly >>> > +in the stream. The header format is: >>> > + >>> > +struct kvm_get_htab_header { >>> > + __u32 index; >>> > + __u16 n_valid; >>> > + __u16 n_invalid; >>> > +}; >>> >>> This structure forces the kernel to return entries sequentially. Will >>> this block changing the data structure in the future? Or is the >>> hardware spec sufficiently strict that such changes are not realistic? >> >> By "data structure", do you mean the stream format on the file >> descriptor, or the HPT structure? If we want a new stream format, >> then we would define a bit in the flags field of struct >> kvm_get_htab_fd to mean "I want the new stream format". The code >> fails the ioctl if any unknown flag bits are set, so a new userspace >> that wants to use the new format could then detect that it is running >> on an old kernel and fall back to the old format. >> >> The HPT entry format is very unlikely to change in size or basic >> layout (though the architects do redefine some of the bits >> occasionally). > > I meant the internal data structure that holds HPT entries. > > I guess I don't understand the index. Do we expect changes to be in > contiguous ranges? And invalid entries to be contiguous as well? That > doesn't fit with how hash tables work. Does the index represent the > position of the entry within the table, or something else? > > >> >>> > + >>> > +Writes to the fd create HPT entries starting at the index given in the >>> > +header; first `n_valid' valid entries with contents from the data >>> > +written, then `n_invalid' invalid entries, invalidating any previously >>> > +valid entries found. >>> >>> This scheme is a clever, origina
Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing
On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote: > > Thanks for looking at this - but in fact this is fixed by my patch > > entitled "KVM: PPC: Book3S HV: Fix some races in starting secondary > > threads" submitted back on August 28. > > OK thanks. It seems that patch didn't make 3.7 ? > > I don't see it in kvm-ppc-next either. Alex, WTF ? Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 03:47 PM, Avi Kivity wrote: On 10/16/2012 01:06 PM, Alexander Graf wrote: On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? Nope. Well, yes, in a certain special case where the MPIC pushes the interrupt vector on interrupt delivery into a special register. But not for the "normal" case. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 01:06 PM, Alexander Graf wrote: > > On 16.10.2012, at 13:01, Avi Kivity wrote: > >> On 10/16/2012 12:59 PM, Alexander Graf wrote: >>> >>> On 16.10.2012, at 12:56, Avi Kivity wrote: >>> On 10/15/2012 02:02 PM, Alexander Graf wrote: > In order to support vhost, we need to be able to support ioeventfd. > > This patch set adds support for ioeventfd to PPC and makes it possible to > do so without implementing irqfd along the way, as it requires an > in-kernel > irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. >>> >>> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what >>> this patch set is cleaning up: enable you to build ioeventfd support >>> without in-kernel irqchip irqfd support. >>> >>> Vhost can work without in-kernel irqchip too, by listening to an eventfd >>> (iiuc) in user space. That path works just fine with these patches applied. >> >> That's all true but it wasn't my point. I was asking whether you can >> enable irqfd without in-kernel irqchip. If your irq input is edge >> triggered then it's trivial. If it's level triggered then you can use >> the new resampling irqfd. > > I'm not sure I fully grasp what you're trying to say :). > > We have a single interrupt line on the core. So whenever any external > interrupt gets injected (same thing for MSI), we need to go to the MPIC / > XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. > > So yes, we could create a direct fd channel between vhost and the user space > MPIC, but it wouldn't buy us anything. The interrupt injection path would be > as long as it is with the current mechanism. > If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 01:58 PM, Paul Mackerras wrote: > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: >> On 10/16/2012 06:01 AM, Paul Mackerras wrote: >> > +4.78 KVM_PPC_GET_HTAB_FD >> > + >> > +Capability: KVM_CAP_PPC_HTAB_FD >> > +Architectures: powerpc >> > +Type: vm ioctl >> > +Parameters: Pointer to struct kvm_get_htab_fd (in) >> > +Returns: file descriptor number (>= 0) on success, -1 on error >> > + >> > +This returns a file descriptor that can be used either to read out the >> > +entries in the guest's hashed page table (HPT), or to write entries to >> > +initialize the HPT. The returned fd can only be written to if the >> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and >> > +can only be read if that bit is clear. The argument struct looks like >> > +this: >> > + >> > +/* For KVM_PPC_GET_HTAB_FD */ >> > +struct kvm_get_htab_fd { >> > + __u64 flags; >> > + __u64 start_index; >> > +}; >> > + >> > +/* Values for kvm_get_htab_fd.flags */ >> > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) >> > +#define KVM_GET_HTAB_WRITE((__u64)0x2) >> > + >> > +The `start_index' field gives the index in the HPT of the entry at >> > +which to start reading. It is ignored when writing. >> > + >> > +Reads on the fd will initially supply information about all >> > +"interesting" HPT entries. Interesting entries are those with the >> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise >> > +all entries. When the end of the HPT is reached, the read() will >> > +return. >> >> What happens if the read buffer is smaller than the HPT size? > > That's fine; the read stops when it has filled the buffer and a > subsequent read will continue from where the previous one finished. > >> What happens if the read buffer size is not a multiple of entry size? > > Then we don't use the last few bytes of the buffer. The read() call > returns the number of bytes that were filled in, of course. In any > case, the header size is 8 bytes and the HPT entry size is 16 bytes, > so the number of bytes filled in won't necessarily be a multiple of 16 > bytes. That's sane and expected, but it should be documented. > >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry >> to change). > > No. This forces userspace to dedicate a thread for the HPT. > >> > If read() is called again on the fd, it will start again from >> > +the beginning of the HPT, but will only return HPT entries that have >> > +changed since they were last read. >> > + >> > +Data read or written is structured as a header (8 bytes) followed by a >> > +series of valid HPT entries (16 bytes) each. The header indicates how >> > +many valid HPT entries there are and how many invalid entries follow >> > +the valid entries. The invalid entries are not represented explicitly >> > +in the stream. The header format is: >> > + >> > +struct kvm_get_htab_header { >> > + __u32 index; >> > + __u16 n_valid; >> > + __u16 n_invalid; >> > +}; >> >> This structure forces the kernel to return entries sequentially. Will >> this block changing the data structure in the future? Or is the >> hardware spec sufficiently strict that such changes are not realistic? > > By "data structure", do you mean the stream format on the file > descriptor, or the HPT structure? If we want a new stream format, > then we would define a bit in the flags field of struct > kvm_get_htab_fd to mean "I want the new stream format". The code > fails the ioctl if any unknown flag bits are set, so a new userspace > that wants to use the new format could then detect that it is running > on an old kernel and fall back to the old format. > > The HPT entry format is very unlikely to change in size or basic > layout (though the architects do redefine some of the bits > occasionally). I meant the internal data structure that holds HPT entries. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? > >> > + >> > +Writes to the fd create HPT entries starting at the index given in the >> > +header; first `n_valid' valid entries with contents from the data >> > +written, then `n_invalid' invalid entries, invalidating any previously >> > +valid entries found. >> >> This scheme is a clever, original, and very interesting approach to live >> migration. That doesn't necessarily mean a NAK, we should see if it >> makes sense for other migration APIs as well (we currently have >> difficulties migrating very large/wide guests). >> >> What is the typical number of entries in the HPT? Do you have estimates >> of the change rate? > > Typically the HPT would have about a million entries, i.e. it would be > 16MiB in size. The usual guideline is to make it about 1/64 of the > maximum amount of RAM the guest
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: > On 10/16/2012 06:01 AM, Paul Mackerras wrote: > > +4.78 KVM_PPC_GET_HTAB_FD > > + > > +Capability: KVM_CAP_PPC_HTAB_FD > > +Architectures: powerpc > > +Type: vm ioctl > > +Parameters: Pointer to struct kvm_get_htab_fd (in) > > +Returns: file descriptor number (>= 0) on success, -1 on error > > + > > +This returns a file descriptor that can be used either to read out the > > +entries in the guest's hashed page table (HPT), or to write entries to > > +initialize the HPT. The returned fd can only be written to if the > > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and > > +can only be read if that bit is clear. The argument struct looks like > > +this: > > + > > +/* For KVM_PPC_GET_HTAB_FD */ > > +struct kvm_get_htab_fd { > > + __u64 flags; > > + __u64 start_index; > > +}; > > + > > +/* Values for kvm_get_htab_fd.flags */ > > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) > > +#define KVM_GET_HTAB_WRITE ((__u64)0x2) > > + > > +The `start_index' field gives the index in the HPT of the entry at > > +which to start reading. It is ignored when writing. > > + > > +Reads on the fd will initially supply information about all > > +"interesting" HPT entries. Interesting entries are those with the > > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise > > +all entries. When the end of the HPT is reached, the read() will > > +return. > > What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. > What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. > Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry > to change). No. > > If read() is called again on the fd, it will start again from > > +the beginning of the HPT, but will only return HPT entries that have > > +changed since they were last read. > > + > > +Data read or written is structured as a header (8 bytes) followed by a > > +series of valid HPT entries (16 bytes) each. The header indicates how > > +many valid HPT entries there are and how many invalid entries follow > > +the valid entries. The invalid entries are not represented explicitly > > +in the stream. The header format is: > > + > > +struct kvm_get_htab_header { > > + __u32 index; > > + __u16 n_valid; > > + __u16 n_invalid; > > +}; > > This structure forces the kernel to return entries sequentially. Will > this block changing the data structure in the future? Or is the > hardware spec sufficiently strict that such changes are not realistic? By "data structure", do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean "I want the new stream format". The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). > > + > > +Writes to the fd create HPT entries starting at the index given in the > > +header; first `n_valid' valid entries with contents from the data > > +written, then `n_invalid' invalid entries, invalidating any previously > > +valid entries found. > > This scheme is a clever, original, and very interesting approach to live > migration. That doesn't necessarily mean a NAK, we should see if it > makes sense for other migration APIs as well (we currently have > difficulties migrating very large/wide guests). > > What is the typical number of entries in the HPT? Do you have estimates > of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. Because it is a hash table, updates tend to be scattered throughout the whole table, which is another reason why per-page dirty tracking and updates would be pretty inefficient. As for the change rate, it depends on the application of course, but basically every time the guest changes a PTE in its Linux page tables we do the corresponding change to the corresponding HPT entry, so the rate can be quite high. Workloads that do a lot of fork, exit, mmap, exec, etc. have a high rate of HPT updates. > Suppose new hardwar
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 16.10.2012, at 13:01, Avi Kivity wrote: > On 10/16/2012 12:59 PM, Alexander Graf wrote: >> >> On 16.10.2012, at 12:56, Avi Kivity wrote: >> >>> On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. >>> >>> It's not strictly required. You have an interrupt line leading to the >>> core, no? You could have your irqfd trigger that. >> >> The irqfd code in KVM directly triggers the in-kernel irqchip. That's what >> this patch set is cleaning up: enable you to build ioeventfd support without >> in-kernel irqchip irqfd support. >> >> Vhost can work without in-kernel irqchip too, by listening to an eventfd >> (iiuc) in user space. That path works just fine with these patches applied. > > That's all true but it wasn't my point. I was asking whether you can > enable irqfd without in-kernel irqchip. If your irq input is edge > triggered then it's trivial. If it's level triggered then you can use > the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip
On 16.10.2012, at 12:57, Avi Kivity wrote: > On 10/15/2012 02:02 PM, Alexander Graf wrote: >> The current eventfd code assumes that when we have eventfd, we also have >> irqfd for in-kernel interrupt delivery. This is not necessarily true. On >> PPC we don't have an in-kernel irqchip yet, but we can still support easily >> support eventfd. >> > > Don't you need, in addition, to reject PIO space for ioeventfd? Yeah, we could tell user space that it's doing something stupid. Not sure how critical it is - right now it would just register it and never get triggered, because the bus doesn't exist. But sanity checks are a good thing usually. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 12:59 PM, Alexander Graf wrote: > > On 16.10.2012, at 12:56, Avi Kivity wrote: > >> On 10/15/2012 02:02 PM, Alexander Graf wrote: >>> In order to support vhost, we need to be able to support ioeventfd. >>> >>> This patch set adds support for ioeventfd to PPC and makes it possible to >>> do so without implementing irqfd along the way, as it requires an in-kernel >>> irqchip which we don't have yet. >> >> It's not strictly required. You have an interrupt line leading to the >> core, no? You could have your irqfd trigger that. > > The irqfd code in KVM directly triggers the in-kernel irqchip. That's what > this patch set is cleaning up: enable you to build ioeventfd support without > in-kernel irqchip irqfd support. > > Vhost can work without in-kernel irqchip too, by listening to an eventfd > (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 16.10.2012, at 12:56, Avi Kivity wrote: > On 10/15/2012 02:02 PM, Alexander Graf wrote: >> In order to support vhost, we need to be able to support ioeventfd. >> >> This patch set adds support for ioeventfd to PPC and makes it possible to >> do so without implementing irqfd along the way, as it requires an in-kernel >> irqchip which we don't have yet. > > It's not strictly required. You have an interrupt line leading to the > core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip
On 10/15/2012 02:02 PM, Alexander Graf wrote: > The current eventfd code assumes that when we have eventfd, we also have > irqfd for in-kernel interrupt delivery. This is not necessarily true. On > PPC we don't have an in-kernel irqchip yet, but we can still support easily > support eventfd. > Don't you need, in addition, to reject PIO space for ioeventfd? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/15/2012 02:02 PM, Alexander Graf wrote: > In order to support vhost, we need to be able to support ioeventfd. > > This patch set adds support for ioeventfd to PPC and makes it possible to > do so without implementing irqfd along the way, as it requires an in-kernel > irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 06:01 AM, Paul Mackerras wrote: > A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on > this fd return the contents of the HPT (hashed page table), writes > create and/or remove entries in the HPT. There is a new capability, > KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl > takes an argument structure with the index of the first HPT entry to > read out and a set of flags. The flags indicate whether the user is > intending to read or write the HPT, and whether to return all entries > or only the "bolted" entries (those with the bolted bit, 0x10, set in > the first doubleword). > > This is intended for use in implementing qemu's savevm/loadvm and for > live migration. Therefore, on reads, the first pass returns information > about all HPTEs (or all bolted HPTEs). When the first pass reaches the > end of the HPT, it returns from the read. Subsequent reads only return > information about HPTEs that have changed since they were last read. > A read that finds no changed HPTEs in the HPT following where the last > read finished will return 0 bytes. Copying people with interest in migration. > +4.78 KVM_PPC_GET_HTAB_FD > + > +Capability: KVM_CAP_PPC_HTAB_FD > +Architectures: powerpc > +Type: vm ioctl > +Parameters: Pointer to struct kvm_get_htab_fd (in) > +Returns: file descriptor number (>= 0) on success, -1 on error > + > +This returns a file descriptor that can be used either to read out the > +entries in the guest's hashed page table (HPT), or to write entries to > +initialize the HPT. The returned fd can only be written to if the > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and > +can only be read if that bit is clear. The argument struct looks like > +this: > + > +/* For KVM_PPC_GET_HTAB_FD */ > +struct kvm_get_htab_fd { > + __u64 flags; > + __u64 start_index; > +}; > + > +/* Values for kvm_get_htab_fd.flags */ > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) > +#define KVM_GET_HTAB_WRITE ((__u64)0x2) > + > +The `start_index' field gives the index in the HPT of the entry at > +which to start reading. It is ignored when writing. > + > +Reads on the fd will initially supply information about all > +"interesting" HPT entries. Interesting entries are those with the > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise > +all entries. When the end of the HPT is reached, the read() will > +return. What happens if the read buffer is smaller than the HPT size? What happens if the read buffer size is not a multiple of entry size? Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). > If read() is called again on the fd, it will start again from > +the beginning of the HPT, but will only return HPT entries that have > +changed since they were last read. > + > +Data read or written is structured as a header (8 bytes) followed by a > +series of valid HPT entries (16 bytes) each. The header indicates how > +many valid HPT entries there are and how many invalid entries follow > +the valid entries. The invalid entries are not represented explicitly > +in the stream. The header format is: > + > +struct kvm_get_htab_header { > + __u32 index; > + __u16 n_valid; > + __u16 n_invalid; > +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? > + > +Writes to the fd create HPT entries starting at the index given in the > +header; first `n_valid' valid entries with contents from the data > +written, then `n_invalid' invalid entries, invalidating any previously > +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Suppose new hardware arrives that supports nesting HPTs, so that kvm is no longer synchronously aware of the guest HPT (similar to how NPT/EPT made kvm unaware of guest virtual->physical translations on x86). How will we deal with that? But I guess this will be a non-guest-transparent and non-userspace-transparent change, unlike NPT/EPT, so a userspace ABI addition will be needed anyway). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 16.10.2012, at 11:44, Avi Kivity wrote: > On 10/16/2012 05:59 AM, Paul Mackerras wrote: >> The mmu_notifier_retry() function, used to test whether any page >> invalidations are in progress, currently takes a vcpu pointer, though >> the code only needs the VM's struct kvm pointer. Forthcoming patches >> to the powerpc Book3S HV code will need to test for retry within a VM >> ioctl, where a struct kvm pointer is available but a struct vcpu >> pointer isn't. Therefore this creates a variant of mmu_notifier_retry >> called kvm_mmu_notifier_retry that takes a struct kvm pointer, and >> implements mmu_notifier_retry in terms of it. > > Why not change mmu_notifier_retry() and all its callers? Why not use Christoffer's patch? :) Alex -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 10/16/2012 05:59 AM, Paul Mackerras wrote: > The mmu_notifier_retry() function, used to test whether any page > invalidations are in progress, currently takes a vcpu pointer, though > the code only needs the VM's struct kvm pointer. Forthcoming patches > to the powerpc Book3S HV code will need to test for retry within a VM > ioctl, where a struct kvm pointer is available but a struct vcpu > pointer isn't. Therefore this creates a variant of mmu_notifier_retry > called kvm_mmu_notifier_retry that takes a struct kvm pointer, and > implements mmu_notifier_retry in terms of it. Why not change mmu_notifier_retry() and all its callers? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] Various Book3s HV fixes that haven't been picked up yet
On 16.10.2012, at 05:08, Paul Mackerras wrote: > On Mon, Oct 15, 2012 at 02:00:54PM +0200, Alexander Graf wrote: >> >> Sorry, I can't accept patches that haven't shown up on kvm@vger. Please send >> this patch set again with CC to kvm@vger. > > Done; I didn't cc kvm-ppc this time since the patches haven't changed. Thanks :) > > By the way, what is the purpose of kvm-ppc@vger.kernel.org? It's basically a tag so people can filter based on it. For me, mails to kvm-ppc@vger land straight in my inbox, while mails to kvm@vger go to a folder I only read occasionally. Alex > > Paul. > -- > To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html