Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:

> I'm not following you.
> 
> You have to enforce group/iommu domain assignment whether you have the
> existing uiommu API, or if you change it to your proposed
> ioctl(inherit_iommu) API.
> 
> The only change needed to VFIO here should be to make uiommu fd assignment
> happen on the groups instead of on device fds.  That operation fails or
> succeeds according to the group semantics (all-or-none assignment/same
> uiommu).

Ok, so I missed that part where you change uiommu to operate on group
fd's rather than device fd's, my apologies if you actually wrote that
down :-) It might be obvious ... bear with me, I just flew back from the
US and I am badly jet lagged ...

So I see what you mean, however...

> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
> arbitrary mapping (satisfying group constraints) as we do today.
> 
> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
> ability and definitely think the uiommu approach is cleaner than the
> ioctl(inherit_iommu) approach.  We considered that approach before but it
> seemed less clean so we went with the explicit uiommu context.

Possibly; the question that interests me the most is what interface KVM
will end up using. I'm also not terribly fond of the (perceived)
discrepancy between using uiommu to create groups but using the group fd
to actually do the mappings, at least if that is still the plan.

If the separate uiommu interface is kept, then anything that wants to be
able to benefit from the ability to put multiple devices (or existing
groups) into such a "meta group" would need to be explicitly modified to
deal with the uiommu APIs.

I tend to prefer such "meta groups" as being something you create
statically using a configuration interface, either via sysfs, netlink or
ioctl's to a "control" vfio device driven by a simple command line tool
(which can have the configuration stored in /etc and re-apply it at
boot).

That way, any program capable of exploiting VFIO "groups" will
automatically be able to exploit those "meta groups" (or groups of
groups) as well, as long as they are supported on the system.

If we ever have system specific constraints as to how such groups can be
created, then it can all be handled at the level of that configuration
tool without impact on whatever programs know how to exploit them via
the VFIO interfaces.
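
To make that concrete, the tool's kernel interface could be as small as
a single ioctl on a control device. Everything below (the
/dev/vfio/control node, the struct, the ioctl number) is a made-up
sketch to illustrate the shape, not a proposal for a final ABI:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* All names here are hypothetical illustrations. */
struct vfio_metagroup_req {
	unsigned int ngroups;		/* how many existing groups to merge */
	unsigned int groups[8];		/* IDs from the iommu_group attribute */
};
#define VFIO_CREATE_METAGROUP	_IOW(';', 100, struct vfio_metagroup_req)

int main(void)
{
	struct vfio_metagroup_req req = { .ngroups = 2, .groups = { 42, 43 } };
	int ctl = open("/dev/vfio/control", O_RDWR);	/* hypothetical node */

	if (ctl < 0 || ioctl(ctl, VFIO_CREATE_METAGROUP, &req) < 0)
		perror("create metagroup");
	return 0;
}

A boot script could then re-apply whatever is stored in /etc, and any
VFIO-aware program sees the result as an ordinary group.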

> >  .../...
> > 
> >> If we in singleton-group land were building our own "groups" which were 
> >> sets
> >> of devices sharing the IOMMU domains we wanted, I suppose we could do away
> >> with uiommu fds, but it sounds like the current proposal would create 20
> >> singleton groups (x86 iommu w/o PCI bridges => all devices are 
> >> partitionable
> >> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
> >> worse than the current explicit uiommu API.
> > 
> > I'd rather have an API to create super-groups (groups of groups)
> > statically and then you can use such groups as normal groups using the
> > same interface. That create/management process could be done via a
> > simple command line utility or via sysfs banging, whatever...

Cheers,
Ben.



Re: DMI BIOS String

2011-08-22 Thread Philipp Hahn
Hello Derek,

On Tuesday 23 August 2011 06:30:13 Derek wrote:
> I found 'virsh edit' is the recommended way to make changes to a VM
> configuration; the XML file is not referenced at boot, it is only written
> by 'virsh dumpxml'.  'virsh edit' will update both areas - hypervisor and
> XML file.
>
> Unfortunately the changes I make for smbios and sysinfo (as below) while
> editing don't appear to be accepted when saving with 'virsh edit'.  I
> wonder if they are perhaps not supported with virsh edit yet?

If your domain is running, you get the config from the CURRENTLY RUNNING kvm 
process, but it is saved for the NEXT start. Just shut down your current 
domain and restart it; rebooting from inside the domain (Ctrl-Alt-Del) is not 
enough, since a new kvm process needs to be forked.

> Does anyone have experience with this, or an idea on why I would receive
> 'bad file descriptor' errors using the kvm command manually?

Perhaps you should switch over to the libvirt-ML.

Sincerely
Philipp Hahn
-- 
Philipp Hahn   Open Source Software Engineer  h...@univention.de
Univention GmbHLinux for Your Businessfon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen fax: +49 421 22 232-99
   http://www.univention.de/

Treffen Sie Univention auf der IT&Business vom 20. bis 22. September 2011
auf dem Gemeinschaftsstand der Open Source Business Alliance in Stuttgart in
Halle 3 Stand 3D27-7.




Re: DMI BIOS String

2011-08-22 Thread Derek
I found 'virsh edit' is the recommended way to make changes to a VM 
configuration; the XML file is not referenced at boot, it is only written by 
'virsh dumpxml'.  'virsh edit' will update both areas - hypervisor and XML file.

Unfortunately the changes I make for smbios and sysinfo (as below) while 
editing don't appear to be accepted when saving with 'virsh edit'.  I wonder if 
they are perhaps not supported with virsh edit yet?

Does anyone have experience with this, or an idea on why I would receive 'bad 
file descriptor' errors using the kvm command manually?

Thanks in advance,
Derek

On 23/08/2011, at 11:55 AM, Derek wrote:

> Thanks everyone for the useful replies.
> 
> I am using libvirt. One thing I'm observing after adding the following 
> sysinfo to the VM.xml file is that, judging from the qemu log file for the 
> VM, it is not being passed to the command when booting.
> 
>   <os>
>     <type arch='x86_64' machine='pc-0.12'>hvm</type>
>     <boot dev='cdrom'/>
>     <smbios mode='sysinfo'/>
>   </os>
>   <sysinfo type='smbios'>
>     <bios>
>       <entry name='vendor'>xxx xxx</entry>
>     </bios>
>     <system>
>       <entry name='manufacturer'>xxx</entry>
>       <entry name='product'>xxx xxx</entry>
>       <entry name='version'>xxx</entry>
>       <entry name='serial'>xxx</entry>
>     </system>
>   </sysinfo>
> 
> Is there a step I'm missing to have the xml file re-read after changes?
> 
> I am editing the xml file on the command line, and using libvirt to boot.
> 
> FYI here is the command shown in the logs while booting:
> 
> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin 
> HOME=/root USER=root LOGNAME=root QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M 
> pc-0.12 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name Test 
> -uuid a8024009-d459-ad03-0fca-304746a93c8a -nodefaults -chardev 
> socket,id=monitor,path=/var/lib/libvirt/qemu/HMC.monitor,server,nowait -mon 
> chardev=monitor,mode=readline -rtc base=utc -boot d -drive 
> file=/install-images/install.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
>  -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -device 
> rtl8139,vlan=0,id=net0,mac=52:54:00:4e:80:a4,bus=pci.0,addr=0x3 -net 
> tap,fd=49,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device 
> isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:4 -vga cirrus -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 
> char device redirected to /dev/pts/4
> 
> I tried running the above manually with -smbios type=1 etc. added, but I am 
> receiving 'bad file descriptor' errors; perhaps my own error or inexperience 
> with KVM.
> 
> Thanks!
> Derek
> 
> On 23/08/2011, at 8:03 AM, Ryan Harper wrote:
> 
>> * Andrew Cathrow  [2011-08-22 06:55]:
>>> 
>>> 
>>> - Original Message -
 From: "Derek" 
 To: kvm@vger.kernel.org
 Sent: Sunday, August 21, 2011 11:52:19 PM
 Subject: DMI BIOS String
 Hi Folks,
 
 I could not track down any solid info on modifying the DMI BIOS
 string.
>>> 
>>> qemu-kvm -help | grep bios
>>> 
>>> -smbios file=binary
>> 
>> for binary mode, this commit message is most helpful
>> 
>> 
>> commit b6f6e3d3a71cee61320216a42940cfaa9b42a162
>> Author: aliguori 
>> Date:   Fri Apr 17 18:59:56 2009 +
>> 
>>   qemu: Add support for SMBIOS command line options (Alex Williamson)
>> 
>>   Create a new -smbios option (x86-only) to allow binary SMBIOS entries
>>   to be passed through to the BIOS or modify the default values of
>>   individual fields of type 0 and 1 entries on the command line.
>> 
>>   Binary SMBIOS entries can be generated as follows:
>> 
>>   dmidecode -t 1 -u | grep $'^\t\t[^"]' | xargs -n1 | \
>>   perl -lne 'printf "%c", hex($_)' > smbios_type_1.bin
>> 
>>   These can then be passed to the BIOS using this switch:
>> 
>>-smbios file=smbios_type_1.bin
>> 
>>   Command line generation supports the following syntax:
>> 
>>-smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
>>-smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
>> [,uuid=$(uuidgen)][,sku=str][,family=str]
>> 
>>   For instance, to add a serial number to the type 1 table:
>> 
>>-smbios type=1,serial=0123456789
>> 
>>   Interface is extensible to support more fields/tables as needed.
>> 
>>   aliguori: remove texi formatting from help output
>> 
>>   Signed-off-by: Alex Williamson 
>>   Signed-off-by: Anthony Liguori 
>> 
>> 
>>> -smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
>>> -smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
>>> 
>>> or if you're using libvirt
>>> http://libvirt.org/formatdomain.html#elementsSysinfo
>>> 
>>> 
 
 For example, in VirtualBox you can use 'vboxmanage setsextradata' to
 set the BIOS product and vendor string per VM.
 
 Any ideas if this is possible with KVM?
 
 Thanks,
 Derek

Re: [PATCH] virtio-net: Read MAC only after initializing MSI-X

2011-08-22 Thread Rusty Russell
On Mon, 22 Aug 2011 11:36:13 +0300, "Michael S. Tsirkin"  
wrote:
> Yes. Or maybe 'WHEN' - since if MSI-X is disabled again, it moves back.
> Let's update the spec to make it clearer?

I left the language as is, but added a footnote at the end of the
sentence:

If MSI-X is enabled for the device, two additional fields immediately
follow this header:
[footnote: ie. once you enable MSI-X on the device, the other
   fields move. If you turn it off again, they move back!]
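
For reference, here is the layout in question as offset macros -- a
sketch based on the legacy virtio-pci layout, with names following
Linux's virtio_pci.h; the spec remains the authoritative source:

/* Legacy virtio-pci I/O layout.  Offsets 0-19 are the common header. */
#define VIRTIO_PCI_HOST_FEATURES   0   /* 32-bit */
#define VIRTIO_PCI_GUEST_FEATURES  4   /* 32-bit */
#define VIRTIO_PCI_QUEUE_PFN       8   /* 32-bit */
#define VIRTIO_PCI_QUEUE_NUM       12  /* 16-bit */
#define VIRTIO_PCI_QUEUE_SEL       14  /* 16-bit */
#define VIRTIO_PCI_QUEUE_NOTIFY    16  /* 16-bit */
#define VIRTIO_PCI_STATUS          18  /* 8-bit */
#define VIRTIO_PCI_ISR             19  /* 8-bit */
/* Present only while MSI-X is enabled: */
#define VIRTIO_MSI_CONFIG_VECTOR   20  /* 16-bit */
#define VIRTIO_MSI_QUEUE_VECTOR    22  /* 16-bit */
/* Device-specific config (e.g. the virtio-net MAC) starts at 20
 * without MSI-X and at 24 with it -- hence "the other fields move". */
#define VIRTIO_PCI_CONFIG_OFF(msix_on) ((msix_on) ? 24 : 20)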

Thanks,
Rusty.


Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread David Gibson
On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
> On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > > capture the plan that I think we agreed to:
> > > 
> > > We need to address both the description and enforcement of device
> > > groups.  Groups are formed any time the iommu does not have resolution
> > > between a set of devices.  On x86, this typically happens when a
> > > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > > Power, partitionable endpoints define a group.  Grouping information
> > > needs to be exposed for both userspace and kernel internal usage.  This
> > > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > > 
> > > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > > 42
> > > 
> > > (I use a PCI example here, but attribute should not be PCI specific)
> > 
> > Ok.  Am I correct in thinking these group IDs are representing the
> > minimum granularity, and are therefore always static, defined only by
> > the connected hardware, not by configuration?
> 
> Yes, that's the idea.  An open question I have towards the configuration
> side is whether we might add iommu driver specific options to the
> groups.  For instance on x86 where we typically have B:D.F granularity,
> should we have an option not to trust multi-function devices and use a
> B:D granularity for grouping?

Right.  And likewise I can see a place for configuration parameters
like the present 'allow_unsafe_irqs'.  But these would be more-or-less
global options which affected the overall granularity, rather than
detailed configuration such as explicitly binding some devices into a
group, yes?

> > > From there we have a few options.  In the BoF we discussed a model where
> > > binding a device to vfio creates a /dev/vfio$GROUP character device
> > > file.  This "group" fd provides dma mapping ioctls as well as
> > > ioctls to enumerate and return a "device" fd for each attached member of
> > > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > > returning an error on open() of the group fd if there are members of the
> > > group not bound to the vfio driver.  Each device fd would then support a
> > > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > > vfio, except for the obvious domain and dma ioctls superseded by the
> > > group fd.
> > 
> > It seems a slightly strange distinction that the group device appears
> > when any device in the group is bound to vfio, but only becomes usable
> > when all devices are bound.
> > 
> > > Another valid model might be that /dev/vfio/$GROUP is created for all
> > > groups when the vfio module is loaded.  The group fd would allow open()
> > > and some set of iommu querying and device enumeration ioctls, but would
> > > error on dma mapping and retrieving device fds until all of the group
> > > devices are bound to the vfio driver.
> > 
> > Which is why I marginally prefer this model, although it's not a big
> > deal.
> 
> Right, we can also combine models.  Binding a device to vfio
> creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> device access until all the group devices are also bound.  I think
> the /dev/vfio/$GROUP might help provide an enumeration interface as well
> though, which could be useful.

I'm not entirely sure what you mean here.  But, that's now several
weak votes in favour of the always-present group devices, and none in
favour of the created-when-first-device-bound model, so I suggest we
take the /dev/vfio/$GROUP as our tentative approach.
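
Spelled out, the flow we seem to be converging on would look something
like this from userspace; all ioctl names and argument layouts below are
placeholders, nothing here is settled ABI:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Placeholder ioctls -- real names/numbers are yet to be defined. */
struct vfio_dma_map { unsigned long long vaddr, iova, size; };
#define VFIO_GROUP_MAP_DMA	 _IOW(';', 1, struct vfio_dma_map)
#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 2, char *)

int main(void)
{
	/* open() (or the dma ioctls) fails until every device in group 42
	 * is bound to the vfio driver, enforcing the group semantics. */
	int group = open("/dev/vfio/42", O_RDWR);
	if (group < 0) {
		perror("group");
		return 1;
	}

	struct vfio_dma_map map = { .vaddr = 0x100000, .iova = 0, .size = 4096 };
	ioctl(group, VFIO_GROUP_MAP_DMA, &map);	/* dma goes via the group fd */

	/* Device fds are only handed out through the group fd. */
	int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
	(void)dev;
	return 0;
}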

> > > In either case, the uiommu interface is removed entirely since dma
> > > mapping is done via the group fd.  As necessary in the future, we can
> > > define a more high performance dma mapping interface for streaming dma
> > > via the group fd.  I expect we'll also include architecture specific
> > > group ioctls to describe features and capabilities of the iommu.  The
> > > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > > to userspace process ownership model.
> > 
> > A 1:1 group<->process correspondance seems wrong to me. But there are
> > many ways you could legitimately write the userspace side of the code,
> > many of them involving some sort of concurrency.  Implementing that
> > concurrency as multiple processes (using explicit shared memory and/or
> > other IPC mechanisms to co-ordinate) seems a valid choice that we
> > shouldn't arbitrarily prohibit.
> > 
> > Obviously, only one UID may be permitted to have the group open at a
> > time, and I think that's enough to prevent them doing any worse than
> > shooting themselves in the foot.
> 
> 1:1 group<->process is probably too strong.  Not allowing concurrent
> open()s on the group file enforces a single userspace entity is
> responsible for that group

Re: [PATCH 07/11] KVM: MMU: fast prefetch spte on invlpg path

2011-08-22 Thread Xiao Guangrong
On 08/23/2011 06:28 AM, Marcelo Tosatti wrote:

>> +
>> +if (rmap_can_add(vcpu))
>> +break;
> 
> if (!rmap_can_add(vcpu))
> break;
> 
> ?
> 

Oh, oops, thanks for pointing it out. Will fix it.


Re: [PATCH 07/11] KVM: MMU: fast prefetch spte on invlpg path

2011-08-22 Thread Marcelo Tosatti
On Tue, Aug 16, 2011 at 02:44:42PM +0800, Xiao Guangrong wrote:
> Fast prefetch spte for the unsync shadow page on invlpg path
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_host.h |4 +---
>  arch/x86/kvm/mmu.c  |   38 +++---
>  arch/x86/kvm/paging_tmpl.h  |   30 ++
>  arch/x86/kvm/x86.c  |4 ++--
>  4 files changed, 36 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 58ea3a7..927ba73 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -460,7 +460,6 @@ struct kvm_arch {
>   unsigned int n_requested_mmu_pages;
>   unsigned int n_max_mmu_pages;
>   unsigned int indirect_shadow_pages;
> - atomic_t invlpg_counter;
>   struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>   /*
>* Hash table of struct kvm_mmu_page.
> @@ -754,8 +753,7 @@ int fx_init(struct kvm_vcpu *vcpu);
>  
>  void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> -const u8 *new, int bytes,
> -bool guest_initiated);
> +const u8 *new, int bytes);
>  int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
>  int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
>  void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ed3e778..f6de2fc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3530,8 +3530,7 @@ static bool last_updated_pte_accessed(struct kvm_vcpu 
> *vcpu)
>  }
>  
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> -const u8 *new, int bytes,
> -bool guest_initiated)
> +const u8 *new, int bytes)
>  {
>   gfn_t gfn = gpa >> PAGE_SHIFT;
>   union kvm_mmu_page_role mask = { .word = 0 };
> @@ -3540,7 +3539,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   LIST_HEAD(invalid_list);
>   u64 entry, gentry, *spte;
>   unsigned pte_size, page_offset, misaligned, quadrant, offset;
> - int level, npte, invlpg_counter, r, flooded = 0;
> + int level, npte, r, flooded = 0;
>   bool remote_flush, local_flush, zap_page;
>  
>   /*
> @@ -3555,19 +3554,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t 
> gpa,
>  
>   pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
>  
> - invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
> -
>   /*
>* Assume that the pte write on a page table of the same type
>* as the current vcpu paging mode since we update the sptes only
>* when they have the same mode.
>*/
> - if ((is_pae(vcpu) && bytes == 4) || !new) {
> + if (is_pae(vcpu) && bytes == 4) {
>   /* Handle a 32-bit guest writing two halves of a 64-bit gpte */
> - if (is_pae(vcpu)) {
> - gpa &= ~(gpa_t)7;
> - bytes = 8;
> - }
> + gpa &= ~(gpa_t)7;
> + bytes = 8;
> +
>   r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
>   if (r)
>   gentry = 0;
> @@ -3593,22 +3589,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t 
> gpa,
>*/
>   mmu_topup_memory_caches(vcpu);
>   spin_lock(&vcpu->kvm->mmu_lock);
> - if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
> - gentry = 0;
>   kvm_mmu_free_some_pages(vcpu);
>   ++vcpu->kvm->stat.mmu_pte_write;
>   trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
> - if (guest_initiated) {
> - if (gfn == vcpu->arch.last_pt_write_gfn
> - && !last_updated_pte_accessed(vcpu)) {
> - ++vcpu->arch.last_pt_write_count;
> - if (vcpu->arch.last_pt_write_count >= 3)
> - flooded = 1;
> - } else {
> - vcpu->arch.last_pt_write_gfn = gfn;
> - vcpu->arch.last_pt_write_count = 1;
> - vcpu->arch.last_pte_updated = NULL;
> - }
> + if (gfn == vcpu->arch.last_pt_write_gfn
> + && !last_updated_pte_accessed(vcpu)) {
> + ++vcpu->arch.last_pt_write_count;
> + if (vcpu->arch.last_pt_write_count >= 3)
> + flooded = 1;
> + } else {
> + vcpu->arch.last_pt_write_gfn = gfn;
> + vcpu->arch.last_pt_write_count = 1;
> + vcpu->arch.last_pte_updated = NULL;
>   }
>  
>   mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 7862c05..bdc2241 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -670,20 +670,27 @@ static voi

Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 2:49 PM, "Benjamin Herrenschmidt" 
wrote:

> 
>>> I wouldn't use uiommu for that.
>> 
>> Any particular reason besides saving a file descriptor?
>> 
>> We use it today, and it seems like a cleaner API than what you propose
>> changing it to.
> 
> Well for one, we are back to square one vs. grouping constraints.

I'm not following you.

You have to enforce group/iommu domain assignment whether you have the
existing uiommu API, or if you change it to your proposed
ioctl(inherit_iommu) API.

The only change needed to VFIO here should be to make uiommu fd assignment
happen on the groups instead of on device fds.  That operation fails or
succeeds according to the group semantics (all-or-none assignment/same
uiommu).

I think the question is: do we force 1:1 iommu/group mapping, or do we allow
arbitrary mapping (satisfying group constraints) as we do today.

I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
ability and definitely think the uiommu approach is cleaner than the
ioctl(inherit_iommu) approach.  We considered that approach before but it
seemed less clean so we went with the explicit uiommu context.

>  .../...
> 
>> If we in singleton-group land were building our own "groups" which were sets
>> of devices sharing the IOMMU domains we wanted, I suppose we could do away
>> with uiommu fds, but it sounds like the current proposal would create 20
>> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
>> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
>> worse than the current explicit uiommu API.
> 
> I'd rather have an API to create super-groups (groups of groups)
> statically and then you can use such groups as normal groups using the
> same interface. That create/management process could be done via a
> simple command line utility or via sysfs banging, whatever...






Re: DMI BIOS String

2011-08-22 Thread Derek
Thanks everyone for the useful replies.

I am using libvirt. One thing I'm observing after adding the following sysinfo 
to the VM.xml file is that, judging from the qemu log file for the VM, it is 
not being passed to the command when booting.

  <os>
    <type arch='x86_64' machine='pc-0.12'>hvm</type>
    <boot dev='cdrom'/>
    <smbios mode='sysinfo'/>
  </os>
  <sysinfo type='smbios'>
    <bios>
      <entry name='vendor'>xxx xxx</entry>
    </bios>
    <system>
      <entry name='manufacturer'>xxx</entry>
      <entry name='product'>xxx xxx</entry>
      <entry name='version'>xxx</entry>
      <entry name='serial'>xxx</entry>
    </system>
  </sysinfo>

Is there a step I'm missing to have the xml file re-read after changes?

I am editing the xml file on the command line, and using libvirt to boot.

FYI here is the command shown in the logs while booting:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin 
HOME=/root USER=root LOGNAME=root QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M 
pc-0.12 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name Test -uuid 
a8024009-d459-ad03-0fca-304746a93c8a -nodefaults -chardev 
socket,id=monitor,path=/var/lib/libvirt/qemu/HMC.monitor,server,nowait -mon 
chardev=monitor,mode=readline -rtc base=utc -boot d -drive 
file=/install-images/install.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -device 
rtl8139,vlan=0,id=net0,mac=52:54:00:4e:80:a4,bus=pci.0,addr=0x3 -net 
tap,fd=49,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device 
isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:4 -vga cirrus -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 
char device redirected to /dev/pts/4

I tried running the above manually with -smbios type=1 etc. added, but I am 
receiving 'bad file descriptor' errors; perhaps my own error or inexperience 
with KVM.

Thanks!
Derek

On 23/08/2011, at 8:03 AM, Ryan Harper wrote:

> * Andrew Cathrow  [2011-08-22 06:55]:
>> 
>> 
>> - Original Message -
>>> From: "Derek" 
>>> To: kvm@vger.kernel.org
>>> Sent: Sunday, August 21, 2011 11:52:19 PM
>>> Subject: DMI BIOS String
>>> Hi Folks,
>>> 
>>> I could not track down any solid info on modifying the DMI BIOS
>>> string.
>> 
>> qemu-kvm -help | grep bios
>> 
>> -smbios file=binary
> 
> for binary mode, this commit message is most helpful
> 
> 
> commit b6f6e3d3a71cee61320216a42940cfaa9b42a162
> Author: aliguori 
> Date:   Fri Apr 17 18:59:56 2009 +
> 
>qemu: Add support for SMBIOS command line options (Alex Williamson)
> 
>Create a new -smbios option (x86-only) to allow binary SMBIOS entries
>to be passed through to the BIOS or modify the default values of
>individual fields of type 0 and 1 entries on the command line.
> 
>Binary SMBIOS entries can be generated as follows:
> 
>dmidecode -t 1 -u | grep $'^\t\t[^"]' | xargs -n1 | \
>perl -lne 'printf "%c", hex($_)' > smbios_type_1.bin
> 
>These can then be passed to the BIOS using this switch:
> 
> -smbios file=smbios_type_1.bin
> 
>Command line generation supports the following syntax:
> 
> -smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
> -smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
>  [,uuid=$(uuidgen)][,sku=str][,family=str]
> 
>For instance, to add a serial number to the type 1 table:
> 
> -smbios type=1,serial=0123456789
> 
>Interface is extensible to support more fields/tables as needed.
> 
>aliguori: remove texi formatting from help output
> 
>Signed-off-by: Alex Williamson 
>Signed-off-by: Anthony Liguori 
> 
> 
>> -smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
>> -smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
>> 
>> or if you're using libvirt
>> http://libvirt.org/formatdomain.html#elementsSysinfo
>> 
>> 
>>> 
>>> For example, in VirtualBox you can use 'vboxmanage setsextradata' to
>>> set the BIOS product and vendor string per VM.
>>> 
>>> Any ideas if this is possible with KVM?
>>> 
>>> Thanks,
>>> Derek
> 
> -- 
> Ryan Harper
> Software Engineer; Linux Technology Center
> IBM Corp., Austin, Tx
> ry...@us.ibm.com



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt

> > I wouldn't use uiommu for that.
> 
> Any particular reason besides saving a file descriptor?
> 
> We use it today, and it seems like a cleaner API than what you propose
> changing it to.

Well for one, we are back to square one vs. grouping constraints.

 .../...

> If we in singleton-group land were building our own "groups" which were sets
> of devices sharing the IOMMU domains we wanted, I suppose we could do away
> with uiommu fds, but it sounds like the current proposal would create 20
> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
> worse than the current explicit uiommu API.

I'd rather have an API to create super-groups (groups of groups)
statically and then you can use such groups as normal groups using the
same interface. That create/management process could be done via a
simple command line utility or via sysfs banging, whatever...

Cheers,
Ben.

> Thanks,
> Aaron
> 
> > 
> > Another option is to make that static configuration APIs via special
> > ioctls (or even netlink if you really like it), to change the grouping
> > on architectures that allow it.
> > 
> > Cheers.
> > Ben.
> > 
> >> 
> >> -Aaron
> >> 
> >>> As necessary in the future, we can
> >>> define a more high performance dma mapping interface for streaming dma
> >>> via the group fd.  I expect we'll also include architecture specific
> >>> group ioctls to describe features and capabilities of the iommu.  The
> >>> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> >>> to userspace process ownership model.
> >>> 
> >>> Also on the table is supporting non-PCI devices with vfio.  To do this,
> >>> we need to generalize the read/write/mmap and irq eventfd interfaces.
> >>> We could keep the same model of segmenting the device fd address space,
> >>> perhaps adding ioctls to define the segment offset bit position or we
> >>> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> >>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> >>> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> >>> event fd(s), per resource fd, etc).  For interrupts we can overload
> >>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> >>> devices support MSI?).
> >>> 
> >>> For qemu, these changes imply we'd only support a model where we have a
> >>> 1:1 group to iommu domain.  The current vfio driver could probably
> >>> become vfio-pci as we might end up with more target specific vfio
> >>> drivers for non-pci.  PCI should be able to maintain a simple -device
> >>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> >>> need to come up with extra options when we need to expose groups to
> >>> guest for pvdma.
> >>> 
> >>> Hope that captures it, feel free to jump in with corrections and
> >>> suggestions.  Thanks,
> >>> 
> >>> Alex
> >>> 
> > 
> > 




Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 1:49 PM, "Benjamin Herrenschmidt" 
wrote:

> On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:
> 
>>> Each device fd would then support a
>>> similar set of ioctls and mapping (mmio/pio/config) interface as current
>>> vfio, except for the obvious domain and dma ioctls superseded by the
>>> group fd.
>>> 
>>> Another valid model might be that /dev/vfio/$GROUP is created for all
>>> groups when the vfio module is loaded.  The group fd would allow open()
>>> and some set of iommu querying and device enumeration ioctls, but would
>>> error on dma mapping and retrieving device fds until all of the group
>>> devices are bound to the vfio driver.
>>> 
>>> In either case, the uiommu interface is removed entirely since dma
>>> mapping is done via the group fd.
>> 
>> The loss in generality is unfortunate. I'd like to be able to support
>> arbitrary iommu domain <-> device assignment.  One way to do this would be
>> to keep uiommu, but to return an error if someone tries to assign more than
>> one uiommu context to devices in the same group.
> 
> I wouldn't use uiommu for that.

Any particular reason besides saving a file descriptor?

We use it today, and it seems like a cleaner API than what you propose
changing it to.

> If the HW or underlying kernel drivers
> support it, what I'd suggest is that you have an (optional) ioctl to
> bind two groups (you have to have both opened already) or for one group
> to "capture" another one.

You'll need other rules there too: "both opened already, but zero mappings
performed yet, as they would have instantiated a default IOMMU domain".

Keep in mind the only case I'm using is singleton groups, a.k.a. devices.

Since what I want is to specify which devices can do things like share
network buffers (in a way that conserves IOMMU hw resources), it seems
cleanest to expose this explicitly, versus some "inherit iommu domain from
another device" ioctl.  What happens if I do something like this:

dev1_fd = open("/dev/vfio0", O_RDWR);
dev2_fd = open("/dev/vfio1", O_RDWR);
ioctl(dev2_fd, VFIO_INHERIT_IOMMU, dev1_fd);   /* the proposed "inherit" */

error = close(dev1_fd);

There are other gross cases as well.

> 
> The binding means under the hood the iommus get shared, with the
> lifetime being that of the "owning" group.

So what happens in the close() above?  EINUSE?  Reset all children?  Still
seems less clean than having an explicit iommu fd.  Without some benefit I'm
not sure why we'd want to change this API.

If we in singleton-group land were building our own "groups" which were sets
of devices sharing the IOMMU domains we wanted, I suppose we could do away
with uiommu fds, but it sounds like the current proposal would create 20
singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
worse than the current explicit uiommu API.

Thanks,
Aaron

> 
> Another option is to make that static configuration APIs via special
> ioctls (or even netlink if you really like it), to change the grouping
> on architectures that allow it.
> 
> Cheers.
> Ben.
> 
>> 
>> -Aaron
>> 
>>> As necessary in the future, we can
>>> define a more high performance dma mapping interface for streaming dma
>>> via the group fd.  I expect we'll also include architecture specific
>>> group ioctls to describe features and capabilities of the iommu.  The
>>> group fd will need to prevent concurrent open()s to maintain a 1:1 group
>>> to userspace process ownership model.
>>> 
>>> Also on the table is supporting non-PCI devices with vfio.  To do this,
>>> we need to generalize the read/write/mmap and irq eventfd interfaces.
>>> We could keep the same model of segmenting the device fd address space,
>>> perhaps adding ioctls to define the segment offset bit position or we
>>> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
>>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
>>> suffering some degree of fd bloat (group fd, device fd(s), interrupt
>>> event fd(s), per resource fd, etc).  For interrupts we can overload
>>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
>>> devices support MSI?).
>>> 
>>> For qemu, these changes imply we'd only support a model where we have a
>>> 1:1 group to iommu domain.  The current vfio driver could probably
>>> become vfio-pci as we might end up with more target specific vfio
>>> drivers for non-pci.  PCI should be able to maintain a simple -device
>>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
>>> need to come up with extra options when we need to expose groups to
>>> guest for pvdma.
>>> 
>>> Hope that captures it, feel free to jump in with corrections and
>>> suggestions.  Thanks,
>>> 
>>> Alex
>>> 
> 
> 



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt

> I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> assigned to a guest, there can also be an ioctl to bind a group to an
> address-space of another group (certainly needs some care to not allow
> that both groups belong to different processes).
> 
> Btw, a problem we haven't talked about entirely yet is
> driver deassignment. User space can decide to de-assign the device from
> vfio while an fd is open on it. With PCI there is no way to let this fail
> (the .release function returns void, last time I checked). Is this a
> problem, and if yes, how do we handle it?

We can treat it as a hard unplug (like a cardbus gone away).

IE. Dispose of the direct mappings (switch to MMIO emulation) and return
all ff's from reads (& ignore writes).

Then send an unplug event via whatever mechanism the platform provides
(ACPI hotplug controller on x86 for example, we haven't quite sorted out
what to do on power for hotplug yet).
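
The read/write side of that is simple enough to sketch -- illustrative
only, not the vfio code; real_mmio_* stand in for whatever backend
exists before the unplug:

static int device_gone;	/* set when the hard unplug happens */

unsigned int real_mmio_read(unsigned int offset);	/* stand-in backend */
void real_mmio_write(unsigned int offset, unsigned int val);

static unsigned int mmio_read(unsigned int offset)
{
	if (device_gone)
		return ~0U;	/* all ff's, like a surprise-removed card */
	return real_mmio_read(offset);
}

static void mmio_write(unsigned int offset, unsigned int val)
{
	if (device_gone)
		return;		/* writes are silently dropped */
	real_mmio_write(offset, val);
}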

Cheers,
Ben.




Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:

> Yes, that's the idea.  An open question I have towards the configuration
> side is whether we might add iommu driver specific options to the
> groups.  For instance on x86 where we typically have B:D.F granularity,
> should we have an option not to trust multi-function devices and use a
> B:D granularity for grouping?

Or even B or range of busses... if you want to enforce strict isolation
you really can't trust anything below a bus level :-)

> Right, we can also combine models.  Binding a device to vfio
> creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> device access until all the group devices are also bound.  I think
> the /dev/vfio/$GROUP might help provide an enumeration interface as well
> though, which could be useful.

Could be, though in what form? Returning sysfs paths?

> 1:1 group<->process is probably too strong.  Not allowing concurrent
> open()s on the group file enforces a single userspace entity is
> responsible for that group.  Device fds can be passed to other
> processes, but only retrieved via the group fd.  I suppose we could even
> branch off the dma interface into a different fd, but it seems like we
> would logically want to serialize dma mappings at each iommu group
> anyway.  I'm open to alternatives, this just seemed an easy way to do
> it.  Restricting on UID implies that we require isolated qemu instances
> to run as different UIDs.  I know that's a goal, but I don't know if we
> want to make it an assumption in the group security model.

1:1 process has the advantage of linking to an mm, which makes the whole
mmu notifier business doable. How do you want to track down mappings and
do the second level translation in the case of explicit map/unmap (like
on power) if you are not tied to an mm_struct?
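
For illustration, tying the group to the opener's mm at open() time
would look roughly like this -- sketch only, the vfio_group_ctx naming
and callback body are made up, though mmu_notifier_register() and the
ops table are the stock kernel API:

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct vfio_group_ctx {			/* hypothetical */
	struct mm_struct *mm;
	struct mmu_notifier mn;
};

static void vfio_mn_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	/* tear down iommu mappings covering [start, end) */
}

static const struct mmu_notifier_ops vfio_mn_ops = {
	.invalidate_range_start	= vfio_mn_invalidate_range_start,
};

static int vfio_group_bind_mm(struct vfio_group_ctx *ctx)
{
	ctx->mm = current->mm;		/* tie the group to the opener's mm */
	ctx->mn.ops = &vfio_mn_ops;
	return mmu_notifier_register(&ctx->mn, ctx->mm);
}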

> Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> to assume >1 device per guest is a typical model and that the iotlb is
> large enough that we might improve thrashing to see both a resource and
> performance benefit from it.  I'm open to suggestions for how we could
> include it though.

Sharing may or may not be possible depending on setups so yes, it's a
bit tricky.

My preference is to have a static interface (and that's actually where
your pet netlink might make some sense :-) to create "synthetic" groups
made of other groups if the arch allows it. But that might not be the
best approach. In another email I also proposed an option for a group to
"capture" another one...

> > If that's
> > not what you're saying, how would the domains - now made up of a
> > user's selection of groups, rather than individual devices - be
> > configured?
> > 
> > > Hope that captures it, feel free to jump in with corrections and
> > > suggestions.  Thanks,
> > 

Another aspect I don't see discussed is how we represent these things to
the guest.

On Power for example, I have a requirement that a given iommu domain is
represented by a single dma window property in the device-tree. What
that means is that that property needs to be either in the node of the
device itself if there's only one device in the group or in a parent
node (ie a bridge or host bridge) if there are multiple devices.

Now I do -not- want to go down the path of simulating P2P bridges,
besides we'll quickly run out of bus numbers if we go there.

For us the most simple and logical approach (which is also what pHyp
uses and what Linux handles well) is really to expose a given PCI host
bridge per group to the guest. Believe it or not, it makes things
easier :-)

Cheers,
Ben.




Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 09:30 +0300, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

The constraint might not necessarily be a device.

The PCI bridge is just an example. There are other possible constraints.
On POWER for example, it could be a limit in how far I can segment the
DMA address space, forcing me to arbitrarily put devices together, or it
could be a similar constraint related to how the MMIO space is broken
up.

So either that remains a path, in which case we do have a separate set of
sysfs nodes representing the groups themselves, which may or may not
contain a pointer to the "constraining" device, or we just make it
an arbitrary number (in my case the PE#).
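
Either way userspace consumption is trivial; with the plain-number
variant, checking whether two devices share a group is just (an
illustrative userspace snippet, assuming the sysfs attribute above):

#include <stdio.h>
#include <string.h>

static int read_group(const char *dev, char *buf, int len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", dev);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {	/* the group ID, e.g. "42" */
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	char a[64], b[64];

	if (!read_group("0000:00:19.0", a, sizeof(a)) &&
	    !read_group("0000:00:19.1", b, sizeof(b)))
		printf("same group: %s\n", strcmp(a, b) ? "no" : "yes");
	return 0;
}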

Cheers,
Ben



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:

> > Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> > 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> > 
> > In either case, the uiommu interface is removed entirely since dma
> > mapping is done via the group fd.
> 
> The loss in generality is unfortunate. I'd like to be able to support
> arbitrary iommu domain <-> device assignment.  One way to do this would be
> to keep uiommu, but to return an error if someone tries to assign more than
> one uiommu context to devices in the same group.

I wouldn't use uiommu for that. If the HW or underlying kernel drivers
support it, what I'd suggest is that you have an (optional) ioctl to
bind two groups (you have to have both opened already) or for one group
to "capture" another one.

The binding means under the hood the iommus get shared, with the
lifetime being that of the "owning" group.

Another option is to make that static configuration APIs via special
ioctls (or even netlink if you really like it), to change the grouping
on architectures that allow it.

Cheers.
Ben.

> 
> -Aaron
> 
> > As necessary in the future, we can
> > define a more high performance dma mapping interface for streaming dma
> > via the group fd.  I expect we'll also include architecture specific
> > group ioctls to describe features and capabilities of the iommu.  The
> > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > to userspace process ownership model.
> > 
> > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > We could keep the same model of segmenting the device fd address space,
> > perhaps adding ioctls to define the segment offset bit position or we
> > could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> > suffering some degree of fd bloat (group fd, device fd(s), interrupt
> > event fd(s), per resource fd, etc).  For interrupts we can overload
> > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> > devices support MSI?).
> > 
> > For qemu, these changes imply we'd only support a model where we have a
> > 1:1 group to iommu domain.  The current vfio driver could probably
> > become vfio-pci as we might end up with more target specific vfio
> > drivers for non-pci.  PCI should be able to maintain a simple -device
> > vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> > need to come up with extra options when we need to expose groups to
> > guest for pvdma.
> > 
> > Hope that captures it, feel free to jump in with corrections and
> > suggestions.  Thanks,
> > 
> > Alex
> > 




Re: [PATCH 03/11] KVM: x86: retry non-page-table writing instruction

2011-08-22 Thread Marcelo Tosatti
On Tue, Aug 23, 2011 at 04:21:05AM +0800, Xiao Guangrong wrote:
> On 08/23/2011 03:59 AM, Marcelo Tosatti wrote:
> 
> >> +  if (!vcpu->arch.mmu.direct_map && !mmu_is_nested(vcpu))
> >> +  gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> > 
> > Why write? 
> > 
> 
> Since the fault is caused by a page-table write, and the 'gpa' can
> be written to after the instruction is retried.
> 
> >> +  kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> >> +
> >> +  /*
> >> +   * The shadow pages have been zapped, then we call the page
> >> +   * fault path to change the mapping to writable.
> >> +   */
> >> +  vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, true);
> > 
> > I don't see why is this necessary. Just allowing the instruction to
> > proceed should be enough?
> > 
> 
> It is used to avoid a later VM-exit: we will retry the instruction,
> but the mapping is still read-only. So we call it to make the mapping
> writable and avoid another page fault.

It's not like this case is performance sensitive. Usually optimizing
things without the need for it leads to bad results. So please drop
this.



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/20/11 9:51 AM, "Alex Williamson"  wrote:

> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but attribute should not be PCI specific)
> 
> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.

Sounds reasonable.

> Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.
> 
> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.
> 
> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.

The loss in generality is unfortunate. I'd like to be able to support
arbitrary iommu domain <-> device assignment.  One way to do this would be
to keep uiommu, but to return an error if someone tries to assign more than
one uiommu context to devices in the same group.
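
Concretely, the group-side check could be as simple as this (names made
up, sketch only):

#include <errno.h>

struct uiommu_ctx;				/* opaque, hypothetical */
struct vfio_group  { struct uiommu_ctx *uiommu; };
struct vfio_device { struct vfio_group *group; };

static int vfio_attach_uiommu(struct vfio_device *vdev, struct uiommu_ctx *ctx)
{
	struct vfio_group *group = vdev->group;

	if (group->uiommu && group->uiommu != ctx)
		return -EBUSY;	/* would split the group across contexts */
	group->uiommu = ctx;	/* all-or-none at group granularity */
	return 0;
}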


-Aaron

> As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.
> 
> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> devices support MSI?).
> 
> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.
> 
> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,
> 
> Alex
> 



Re: [PATCH 03/11] KVM: x86: retry non-page-table writing instruction

2011-08-22 Thread Xiao Guangrong
On 08/23/2011 03:59 AM, Marcelo Tosatti wrote:

>> +if (!vcpu->arch.mmu.direct_map && !mmu_is_nested(vcpu))
>> +gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> 
> Why write? 
> 

Since the fault is caused by a page-table write, and the 'gpa' can
be written to after the instruction is retried.

>> +kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
>> +
>> +/*
>> + * The shadow pages have been zapped, then we call the page
>> + * fault path to change the mapping to writable.
>> + */
>> +vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, true);
> 
> I don't see why is this necessary. Just allowing the instruction to
> proceed should be enough?
> 

It is used to avoid a later VM-exit: we will retry the instruction,
but the mapping is still read-only. So we call it to make the mapping
writable and avoid another page fault.


Re: DMI BIOS String

2011-08-22 Thread Ryan Harper
* Andrew Cathrow  [2011-08-22 06:55]:
> 
> 
> - Original Message -
> > From: "Derek" 
> > To: kvm@vger.kernel.org
> > Sent: Sunday, August 21, 2011 11:52:19 PM
> > Subject: DMI BIOS String
> > Hi Folks,
> > 
> > I could not track down any solid info on modifying the DMI BIOS
> > string.
> 
> qemu-kvm -help | grep bios
> 
> -smbios file=binary

for binary mode, this commit message is most helpful


 commit b6f6e3d3a71cee61320216a42940cfaa9b42a162
 Author: aliguori 
 Date:   Fri Apr 17 18:59:56 2009 +

    qemu: Add support for SMBIOS command line options (Alex Williamson)

Create a new -smbios option (x86-only) to allow binary SMBIOS entries
to be passed through to the BIOS or modify the default values of
individual fields of type 0 and 1 entries on the command line.

Binary SMBIOS entries can be generated as follows:

dmidecode -t 1 -u | grep $'^\t\t[^"]' | xargs -n1 | \
perl -lne 'printf "%c", hex($_)' > smbios_type_1.bin

These can then be passed to the BIOS using this switch:

 -smbios file=smbios_type_1.bin

Command line generation supports the following syntax:

 -smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
 -smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
  [,uuid=$(uuidgen)][,sku=str][,family=str]

For instance, to add a serial number to the type 1 table:

 -smbios type=1,serial=0123456789

Interface is extensible to support more fields/tables as needed.

aliguori: remove texi formatting from help output

Signed-off-by: Alex Williamson 
Signed-off-by: Anthony Liguori 


> -smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
> -smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]
> 
> or if you're using libvirt
> http://libvirt.org/formatdomain.html#elementsSysinfo
> 
> 
> > 
> > For example, in VirtualBox you can use 'vboxmanage setsextradata' to
> > set the BIOS product and vendor string per VM.
> > 
> > Any ideas if this is possible with KVM?
> > 
> > Thanks,
> > Derek

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: [PATCH 03/11] KVM: x86: retry non-page-table writing instruction

2011-08-22 Thread Marcelo Tosatti
On Tue, Aug 16, 2011 at 02:42:07PM +0800, Xiao Guangrong wrote:
> If the emulation is caused by #PF and it is a non-page-table-writing
> instruction, it means the VM-EXIT is caused by shadow page protection;
> we can zap the shadow page and retry this instruction directly
> 
> The idea is from Avi
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_emulate.h |1 +
>  arch/x86/include/asm/kvm_host.h|5 +++
>  arch/x86/kvm/emulate.c |5 +++
>  arch/x86/kvm/mmu.c |   22 +++---
>  arch/x86/kvm/x86.c |   53 
> 
>  5 files changed, 81 insertions(+), 5 deletions(-)
> 
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4814,6 +4814,56 @@ static bool reexecute_instruction(struct kvm_vcpu 
> *vcpu, gva_t gva)
>   return false;
>  }
>  
> +static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
> +   unsigned long cr2,  int emulation_type)
> +{
> + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> + unsigned long last_retry_eip, last_retry_addr, gpa = cr2;
> +
> + last_retry_eip = vcpu->arch.last_retry_eip;
> + last_retry_addr = vcpu->arch.last_retry_addr;
> +
> + /*
> +  * If the emulation is caused by #PF and it is non-page_table
> +  * writing instruction, it means the VM-EXIT is caused by shadow
> +  * page protected, we can zap the shadow page and retry this
> +  * instruction directly.
> +  *
> +  * Note: if the guest uses a non-page-table modifying instruction
> +  * on the PDE that points to the instruction, then we will unmap
> +  * the instruction and go to an infinite loop. So, we cache the
> +  * last retried eip and the last fault address, if we meet the eip
> +  * and the address again, we can break out of the potential infinite
> +  * loop.
> +  */
> + vcpu->arch.last_retry_eip = vcpu->arch.last_retry_addr = 0;
> +
> + if (!(emulation_type & EMULTYPE_RETRY))
> + return false;
> +
> + if (page_table_writing_insn(ctxt))
> + return false;
> +
> + if (ctxt->eip == last_retry_eip && last_retry_addr == cr2)
> + return false;
> +
> + vcpu->arch.last_retry_eip = ctxt->eip;
> + vcpu->arch.last_retry_addr = cr2;
> +
> + if (!vcpu->arch.mmu.direct_map && !mmu_is_nested(vcpu))
> + gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);

Why write? 

> + kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> +
> + /*
> +  * The shadow pages have been zapped, then we call the page
> +  * fault path to change the mapping to writable.
> +  */
> + vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, true);

I don't see why this is necessary. Just allowing the instruction to
proceed should be enough?

Looks good otherwise.


Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Alex Williamson
On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
> On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote:
> > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > capture the plan that I think we agreed to:
> > 
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > 
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> 
> Right, that is mainly for libvirt to provide that information to the
> user in a meaningful way. So userspace is aware that other devices might
> not work anymore when it assigns one to a guest.
> 
> > 
> > (I use a PCI example here, but the attribute should not be PCI specific)
> > 
> > From there we have a few options.  In the BoF we discussed a model where
> > binding a device to vfio creates a /dev/vfio$GROUP character device
> > file.  This "group" fd provides dma mapping ioctls as well as
> > ioctls to enumerate and return a "device" fd for each attached member of
> > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > returning an error on open() of the group fd if there are members of the
> > group not bound to the vfio driver.  Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> > 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> 
> I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> assigned to a guest, there can also be an ioctl to bind a group to an
> address-space of another group (certainly needs some care to not allow
> that both groups belong to different processes).

That's an interesting idea.  Maybe an interface similar to the current
uiommu interface, where you open() the 2nd group fd and pass the fd via
ioctl to the primary group.  IOMMUs that don't support this would fail
the attach device callback, which would fail the ioctl to bind them.  It
will need to be designed so any group can be removed from the super-set
and the remaining group(s) still works.  This feels like something that
can be added after we get an initial implementation.
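
In sketch form the flow could look like the snippet below; the
VFIO_GROUP_MERGE name, its ioctl number and the device paths are all
hypothetical placeholders from this discussion, not a defined ABI:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE 0x3b01		/* hypothetical ioctl number */

int main(void)
{
	int g1 = open("/dev/vfio/42", O_RDWR);	/* primary group */
	int g2 = open("/dev/vfio/43", O_RDWR);	/* group joining g1's domain */

	if (g1 < 0 || g2 < 0)
		return 1;
	/* the iommu driver may refuse the attach, failing the ioctl */
	if (ioctl(g1, VFIO_GROUP_MERGE, &g2) < 0)
		perror("VFIO_GROUP_MERGE");
	return 0;
}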

> Btw, a problem we haven't entirely talked about yet is
> driver-deassignment. User space can decide to de-assign the device from
> vfio while a fd is open on it. With PCI there is no way to let this fail
> (the .release function returns void last time I checked). Is this a
> problem, and if yes, how do we handle it?

The current vfio has the same problem, we can't unbind a device from
vfio while it's attached to a guest.  I think we'd use the same solution
too; send out a netlink packet for a device removal and have the .remove
call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
and SIGBUS the PIDs holding the device if they don't return it
willingly.  Thanks,

Alex
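
For concreteness, the .remove path described above might be shaped like
the following sketch; vfio_netlink_remove_request(), release_wait and
refcnt are made-up names, not existing code:

static void vfio_pci_remove(struct pci_dev *pdev)
{
	struct vfio_device *vdev = pci_get_drvdata(pdev);

	/* ask userspace to give the device back ... */
	vfio_netlink_remove_request(vdev);
	/* ... and block until the last fd reference is dropped */
	wait_event(vdev->release_wait, atomic_read(&vdev->refcnt) == 0);
	/* a timeout plus SIGBUS to the holders would slot in here */
}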




Re: [Qemu-devel] [PULL][v2] Memory API fix

2011-08-22 Thread Anthony Liguori

On 08/22/2011 11:54 AM, Avi Kivity wrote:

[v2: fix git url...]

Anthony, please pull from:

git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git memory/core


Pulled.  Thanks.

Regards,

Anthony Liguori



to receive a memory API fix. The fix is not required for any of the
already merged patches, but is needed by other patches in my queue.

I am sending core patches separately from the conversions to increase
their visibility.


(and I'll take this opportunity to announce a new layout for the branch:

- memory/core - core fixes, usually not rebased
- memory/batch - pending batch
- memory/queue - the rest of the patches in my queue
- memory/master - a merge of the first three; please post patches based
on this branch

this will allow me to fast-path core fixes without rebasing
memory/batch, except when a dependency is involved.)

Richard Henderson (1):
memory: Fix old_portio vs non-zero offset

memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)





Re: [PATCH v2] posix-aio-compat: fix latency issues

2011-08-22 Thread Jan Kiszka
On 2011-08-14 06:04, Avi Kivity wrote:
> In certain circumstances, posix-aio-compat can incur a lot of latency:
>  - threads are created by vcpu threads, so if vcpu affinity is set,
>aio threads inherit vcpu affinity.  This can cause many aio threads
>to compete for one cpu.
>  - we can create up to max_threads (64) aio threads in one go; since a
>pthread_create can take around 30μs, we have up to 2ms of cpu time
>under a global lock.
> 
> Fix by:
>  - moving thread creation to the main thread, so we inherit the main
>thread's affinity instead of the vcpu thread's affinity.
>  - if a thread is currently being created, and we need to create yet
>another thread, let the thread being born create the new thread, reducing
>the amount of time we spend under the main thread.
>  - drop the local lock while creating a thread (we may still hold the
>global mutex, though)
> 
> Note this doesn't eliminate latency completely; scheduler artifacts or
> lack of host cpu resources can still cause it.  We may want pre-allocated
> threads when this cannot be tolerated.
> 
> Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.
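
To make the second bullet concrete, here is a generic pthreads sketch of
the hand-off; lock, threads_wanted and spawn_detached_thread() are
invented bookkeeping, and this is not the actual qemu code:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int threads_wanted;		/* pending spawn requests */

static void *aio_thread(void *arg)
{
	pthread_mutex_lock(&lock);
	/* the thread being born services pending spawn requests, so the
	   requester never has to loop on pthread_create() itself */
	while (threads_wanted > 0) {
		threads_wanted--;
		pthread_mutex_unlock(&lock);	/* never create under the lock */
		spawn_detached_thread(aio_thread);
		pthread_mutex_lock(&lock);
	}
	pthread_mutex_unlock(&lock);
	/* ... proceed to process aio requests ... */
	return NULL;
}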

At this chance: What is the state of getting rid of the remaining delta
between upstream's version and qemu-kvm?

Jan

-- 
Siemens AG
Corporate Technology
CT T DE IT 1
Corporate Competence Center Embedded Linux
Otto-Hahn-Ring 6
81739 Muenchen
Tel.: +49 (89) 636-40042
Fax: +49 (89) 636-45450
mailto:jan.kis...@siemens.com



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Joerg Roedel
On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote:
> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42

Right, that is mainly for libvirt to provide that information to the
user in a meaningful way. So userspace is aware that other devices might
not work anymore when it assigns one to a guest.

> 
> (I use a PCI example here, but the attribute should not be PCI specific)
> 
> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.  Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.
> 
> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.

I am in favour of /dev/vfio/$GROUP. If multiple devices should be
assigned to a guest, there can also be an ioctl to bind a group to an
address-space of another group (certainly needs some care to not allow
that both groups belong to different processes).

Btw, a problem we haven't entirely talked about yet is
driver-deassignment. User space can decide to de-assign the device from
vfio while a fd is open on it. With PCI there is no way to let this fail
(the .release function returns void last time I checked). Is this a
problem, and if yes, how do we handle it?


Joerg

-- 
AMD Operating System Research Center




Re: [PATCH v2] posix-aio-compat: fix latency issues

2011-08-22 Thread Kevin Wolf
Am 14.08.2011 06:04, schrieb Avi Kivity:
> In certain circumstances, posix-aio-compat can incur a lot of latency:
>  - threads are created by vcpu threads, so if vcpu affinity is set,
>aio threads inherit vcpu affinity.  This can cause many aio threads
>to compete for one cpu.
>  - we can create up to max_threads (64) aio threads in one go; since a
>pthread_create can take around 30μs, we have up to 2ms of cpu time
>under a global lock.
> 
> Fix by:
>  - moving thread creation to the main thread, so we inherit the main
>thread's affinity instead of the vcpu thread's affinity.
>  - if a thread is currently being created, and we need to create yet
>another thread, let the thread being born create the new thread, reducing
>the amount of time we spend under the main thread.
>  - drop the local lock while creating a thread (we may still hold the
>global mutex, though)
> 
> Note this doesn't eliminate latency completely; scheduler artifacts or
> lack of host cpu resources can still cause it.  We may want pre-allocated
> threads when this cannot be tolerated.
> 
> Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.
> 
> Signed-off-by: Avi Kivity 
> ---
> v2: simplify do_spawn_thread() locking

Thanks, applied to the block branch.

Kevin


Merging invasive changes (Was Re: [Qemu-devel] [PULL][v2] Memory API fix)

2011-08-22 Thread Anthony Liguori

Hi,

This is a summary from a discussion Avi and I had earlier about merging 
invasive changes (like the memory API) into the tree in a more fluid 
fashion.  Here's my suggestions:


- Invasive changes over a long period of time (multi batches) should be 
done via git pull.  patch (and therefore git-am) sucks at resolving 
conflicts which are a given for these types of changes.  git-merge is 
much better at this.


- Pull requests should be prefixed with '[PULL]'.

- Patches should go to the ML as often as they are created.  Small sets 
of patches can depend on each other.  Don't dump 40 patches and then 
submit a pull request the next day.


- Pull requests should be batched in a 1-2 week period after having 
received reviews or sufficient time to be reviewed.  If it's something 
that really needs review, make sure it gets review before doing the pull 
request.   Good judgement is required here.  Batching is important to 
avoid breaking the tree daily.  Exceptions will happen and that's okay, 
but they should be exceptions.


- Patches destined for a subsystem should be prefixed with a subsystem 
marker.  For instance, '[PATCH] block: fix QED blah blah'.  If I see the 
prefix for a known maintained subsystem, I will delete the patch from my 
own queue.


- Subsystems are 'known maintained' by having an entry in MAINTAINERS 
and by having a responsive maintainer.


- Invasive but mechanical changes shouldn't be overly split.  60 four-line 
patches that make the same pattern of change are harder to review 
than 6 forty-line patches that apply the same pattern repeatedly.


- Non-mechanical changes should be split of course based on logical 
functional areas.


Regards,

Anthony Liguori

On 08/22/2011 11:54 AM, Avi Kivity wrote:

[v2: fix git url...]

Anthony, please pull from:

git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git memory/core

to receive a memory API fix. The fix is not required for any of the
already merged patches, but is needed by other patches in my queue.

I am sending core patches separately from the conversions to increase
their visibility.


(and I'll take this opportunity to announce a new layout for the branch:

- memory/core - core fixes, usually not rebased
- memory/batch - pending batch
- memory/queue - the rest of the patches in my queue
- memory/master - a merge of the first three; please post patches based
on this branch

this will allow me to fast-path core fixes without rebasing
memory/batch, except when a dependency is involved.)

Richard Henderson (1):
memory: Fix old_portio vs non-zero offset

memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)





[PULL][v2] Memory API fix

2011-08-22 Thread Avi Kivity

[v2: fix git url...]

Anthony, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git memory/core

to receive a memory API fix.  The fix is not required for any of the 
already merged patches, but is needed by other patches in my queue.


I am sending core patches separately from the conversions to increase 
their visibility.



(and I'll take this opportunity to announce a new layout for the branch:

 - memory/core - core fixes, usually not rebased
 - memory/batch - pending batch
 - memory/queue - the rest of the patches in my queue
 - memory/master - a merge of the first three; please post patches 
based on this branch


this will allow me to fast-path core fixes without rebasing 
memory/batch, except when a dependency is involved.)
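
For example, a contributor could base new work on the merged branch with
ordinary git commands (the remote name is arbitrary):

 git remote add avi git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
 git fetch avi
 git checkout -b my-conversion avi/memory/master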


Richard Henderson (1):
  memory: Fix old_portio vs non-zero offset

 memory.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

--
error compiling committee.c: too many arguments to function



[PULL] Memory API fix

2011-08-22 Thread Avi Kivity

Anthony, please pull from:

  ssh://master.kernel.org/pub/scm/virt/kvm/qemu-kvm.git memory/core

to receive a memory API fix.  The fix is not required for any of the 
already merged patches, but is needed by other patches in my queue.


I am sending core patches separately from the conversions to increase 
their visibility.



(and I'll take this opportunity to announce a new layout for the branch:

 - memory/core - core fixes, usually not rebased
 - memory/batch - pending batch
 - memory/queue - the rest of the patches in my queue
 - memory/master - a merge of the first three; please post patches 
based on this branch


this will allow me to fast-path core fixes without rebasing 
memory/batch, except when a dependency is involved.)


Richard Henderson (1):
  memory: Fix old_portio vs non-zero offset

 memory.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v2 00/24] Memory API batch 4: more conversions

2011-08-22 Thread Anthony Liguori

On 08/15/2011 09:17 AM, Avi Kivity wrote:

This patchset converts a few more devices (and boards) to the memory API.
Nothing is really interesting except for the last patch, which brings
proper PAM support (and better SMRAM emulation).  It should also fix the
regressions that were reported in some non-default pc configurations.

v2:
 fix wrong return value in versatile_pci
 don't drop calls to omap_badwidth_*() functions



Applied all.  Thanks.

Regards,

Anthony Liguori



Avi Kivity (24):
   apb_pci: convert to memory API
   apic: convert to memory API
   arm_gic: convert to memory API
   arm_sysctl: convert to memory API
   arm_timer: convert to memory API
   armv7m: convert to memory API
   gt64xxx.c: convert to memory API
   tusb6010: move declarations to new file tusb6010.h
   omap_gpmc/nseries/tusb6010: convert to memory API
   onenand: convert to memory API
   pcie_host: convert to memory API
   ppc405_uc: convert to memory API
   ppc4xx_sdram: convert to memory API
   stellaris_enet: convert to memory API
   sysbus: add a variant of sysbus_init_mmio_cb with an unmap callback
   sh_pci: convert to memory API
   arm11mpcore: use sysbus_init_mmio_cb2
   versatile_pci: convert to memory API
   ppce500_pci: convert to sysbus_init_mmio_cb2()
   sysbus: remove sysbus_init_mmio_cb()
   isa: add isa_address_space()
   pci: add pci_address_space()
   vga: drop get_system_memory() from vga devices and derivatives
   440fx: fix PAM, PCI holes

  hw/apb_pci.c|   84 +++---
  hw/apic.c   |   25 --
  hw/arm11mpcore.c|7 ++-
  hw/arm_gic.c|   22 +++--
  hw/arm_sysctl.c |   27 ---
  hw/arm_timer.c  |   55 --
  hw/armv7m.c |   24 --
  hw/armv7m_nvic.c|3 +-
  hw/cirrus_vga.c |   12 ++--
  hw/devices.h|7 ---
  hw/gt64xxx.c|   36 ++
  hw/isa-bus.c|6 ++
  hw/isa.h|1 +
  hw/mips_jazz.c  |3 +-
  hw/mpcore.c |   37 +++
  hw/nseries.c|1 +
  hw/omap.h   |3 +-
  hw/omap_gpmc.c  |   60 ++--
  hw/onenand.c|   69 +++-
  hw/pc.c |   14 --
  hw/pc.h |   18 +--
  hw/pc_piix.c|   23 +++--
  hw/pci.c|5 ++
  hw/pci.h|1 +
  hw/pcie_host.c  |   98 ++-
  hw/pcie_host.h  |   12 ++--
  hw/piix_pci.c   |  114 -
  hw/ppc405.h |9 ++-
  hw/ppc405_boards.c  |   18 +--
  hw/ppc405_uc.c  |  128 +++---
  hw/ppc440.c |7 ++-
  hw/ppc4xx.h |2 +
  hw/ppc4xx_devs.c|   50 ++--
  hw/ppce500_pci.c|   12 -
  hw/qxl.c|2 +-
  hw/realview_gic.c   |   38 +++
  hw/sh_pci.c |   63 +
  hw/stellaris_enet.c |   29 +---
  hw/sysbus.c |9 ++-
  hw/sysbus.h |5 +-
  hw/tusb6010.c   |   32 ++---
  hw/tusb6010.h   |   28 +++
  hw/versatile_pci.c  |   92 +---
  hw/vga-isa-mm.c |   15 +++---
  hw/vga-isa.c|5 +-
  hw/vga-pci.c|4 +-
  hw/vga.c|9 ++--
  hw/vga_int.h|4 +-
  hw/vmware_vga.c |9 ++--
  49 files changed, 703 insertions(+), 634 deletions(-)
  create mode 100644 hw/tusb6010.h





Re: [Qemu-devel] [GIT PULL] Memory API fixes

2011-08-22 Thread Anthony Liguori

On 08/22/2011 10:42 AM, Avi Kivity wrote:

Anthony, please pull from:

git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
refs/tags/memory-region-fixes-1


Also, please use the tag '[PULL]' as that's what I (and presumably 
others) currently filter for.


Regards,

Anthony Liguori



to receive a memory API fix. The fix is not required for any of the
already merged patches, but is needed by other patches in my queue.

I am sending core patches separately from the conversions to increase
their visibility.

Richard Henderson (1):
memory: Fix old_portio vs non-zero offset

memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)





Re: [Qemu-devel] [GIT PULL] Memory API fixes

2011-08-22 Thread Anthony Liguori

On 08/22/2011 10:42 AM, Avi Kivity wrote:

Anthony, please pull from:

git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
refs/tags/memory-region-fixes-1


I don't fetch tags from subtrees at the moment.  Could you make this a 
branch instead?


Regards,

Anthony Liguori



to receive a memory API fix. The fix is not required for any of the
already merged patches, but is needed by other patches in my queue.

I am sending core patches separately from the conversions to increase
their visibility.

Richard Henderson (1):
memory: Fix old_portio vs non-zero offset

memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)





Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Alex Williamson
On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > capture the plan that I think we agreed to:
> > 
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > 
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> > 
> > (I use a PCI example here, but the attribute should not be PCI specific)
> 
> Ok.  Am I correct in thinking these group IDs are representing the
> minimum granularity, and are therefore always static, defined only by
> the connected hardware, not by configuration?

Yes, that's the idea.  An open question I have towards the configuration
side is whether we might add iommu driver specific options to the
groups.  For instance on x86 where we typically have B:D.F granularity,
should we have an option not to trust multi-function devices and use a
B:D granularity for grouping?

> > From there we have a few options.  In the BoF we discussed a model where
> > binding a device to vfio creates a /dev/vfio$GROUP character device
> > file.  This "group" fd provides dma mapping ioctls as well as
> > ioctls to enumerate and return a "device" fd for each attached member of
> > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > returning an error on open() of the group fd if there are members of the
> > group not bound to the vfio driver.  Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> 
> It seems a slightly strange distinction that the group device appears
> when any device in the group is bound to vfio, but only becomes usable
> when all devices are bound.
> 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> 
> Which is why I marginally prefer this model, although it's not a big
> deal.

Right, we can also combine models.  Binding a device to vfio
creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
device access until all the group devices are also bound.  I think
the /dev/vfio/$GROUP might help provide an enumeration interface as well
though, which could be useful.

> > In either case, the uiommu interface is removed entirely since dma
> > mapping is done via the group fd.  As necessary in the future, we can
> > define a more high performance dma mapping interface for streaming dma
> > via the group fd.  I expect we'll also include architecture specific
> > group ioctls to describe features and capabilities of the iommu.  The
> > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > to userspace process ownership model.
> 
> A 1:1 group<->process correspondance seems wrong to me. But there are
> many ways you could legitimately write the userspace side of the code,
> many of them involving some sort of concurrency.  Implementing that
> concurrency as multiple processes (using explicit shared memory and/or
> other IPC mechanisms to co-ordinate) seems a valid choice that we
> shouldn't arbitrarily prohibit.
> 
> Obviously, only one UID may be permitted to have the group open at a
> time, and I think that's enough to prevent them doing any worse than
> shooting themselves in the foot.

1:1 group<->process is probably too strong.  Not allowing concurrent
open()s on the group file enforces that a single userspace entity is
responsible for that group.  Device fds can be passed to other
processes, but only retrieved via the group fd.  I suppose we could even
branch off the dma interface into a different fd, but it seems like we
would logically want to serialize dma mappings at each iommu group
anyway.  I'm open to alternatives, this just seemed an easy way to do
it.  Restricting on UID implies that we require isolated qemu instances
to run as different UIDs.  I know that's a goal, but I don't know if we
want to make it an assumption in the group security model.
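
Passing an fd between processes is standard SCM_RIGHTS usage; for
reference, a minimal sketch of the sending side, assuming sock is an
already-connected AF_UNIX socket:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send one open fd over a connected unix-domain socket */
static int send_fd(int sock, int fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	char buf[CMSG_SPACE(sizeof(int))];
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = buf, .msg_controllen = sizeof(buf) };
	struct cmsghdr *cmsg;

	memset(buf, 0, sizeof(buf));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0);
}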

> > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > We could keep the same model of segmenting the device fd address space,
> > perhaps adding ioc

[GIT PULL] Memory API fixes

2011-08-22 Thread Avi Kivity

Anthony, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git 
refs/tags/memory-region-fixes-1


to receive a memory API fix.  The fix is not required for any of the 
already merged patches, but is needed by other patches in my queue.


I am sending core patches separately from the conversions to increase 
their visibility.


Richard Henderson (1):
  memory: Fix old_portio vs non-zero offset

 memory.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 0/2] Fix wide ioport access cracking

2011-08-22 Thread Anthony Liguori

On 08/11/2011 02:40 AM, Avi Kivity wrote:

The memory API automatically cracks wide memory accesses into narrower
(usually byte) accesses when needed.  Unfortunately this wasn't implemented
for ioports, leading to an lsi53c895a failure.

This series implements cracking for ioports as well.

Note that the dual implementation is due to the fact the memory API is layered
on top of the original qemu API; eventually the same code will be used for
both ioports and mmio.

Avi Kivity (2):
   memory: abstract cracking of write access ops into a function
   memory: crack wide ioport accesses into smaller ones when needed

  memory.c |  120 +++--
  1 files changed, 77 insertions(+), 43 deletions(-)


Applied.  Thanks.

Regards,

Anthony Liguori




Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 09:17:41AM -0400, Avi Kivity wrote:
> On 08/22/2011 04:15 PM, Roedel, Joerg wrote:
> > On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> > >  On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
> >
> > >  Well, I don't think it's really meaningless, but we need some way to
> > >  >  communicate the information about device groups to userspace.
> > >
> > >  I mean the contents of the group descriptor.  There are enough 42s in
> > >  the kernel, it's better if we can replace a synthetic number with
> > >  something meaningful.
> >
> > If we only look at PCI then a Segment:Bus:Dev.Fn Number would be
> > sufficient, of course. But the idea was to make it generic enough so
> > that it works with !PCI too.
> >
> 
> We could make it an arch defined string instead of a symlink.  So it 
> doesn't return 42, rather something that can be used by the admin to 
> figure out what the problem was.

Well, ok, it would certainly differ from the in-kernel representation
then and introduce new architecture dependencies into libvirt. But if
the 'group-string' is more meaningful to users then it's certainly good.
Suggestions?

Joerg

-- 
AMD Operating System Research Center




Re: [PATCH 02/11] KVM: x86: tag the instructions which are used to write page table

2011-08-22 Thread Avi Kivity

On 08/22/2011 05:32 PM, Marcelo Tosatti wrote:

On Tue, Aug 16, 2011 at 02:41:27PM +0800, Xiao Guangrong wrote:
>  The idea is from Avi:
>  | tag instructions that are typically used to modify the page tables, and
>  | drop shadow if any other instruction is used.
>  | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
>  | and cmpxchg8b.
>
>  This patch is used to tag the instructions and in the later path, shadow page
>  is dropped if it is written by other instructions

What is the advantage of doing this again? What is the point of
dropping shadow if the instruction is emulated?



So it won't be emulated again; the assumption is that if you addl into a 
page, it isn't a pagetable.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 02/11] KVM: x86: tag the instructions which are used to write page table

2011-08-22 Thread Marcelo Tosatti
On Tue, Aug 16, 2011 at 02:41:27PM +0800, Xiao Guangrong wrote:
> The idea is from Avi:
> | tag instructions that are typically used to modify the page tables, and
> | drop shadow if any other instruction is used.
> | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
> | and cmpxchg8b.
> 
> This patch is used to tag the instructions and in the later path, shadow page
> is dropped if it is written by other instructions

What is the advantage of doing this again? What is the point of
dropping shadow if the instruction is emulated?



Re: [PATCH] pcnet-pci: fix wrong opaque given to I/O accessors

2011-08-22 Thread Gerhard Wiesinger

Hello Avi,

Thnx, fixed: OK, maybe some credits :-)

Acked-by: Gerhard Wiesinger 

This pattern is still present at the following places (maybe some further 
problems!!!) and I guess it has to be fixed, too:


grep -ir 'ops, s, "' .
./hw/rtl8139.c:memory_region_init_io(&s->bar_io, &rtl8139_io_ops, s, 
"rtl8139", 0x100);
./hw/rtl8139.c:memory_region_init_io(&s->bar_mem, &rtl8139_mmio_ops, s, 
"rtl8139", 0x100);
./hw/mac_nvram.c:memory_region_init_io(&s->mem, &macio_nvram_ops, s, 
"macio-nvram",
./hw/escc.c:memory_region_init_io(&s->mmio, &escc_mem_ops, s, "escc",
./hw/es1370.c:memory_region_init_io (&s->io, &es1370_io_ops, s, "es1370", 
256);
./hw/usb-uhci.c:memory_region_init_io(&s->io_bar, &uhci_ioport_ops, s, 
"uhci", 0x20);
./hw/usb-ehci.c:memory_region_init_io(&s->mem, &ehci_mem_ops, s, "ehci", 
MMIO_SIZE);
./hw/ne2000.c:memory_region_init_io(&s->io, &ne2000_ops, s, "ne2000", size);
./hw/ide/ahci.c:memory_region_init_io(&s->mem, &ahci_mem_ops, s, "ahci", 
0x1000);
./hw/ac97.c:memory_region_init_io (&s->io_nam, &ac97_io_nam_ops, s, 
"ac97-nam", 1024);
./hw/ac97.c:memory_region_init_io (&s->io_nabm, &ac97_io_nabm_ops, s, 
"ac97-nabm", 256);
./hw/lsi53c895a.c:memory_region_init_io(&s->mmio_io, &lsi_mmio_ops, s, 
"lsi-mmio", 0x400);
./hw/lsi53c895a.c:memory_region_init_io(&s->ram_io, &lsi_ram_ops, s, 
"lsi-ram", 0x2000);
./hw/lsi53c895a.c:memory_region_init_io(&s->io_io, &lsi_io_ops, s, 
"lsi-io", 256);
./hw/mac_dbdma.c:memory_region_init_io(&s->mem, &dbdma_ops, s, "dbdma", 
0x1000);
./hw/eepro100.c:memory_region_init_io(&s->mmio_bar, &eepro100_ops, s, 
"eepro100-mmio",
./hw/eepro100.c:memory_region_init_io(&s->io_bar, &eepro100_ops, s, 
"eepro100-io",
./hw/eepro100.c:memory_region_init_io(&s->flash_bar, &eepro100_ops, s, 
"eepro100-flash",

Ciao,
Gerhard

--
http://www.wiesinger.com/

On Mon, 22 Aug 2011, Avi Kivity wrote:


Another casualty of the memory API conversion.

Signed-off-by: Avi Kivity 
---
hw/pcnet-pci.c |4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c
index 13d9380..51e1320 100644
--- a/hw/pcnet-pci.c
+++ b/hw/pcnet-pci.c
@@ -290,10 +290,10 @@ static int pci_pcnet_init(PCIDevice *pci_dev)
pci_conf[PCI_MAX_LAT] = 0xff;

/* Handler for memory-mapped I/O */
-memory_region_init_io(&d->state.mmio, &pcnet_mmio_ops, d, "pcnet-mmio",
+memory_region_init_io(&d->state.mmio, &pcnet_mmio_ops, s, "pcnet-mmio",
  PCNET_PNPMMIO_SIZE);

-memory_region_init_io(&d->io_bar, &pcnet_io_ops, d, "pcnet-io",
+memory_region_init_io(&d->io_bar, &pcnet_io_ops, s, "pcnet-io",
  PCNET_IOPORT_SIZE);
pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &d->io_bar);

--
1.7.5.3





[PATCH] pcnet-pci: fix wrong opaque given to I/O accessors

2011-08-22 Thread Avi Kivity
Another casualty of the memory API conversion.

Signed-off-by: Avi Kivity 
---
 hw/pcnet-pci.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c
index 13d9380..51e1320 100644
--- a/hw/pcnet-pci.c
+++ b/hw/pcnet-pci.c
@@ -290,10 +290,10 @@ static int pci_pcnet_init(PCIDevice *pci_dev)
 pci_conf[PCI_MAX_LAT] = 0xff;
 
 /* Handler for memory-mapped I/O */
-memory_region_init_io(&d->state.mmio, &pcnet_mmio_ops, d, "pcnet-mmio",
+memory_region_init_io(&d->state.mmio, &pcnet_mmio_ops, s, "pcnet-mmio",
   PCNET_PNPMMIO_SIZE);
 
-memory_region_init_io(&d->io_bar, &pcnet_io_ops, d, "pcnet-io",
+memory_region_init_io(&d->io_bar, &pcnet_io_ops, s, "pcnet-io",
   PCNET_IOPORT_SIZE);
 pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &d->io_bar);
 
-- 
1.7.5.3



[PATCH 1/3] KVM: x86 emulator: make prototype of ->write_std() the same as ->write_emulated

2011-08-22 Thread Avi Kivity
This allows implementations to reuse code.

Signed-off-by: Avi Kivity 
---
 arch/x86/include/asm/kvm_emulate.h |3 ++-
 arch/x86/kvm/x86.c |4 ++--
 arch/x86/kvm/x86.h |2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..31d2d80 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -105,7 +105,8 @@ struct x86_emulate_ops {
 *  @bytes: [IN ] Number of bytes to write to memory.
 */
int (*write_std)(struct x86_emulate_ctxt *ctxt,
-unsigned long addr, void *val, unsigned int bytes,
+unsigned long addr, const void *val,
+unsigned int bytes,
 struct x86_exception *fault);
/*
 * fetch: Read bytes of standard (non-emulated/special) memory.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b37f18..cb7b825 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3989,12 +3989,12 @@ static int kvm_read_guest_virt_system(struct 
x86_emulate_ctxt *ctxt,
 }
 
 int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
-  gva_t addr, void *val,
+  gva_t addr, const void *val,
   unsigned int bytes,
   struct x86_exception *exception)
 {
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-   void *data = val;
+   const void *data = val;
int r = X86EMUL_CONTINUE;
 
while (bytes) {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d36fe23..a1a0a9b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -122,7 +122,7 @@ int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
struct x86_exception *exception);
 
 int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
-   gva_t addr, void *val, unsigned int bytes,
+   gva_t addr, const void *val, unsigned int bytes,
struct x86_exception *exception);
 
 #endif
-- 
1.7.5.3



[PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-22 Thread Avi Kivity
The x86 emulator is directly exposed to guest code; therefore it is part
of the host's attack surface.  To reduce the risk of
vulnerabilities, this patch adds a fuzz test that runs random instructions
through the emulator.  A vulnerability will usually result in an oops.

One way to run the test is via KVM itself:

  qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
  -append 'console=ttyS0 test_emulator.iterations=10'

this requires that the test module be built into the kernel.

Signed-off-by: Avi Kivity 
---
 arch/x86/Kbuild  |1 +
 arch/x86/kvm/Kconfig |   11 +
 arch/x86/kvm/Makefile|1 +
 arch/x86/kvm/test-emulator.c |  533 ++
 4 files changed, 546 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kvm/test-emulator.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 0e9dec6..0d80e6f 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -1,5 +1,6 @@
 
 obj-$(CONFIG_KVM) += kvm/
+obj-$(CONFIG_KVM_EMULATOR_TEST) += kvm/
 
 # Xen paravirtualization support
 obj-$(CONFIG_XEN) += xen/
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..9ffc30a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -76,6 +76,17 @@ config KVM_MMU_AUDIT
 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 audit  KVM MMU at runtime.
 
+config KVM_EMULATOR_TEST
+tristate "KVM emulator self test"
+   depends on KVM
+   ---help---
+Build test code that checks the x86 emulator during boot or module
+ insertion.  If built as a module, it will be called test-emulator.ko.
+
+The emulator test will run for as many iterations as are specified by
+ the emulator_test.iterations parameter; all processors will be
+ utilized.  When the test is complete, results are reported in dmesg.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..fc4a9e2 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -19,3 +19,4 @@ kvm-amd-y += svm.o
 obj-$(CONFIG_KVM)  += kvm.o
 obj-$(CONFIG_KVM_INTEL)+= kvm-intel.o
 obj-$(CONFIG_KVM_AMD)  += kvm-amd.o
+obj-$(CONFIG_KVM_EMULATOR_TEST) += test-emulator.o
diff --git a/arch/x86/kvm/test-emulator.c b/arch/x86/kvm/test-emulator.c
new file mode 100644
index 000..1e3a22f
--- /dev/null
+++ b/arch/x86/kvm/test-emulator.c
@@ -0,0 +1,533 @@
+/*
+ * x86 instruction emulator test
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ *   Avi Kivity   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static ulong iterations = 0;
+module_param(iterations, ulong, S_IRUGO);
+
+struct test_context {
+   struct work_struct work;
+   struct completion completion;
+   struct x86_emulate_ctxt ctxt;
+   struct test_context *next;
+   bool failed;
+   u8 insn[15];
+   bool insn_base_valid;
+   ulong insn_base;
+   struct test_seg {
+   u16 selector;
+   struct desc_struct desc;
+   u32 base3;
+   bool valid;
+   } segs[8];
+   ulong iterations;
+   ulong completed;
+   ulong decoded;
+   ulong emulated;
+   ulong nofault;
+   ulong failures;
+};
+
+static u64 random64(void)
+{
+   return random32() | ((u64)random32() << 32);
+}
+
+static ulong randlong(void)
+{
+   if (sizeof(ulong) == sizeof(u32))
+   return random32();
+   else
+   return random64();
+}
+
+static struct test_context *to_test(struct x86_emulate_ctxt *ctxt)
+{
+   return container_of(ctxt, struct test_context, ctxt);
+}
+
+static void fail(struct x86_emulate_ctxt *ctxt, const char *msg, ...)
+   __attribute__((format(printf, 2, 3)));
+
+static void fail(struct x86_emulate_ctxt *ctxt, const char *msg, ...)
+{
+   va_list args;
+   char s[200];
+
+   va_start(args, msg);
+   vsnprintf(s, sizeof(s), msg, args);
+   va_end(args);
+   printk("emulator test failure: %s\n", s);
+   to_test(ctxt)->failed = true;
+}
+
+static int test_fill_exception(struct x86_exception *ex)
+{
+   if (random32() % 4 == 0) {
+   if (ex) {
+   ex->vector = random32();
+   ex->error_code_valid = random32();
+   ex->error_code = random32();
+   ex->nested_page_fault = random32();
+   ex->address = random64();
+   }
+   return X86EMUL_PROPAGATE_FAULT;
+   }
+   return X86EMUL_CONTINUE;
+}
+
+static int rand_error(void)
+{
+   switch (random32() % 8) {
+   cas

[PATCH 2/3] KVM: x86 emulator: export main interface

2011-08-22 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0453c07..aea313b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3685,6 +3685,7 @@ done:
 
return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }
+EXPORT_SYMBOL_GPL(x86_decode_insn);
 
 static bool string_insn_completed(struct x86_emulate_ctxt *ctxt)
 {
@@ -4249,3 +4250,4 @@ twobyte_insn:
 cannot_emulate:
return EMULATION_FAILED;
 }
+EXPORT_SYMBOL_GPL(x86_emulate_insn);
-- 
1.7.5.3



[PATCH 0/3] Emulator fuzz tester

2011-08-22 Thread Avi Kivity
As it is exposed directly to guest code, the x86 emulator is an interesting
target for exploiters: a vulnerability may lead to compromise of the entire
host.

In an attempt to catch vulnerabilities before they make it into production
code, this patchset adds a fuzz tester for the emulator.  Instructions
are synthesized and fed into the emulator; a vulnerability will usually
result in an access violation.

I tried to make the emulator test build and run in userspace; this proved too
difficult, so the test is built as part of the kernel.  It can still be run
in userspace, via KVM:

  qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
  -append 'console=ttyS0 test_emulator.iterations=10'

  ...
  starting emulator test
  emulator fuzz test results
    instructions: 10
    decoded:      94330032
    emulated:     92529152
    nofault:      117
    failures:     0
  emulator test: PASS
  ...

One billion random instructions failed to find a vulnerability, so either
the emulator is really good, or the test is really bad, or we need a lot more
runtime.

Lucas, how would we go about integrating this into kvm-autotest?

Avi Kivity (3):
  KVM: x86 emulator: make prototype of ->write_std() the same as
->write_emulated
  KVM: x86 emulator: export main interface
  KVM: x86 emulator: fuzz tester

 arch/x86/Kbuild|1 +
 arch/x86/include/asm/kvm_emulate.h |3 +-
 arch/x86/kvm/Kconfig   |   11 +
 arch/x86/kvm/Makefile  |1 +
 arch/x86/kvm/emulate.c |2 +
 arch/x86/kvm/test-emulator.c   |  533 
 arch/x86/kvm/x86.c |4 +-
 arch/x86/kvm/x86.h |2 +-
 8 files changed, 553 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kvm/test-emulator.c

-- 
1.7.5.3



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 04:15 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
>  On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

>  >  Well, I don't think it's really meaningless, but we need some way to
>  >  communicate the information about device groups to userspace.
>
>  I mean the contents of the group descriptor.  There are enough 42s in
>  the kernel, it's better if we can replace a synthetic number with
>  something meaningful.

If we only look at PCI then a Segment:Bus:Dev.Fn Number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.



We could make it an arch defined string instead of a symlink.  So it 
doesn't return 42, rather something that can be used by the admin to 
figure out what the problem was.


--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

> > Well, I don't think it's really meaningless, but we need some way to
> > communicate the information about device groups to userspace.
> 
> I mean the contents of the group descriptor.  There are enough 42s in 
> the kernel, it's better if we can replace a synthetic number with 
> something meaningful.

If we only look at PCI then a Segment:Bus:Dev.Fn Number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.

Joerg

-- 
AMD Operating System Research Center




Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
>  On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
>  >  On the AMD IOMMU side this information is stored in the IVRS ACPI table.
>  >  Not sure about the VT-d side, though.
>
>  I see.  There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
the AMD IOMMU driver in the past and I needed to fix that. That's how I
know :)


Well, too bad.



>  I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.



I mean the contents of the group descriptor.  There are enough 42s in 
the kernel, it's better if we can replace a synthetic number with 
something meaningful.


--
error compiling committee.c: too many arguments to function



[PATCH 6/6] kvm tools: Separate virtio-pci layer out of virtio-blk

2011-08-22 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/virtio/blk.c |  263 
 1 files changed, 88 insertions(+), 175 deletions(-)

diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
index 2ccf585..2e047d7 100644
--- a/tools/kvm/virtio/blk.c
+++ b/tools/kvm/virtio/blk.c
@@ -1,10 +1,8 @@
 #include "kvm/virtio-blk.h"
 
 #include "kvm/virtio-pci-dev.h"
-#include "kvm/irq.h"
 #include "kvm/disk-image.h"
 #include "kvm/virtio.h"
-#include "kvm/ioport.h"
 #include "kvm/mutex.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
@@ -12,10 +10,12 @@
 #include "kvm/threadpool.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -41,88 +41,19 @@ struct blk_dev {
pthread_mutex_t mutex;
struct list_headlist;
 
+   struct virtio_pci   vpci;
struct virtio_blk_configblk_config;
struct disk_image   *disk;
-   u64 base_addr;
-   u32 host_features;
-   u32 guest_features;
-   u16 config_vector;
-   u8  status;
-   u8  isr;
int compat_id;
-
-   /* virtio queue */
-   u16 queue_selector;
+   u32 features;
 
struct virt_queue   vqs[NUM_VIRT_QUEUES];
struct blk_dev_job  jobs[VIRTIO_BLK_QUEUE_SIZE];
u16 job_idx;
-   struct pci_device_headerpci_hdr;
 };
 
 static LIST_HEAD(bdevs);
 
-static bool virtio_blk_dev_in(struct blk_dev *bdev, void *data, unsigned long 
offset, int size)
-{
-   u8 *config_space = (u8 *) &bdev->blk_config;
-
-   if (size != 1)
-   return false;
-
-   ioport__write8(data, config_space[offset - VIRTIO_MSI_CONFIG_VECTOR]);
-
-   return true;
-}
-
-static bool virtio_blk_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
-{
-   struct blk_dev *bdev;
-   u16 offset;
-   bool ret = true;
-
-   bdev= ioport->priv;
-   offset  = port - bdev->base_addr;
-
-   mutex_lock(&bdev->mutex);
-
-   switch (offset) {
-   case VIRTIO_PCI_HOST_FEATURES:
-   ioport__write32(data, bdev->host_features);
-   break;
-   case VIRTIO_PCI_GUEST_FEATURES:
-   ret = false;
-   break;
-   case VIRTIO_PCI_QUEUE_PFN:
-   ioport__write32(data, bdev->vqs[bdev->queue_selector].pfn);
-   break;
-   case VIRTIO_PCI_QUEUE_NUM:
-   ioport__write16(data, VIRTIO_BLK_QUEUE_SIZE);
-   break;
-   case VIRTIO_PCI_QUEUE_SEL:
-   case VIRTIO_PCI_QUEUE_NOTIFY:
-   ret = false;
-   break;
-   case VIRTIO_PCI_STATUS:
-   ioport__write8(data, bdev->status);
-   break;
-   case VIRTIO_PCI_ISR:
-   ioport__write8(data, bdev->isr);
-   kvm__irq_line(kvm, bdev->pci_hdr.irq_line, VIRTIO_IRQ_LOW);
-   bdev->isr = VIRTIO_IRQ_LOW;
-   break;
-   case VIRTIO_MSI_CONFIG_VECTOR:
-   ioport__write16(data, bdev->config_vector);
-   break;
-   default:
-   ret = virtio_blk_dev_in(bdev, data, offset, size);
-   break;
-   };
-
-   mutex_unlock(&bdev->mutex);
-
-   return ret;
-}
-
 static void virtio_blk_do_io_request(struct kvm *kvm, void *param)
 {
struct virtio_blk_outhdr *req;
@@ -172,7 +103,7 @@ static void virtio_blk_do_io_request(struct kvm *kvm, void 
*param)
virt_queue__set_used_elem(queue, head, block_cnt);
mutex_unlock(&bdev->mutex);
 
-   virt_queue__trigger_irq(queue, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
+   virtio_pci__signal_vq(kvm, &bdev->vpci, queue - bdev->vqs);
 }
 
 static void virtio_blk_do_io(struct kvm *kvm, struct virt_queue *vq, struct 
blk_dev *bdev)
@@ -191,82 +122,93 @@ static void virtio_blk_do_io(struct kvm *kvm, struct 
virt_queue *vq, struct blk_
}
 }
 
-static bool virtio_blk_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
+static void ioevent_callback(struct kvm *kvm, void *param)
 {
-   struct blk_dev *bdev;
-   u16 offset;
-   bool ret = true;
+   struct blk_dev *bdev = param;
 
-   bdev= ioport->priv;
-   offset  = port - bdev->base_addr;
+   virtio_blk_do_io(kvm, &bdev->vqs[0], bdev);
+}
 
-   mutex_lock(&bdev->mutex);
+static void set_config(struct kvm *kvm, void *dev, u8 data, u32 offset)
+{
+   struct blk_dev *bdev = dev;
 
-   switch (offset) {
-   case VIRTIO_PCI_GUEST_FEATURES:
-  

[PATCH 5/6] kvm tools: Separate virtio-pci layer out of virtio-9p

2011-08-22 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/builtin-run.c   |8 +-
 tools/kvm/include/kvm/virtio-9p.h |   16 ++--
 tools/kvm/virtio/9p.c |  247 +++--
 3 files changed, 111 insertions(+), 160 deletions(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 501e0bf..00f9b25 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -106,7 +106,7 @@ static int img_name_parser(const struct option *opt, const 
char *arg, int unset)
char tmp[PATH_MAX];
 
if (realpath(arg, tmp) == 0 ||
-   virtio_9p__init(kvm, tmp, "/dev/root") < 0)
+   virtio_9p__register(kvm, tmp, "/dev/root") < 0)
die("Unable to initialize virtio 9p");
using_rootfs = 1;
return 0;
@@ -144,7 +144,7 @@ static int virtio_9p_rootdir_parser(const struct option 
*opt, const char *arg, i
tag_name++;
}
if (realpath(arg, tmp)) {
-   if (virtio_9p__init(kvm, tmp, tag_name) < 0)
+   if (virtio_9p__register(kvm, tmp, tag_name) < 0)
die("Unable to initialize virtio 9p");
} else
die("Failed resolving 9p path");
@@ -598,7 +598,7 @@ int kvm_cmd_run(int argc, const char **argv, const char 
*prefix)
strlcat(real_cmdline, kernel_cmdline, sizeof(real_cmdline));
 
if (!using_rootfs && !image_filename[0]) {
-   if (virtio_9p__init(kvm, "/", "/dev/root") < 0)
+   if (virtio_9p__register(kvm, "/", "/dev/root") < 0)
die("Unable to initialize virtio 9p");
 
using_rootfs = 1;
@@ -650,6 +650,8 @@ int kvm_cmd_run(int argc, const char **argv, const char 
*prefix)
if (!network)
network = DEFAULT_NETWORK;
 
+   virtio_9p__init(kvm);
+
if (strncmp(network, "none", 4)) {
net_params.guest_ip = guest_ip;
net_params.host_ip = host_ip;
diff --git a/tools/kvm/include/kvm/virtio-9p.h 
b/tools/kvm/include/kvm/virtio-9p.h
index 70db831..7ce56b7 100644
--- a/tools/kvm/include/kvm/virtio-9p.h
+++ b/tools/kvm/include/kvm/virtio-9p.h
@@ -3,9 +3,11 @@
 #include "kvm/virtio.h"
 #include "kvm/pci.h"
 #include "kvm/threadpool.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
+#include 
 
 #define NUM_VIRT_QUEUES1
 #define VIRTQUEUE_NUM  128
@@ -38,21 +40,18 @@ struct p9_dev_job {
 };
 
 struct p9_dev {
-   u8  status;
-   u8  isr;
-   u16 config_vector;
-   u32 features;
+   struct list_head list;
+   struct virtio_pci   vpci;
+   
struct virtio_9p_config *config;
-   u16 base_addr;
int compat_id;
+   u32 features;
 
/* virtio queue */
-   u16 queue_selector;
struct virt_queue   vqs[NUM_VIRT_QUEUES];
struct p9_dev_job   jobs[NUM_VIRT_QUEUES];
struct p9_fid   fids[VIRTIO_P9_MAX_FID];
charroot_dir[PATH_MAX];
-   struct pci_device_header pci_hdr;
 };
 
 struct p9_pdu {
@@ -67,7 +66,8 @@ struct p9_pdu {
 
 struct kvm;
 
-int virtio_9p__init(struct kvm *kvm, const char *root, const char *tag_name);
+int virtio_9p__register(struct kvm *kvm, const char *root, const char *tag_name);
+int virtio_9p__init(struct kvm *kvm);
 int virtio_p9_pdu_readf(struct p9_pdu *pdu, const char *fmt, ...);
 int virtio_p9_pdu_writef(struct p9_pdu *pdu, const char *fmt, ...);
 
diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index 8a4fe73..e5c106a 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+static LIST_HEAD(devs);
+
 /* Warning: Immediately use value returned from this function */
 static const char *rel_to_abs(struct p9_dev *p9dev,
  const char *path, char *abs_path)
@@ -28,62 +30,6 @@ static const char *rel_to_abs(struct p9_dev *p9dev,
return abs_path;
 }
 
-static bool virtio_p9_dev_in(struct p9_dev *p9dev, void *data,
-unsigned long offset,
-int size)
-{
-   u8 *config_space = (u8 *) p9dev->config;
-
-   if (size != 1)
-   return false;
-
-   ioport__write8(data, config_space[offset - VIRTIO_MSI_CONFIG_VECTOR]);
-
-   return true;
-}
-
-static bool virtio_p9_pci_io_in(struct ioport *ioport, struct kvm *kvm,
-   u16 port, void *data, int size)
-{
-   bool ret = true;
-   unsigned long offset;
-   struct p9_dev *p9dev = ioport->priv;
-
-
-   offset = port - p9dev->base_addr;
-
-   switch (offset) {
-   case VIRTIO_PCI_HOST_FEATURES:
-   ioport__write32(data, p9dev->features);
- 

[PATCH 4/6] kvm tools: Separate virtio-pci layer out of virtio-balloon

2011-08-22 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/virtio/balloon.c |  267 
 1 files changed, 97 insertions(+), 170 deletions(-)

diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c
index 7f3fe65..6b93121 100644
--- a/tools/kvm/virtio/balloon.c
+++ b/tools/kvm/virtio/balloon.c
@@ -2,24 +2,24 @@
 
 #include "kvm/virtio-pci-dev.h"
 
-#include "kvm/disk-image.h"
 #include "kvm/virtio.h"
-#include "kvm/ioport.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
 #include "kvm/pci.h"
 #include "kvm/threadpool.h"
-#include "kvm/irq.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define NUM_VIRT_QUEUES 3
@@ -29,17 +29,12 @@
 #define VIRTIO_BLN_STATS   2
 
 struct bln_dev {
-   struct pci_device_header pci_hdr;
 struct list_head list;
+   struct virtio_pci   vpci;
 
-   u16 base_addr;
-   u8  status;
-   u8  isr;
-   u16 config_vector;
-   u32 host_features;
+   u32 features;
 
/* virtio queue */
-   u16 queue_selector;
struct virt_queue   vqs[NUM_VIRT_QUEUES];
struct thread_pool__job jobs[NUM_VIRT_QUEUES];
 
@@ -56,68 +51,6 @@ struct bln_dev {
 static struct bln_dev bdev;
 extern struct kvm *kvm;
 
-static bool virtio_bln_dev_in(void *data, unsigned long offset, int size)
-{
-   u8 *config_space = (u8 *) &bdev.config;
-
-   if (size != 1)
-   return false;
-
-   ioport__write8(data, config_space[offset - VIRTIO_MSI_CONFIG_VECTOR]);
-
-   return true;
-}
-
-static bool virtio_bln_dev_out(void *data, unsigned long offset, int size)
-{
-   u8 *config_space = (u8 *) &bdev.config;
-
-   if (size != 1)
-   return false;
-
-   config_space[offset - VIRTIO_MSI_CONFIG_VECTOR] = *(u8 *)data;
-
-   return true;
-}
-
-static bool virtio_bln_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
-{
-   unsigned long offset;
-   bool ret = true;
-
-   offset = port - bdev.base_addr;
-
-   switch (offset) {
-   case VIRTIO_PCI_HOST_FEATURES:
-   ioport__write32(data, bdev.host_features);
-   break;
-   case VIRTIO_PCI_GUEST_FEATURES:
-   case VIRTIO_PCI_QUEUE_SEL:
-   case VIRTIO_PCI_QUEUE_NOTIFY:
-   ret = false;
-   break;
-   case VIRTIO_PCI_QUEUE_PFN:
-   ioport__write32(data, bdev.vqs[bdev.queue_selector].pfn);
-   break;
-   case VIRTIO_PCI_QUEUE_NUM:
-   ioport__write16(data, VIRTIO_BLN_QUEUE_SIZE);
-   break;
-   case VIRTIO_PCI_STATUS:
-   ioport__write8(data, bdev.status);
-   break;
-   case VIRTIO_PCI_ISR:
-   ioport__write8(data, bdev.isr);
-   kvm__irq_line(kvm, bdev.pci_hdr.irq_line, VIRTIO_IRQ_LOW);
-   bdev.isr = VIRTIO_IRQ_LOW;
-   break;
-   default:
-   ret = virtio_bln_dev_in(data, offset, size);
-   break;
-   };
-
-   return ret;
-}
-
 static bool virtio_bln_do_io_request(struct kvm *kvm, struct bln_dev *bdev, struct virt_queue *queue)
 {
struct iovec iov[VIRTIO_BLN_QUEUE_SIZE];
@@ -182,13 +115,13 @@ static void virtio_bln_do_io(struct kvm *kvm, void *param)
 
if (vq == &bdev.vqs[VIRTIO_BLN_STATS]) {
virtio_bln_do_stat_request(kvm, &bdev, vq);
-   virt_queue__trigger_irq(vq, bdev.pci_hdr.irq_line, &bdev.isr, kvm);
+   virtio_pci__signal_vq(kvm, &bdev.vpci, VIRTIO_BLN_STATS);
return;
}
 
while (virt_queue__available(vq)) {
virtio_bln_do_io_request(kvm, &bdev, vq);
-   virt_queue__trigger_irq(vq, bdev.pci_hdr.irq_line, &bdev.isr, kvm);
+   virtio_pci__signal_vq(kvm, &bdev.vpci, vq - bdev.vqs);
}
 }
 
@@ -197,82 +130,13 @@ static void ioevent_callback(struct kvm *kvm, void *param)
thread_pool__do_job(param);
 }
 
-static bool virtio_bln_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
-{
-   unsigned long offset;
-   bool ret = true;
-   struct ioevent ioevent;
-
-   offset = port - bdev.base_addr;
-
-   switch (offset) {
-   case VIRTIO_MSI_QUEUE_VECTOR:
-   case VIRTIO_PCI_GUEST_FEATURES:
-   break;
-   case VIRTIO_PCI_QUEUE_PFN: {
-   struct virt_queue *queue;
-   void *p;
-
-   compat__remove_message(bdev.compat_id);
-
-   queue   = &bdev.vqs[bdev.queue_selector];
-   queue->pfn  = ioport__read32(data);
-   p

[PATCH 3/6] kvm tools: Separate virtio-pci layer out of virtio-net

2011-08-22 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/virtio/net.c |  344 ++--
 1 files changed, 100 insertions(+), 244 deletions(-)

diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index aa4536b..6247642 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -1,19 +1,19 @@
 #include "kvm/virtio-pci-dev.h"
 #include "kvm/virtio-net.h"
 #include "kvm/virtio.h"
-#include "kvm/ioport.h"
 #include "kvm/types.h"
 #include "kvm/mutex.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
-#include "kvm/pci.h"
 #include "kvm/irq.h"
 #include "kvm/uip.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -32,18 +32,10 @@
 #define VIRTIO_NET_RX_QUEUE 0
 #define VIRTIO_NET_TX_QUEUE 1
 
-static struct pci_device_header pci_header = {
-   .vendor_id  = PCI_VENDOR_ID_REDHAT_QUMRANET,
-   .device_id  = PCI_DEVICE_ID_VIRTIO_NET,
-   .header_type= PCI_HEADER_TYPE_NORMAL,
-   .revision_id= 0,
-   .class  = 0x02,
-   .subsys_vendor_id   = PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET,
-   .subsys_id  = VIRTIO_ID_NET,
-};
-
 struct net_dev;
 
+extern struct kvm *kvm;
+
 struct net_dev_operations {
int (*rx)(struct iovec *iov, u16 in, struct net_dev *ndev);
int (*tx)(struct iovec *iov, u16 in, struct net_dev *ndev);
@@ -51,21 +43,12 @@ struct net_dev_operations {
 
 struct net_dev {
pthread_mutex_t mutex;
+   struct virtio_pci   vpci;
 
struct virt_queue   vqs[VIRTIO_NET_NUM_QUEUES];
 struct virtio_net_config config;
-   u32 host_features;
-   u32 guest_features;
-   u16 config_vector;
-   u8  status;
-   u8  isr;
-   u16 queue_selector;
-   u16 base_addr;
-   u32 vq_vector[VIRTIO_NET_NUM_QUEUES];
-   u32 gsis[VIRTIO_NET_NUM_QUEUES];
-   u32 msix_io_block;
+   u32 features;
int compat_id;
-   boolmsix_enabled;
 
pthread_t   io_rx_thread;
pthread_mutex_t io_rx_lock;
@@ -90,14 +73,6 @@ static struct net_dev ndev = {
.config = {
.status = VIRTIO_NET_S_LINK_UP,
},
-   .host_features  = 1UL << VIRTIO_NET_F_MAC
-   | 1UL << VIRTIO_NET_F_CSUM
-   | 1UL << VIRTIO_NET_F_HOST_UFO
-   | 1UL << VIRTIO_NET_F_HOST_TSO4
-   | 1UL << VIRTIO_NET_F_HOST_TSO6
-   | 1UL << VIRTIO_NET_F_GUEST_UFO
-   | 1UL << VIRTIO_NET_F_GUEST_TSO4
-   | 1UL << VIRTIO_NET_F_GUEST_TSO6,
.info = {
.buf_nr = 20,
}
@@ -131,7 +106,7 @@ static void *virtio_net_rx_thread(void *p)
virt_queue__set_used_elem(vq, head, len);
 
 /* We should interrupt guest right now, otherwise latency is huge. */
-   kvm__irq_trigger(kvm, ndev.gsis[VIRTIO_NET_RX_QUEUE]);
+   virtio_pci__signal_vq(kvm, &ndev.vpci, VIRTIO_NET_RX_QUEUE);
}
 
}
@@ -168,7 +143,7 @@ static void *virtio_net_tx_thread(void *p)
virt_queue__set_used_elem(vq, head, len);
}
 
-   kvm__irq_trigger(kvm, ndev.gsis[VIRTIO_NET_TX_QUEUE]);
+   virtio_pci__signal_vq(kvm, &ndev.vpci, VIRTIO_NET_TX_QUEUE);
}
 
pthread_exit(NULL);
@@ -177,112 +152,6 @@ static void *virtio_net_tx_thread(void *p)
 
 }
 
-static bool virtio_net_pci_io_device_specific_out(struct kvm *kvm, void *data,
-   unsigned long offset, int size)
-{
-   u8 *config_space = (u8 *)&ndev.config;
-   int type;
-   u32 config_offset;
-
-   type = virtio__get_dev_specific_field(offset - 20, ndev.msix_enabled, 0, &config_offset);
-   if (type == VIRTIO_PCI_O_MSIX) {
-   if (offset == VIRTIO_MSI_CONFIG_VECTOR) {
-   ndev.config_vector  = ioport__read16(data);
-   } else {
-   u32 gsi;
-   u32 vec;
-
-   vec = ndev.vq_vector[ndev.queue_selector] = ioport__read16(data);

[PATCH 2/6] kvm tools: Separate virtio-pci layer out of virtio-rng

2011-08-22 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/virtio/rng.c |  254 ++--
 1 files changed, 74 insertions(+), 180 deletions(-)

diff --git a/tools/kvm/virtio/rng.c b/tools/kvm/virtio/rng.c
index 93258f6..5bea56c 100644
--- a/tools/kvm/virtio/rng.c
+++ b/tools/kvm/virtio/rng.c
@@ -2,16 +2,13 @@
 
 #include "kvm/virtio-pci-dev.h"
 
-#include "kvm/disk-image.h"
 #include "kvm/virtio.h"
-#include "kvm/ioport.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
-#include "kvm/pci.h"
 #include "kvm/threadpool.h"
-#include "kvm/irq.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
@@ -21,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NUM_VIRT_QUEUES 1
 #define VIRTIO_RNG_QUEUE_SIZE  128
@@ -32,68 +30,39 @@ struct rng_dev_job {
 };
 
 struct rng_dev {
-   struct pci_device_header pci_hdr;
 struct list_head list;
+   struct virtio_pci   vpci;
 
-   u16 base_addr;
-   u8  status;
-   u8  isr;
-   u16 config_vector;
int fd;
-   u32 vq_vector[NUM_VIRT_QUEUES];
-   u32 msix_io_block;
int compat_id;
 
/* virtio queue */
-   u16 queue_selector;
struct virt_queue   vqs[NUM_VIRT_QUEUES];
struct rng_dev_job  jobs[NUM_VIRT_QUEUES];
 };
 
 static LIST_HEAD(rdevs);
 
-static bool virtio_rng_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+static void set_config(struct kvm *kvm, void *dev, u8 data, u32 offset)
 {
-   unsigned long offset;
-   bool ret = true;
-   struct rng_dev *rdev;
+   /* Unused */
+}
 
-   rdev = ioport->priv;
-   offset = port - rdev->base_addr;
-
-   switch (offset) {
-   case VIRTIO_PCI_HOST_FEATURES:
-   case VIRTIO_PCI_GUEST_FEATURES:
-   case VIRTIO_PCI_QUEUE_SEL:
-   case VIRTIO_PCI_QUEUE_NOTIFY:
-   ret = false;
-   break;
-   case VIRTIO_PCI_QUEUE_PFN:
-   ioport__write32(data, rdev->vqs[rdev->queue_selector].pfn);
-   break;
-   case VIRTIO_PCI_QUEUE_NUM:
-   ioport__write16(data, VIRTIO_RNG_QUEUE_SIZE);
-   break;
-   case VIRTIO_PCI_STATUS:
-   ioport__write8(data, rdev->status);
-   break;
-   case VIRTIO_PCI_ISR:
-   ioport__write8(data, rdev->isr);
-   kvm__irq_line(kvm, rdev->pci_hdr.irq_line, VIRTIO_IRQ_LOW);
-   rdev->isr = VIRTIO_IRQ_LOW;
-   break;
-   case VIRTIO_MSI_CONFIG_VECTOR:
-   ioport__write16(data, rdev->config_vector);
-   break;
-   case VIRTIO_MSI_QUEUE_VECTOR:
-   ioport__write16(data, rdev->vq_vector[rdev->queue_selector]);
-   break;
-   default:
-   ret = false;
-   break;
-   };
+static u8 get_config(struct kvm *kvm, void *dev, u32 offset)
+{
+   /* Unused */
+   return 0;
+}
 
-   return ret;
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+   /* Unused */
+   return 0;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+   /* Unused */
 }
 
 static bool virtio_rng_do_io_request(struct kvm *kvm, struct rng_dev *rdev, struct virt_queue *queue)
@@ -119,171 +88,96 @@ static void virtio_rng_do_io(struct kvm *kvm, void *param)
while (virt_queue__available(vq))
virtio_rng_do_io_request(kvm, rdev, vq);
 
-   kvm__irq_line(kvm, rdev->pci_hdr.irq_line, VIRTIO_IRQ_HIGH);
+   virtio_pci__signal_vq(kvm, &rdev->vpci, vq - rdev->vqs);
 }
 
-static bool virtio_rng_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
 {
-   unsigned long offset;
-   bool ret = true;
-   struct rng_dev *rdev;
+   struct rng_dev *rdev = dev;
+   struct virt_queue *queue;
+   struct rng_dev_job *job;
+   void *p;
+   struct ioevent ioevent;
 
-   rdev = ioport->priv;
-   offset = port - rdev->base_addr;
+   compat__remove_message(rdev->compat_id);
 
-   switch (offset) {
-   case VIRTIO_PCI_GUEST_FEATURES:
-   break;
-   case VIRTIO_PCI_QUEUE_PFN: {
-   struct virt_queue *queue;
-   struct rng_dev_job *job;
-   void *p;
+   queue   = &rdev->vqs[vq];
+   queue->pfn  = pfn;
+   p   = guest_pfn_to_host(kvm, queue->pfn);
 
-   compat__remove_message(rdev->compat_id);
+   job = &rdev->jobs[vq];
 
-   queue   = &rdev->vqs[rdev->queue_selector];
-   qu

[PATCH 1/6] kvm tools: Separate virtio-pci layer

2011-08-22 Thread Sasha Levin
This patch builds a virtio-pci layer which virtio devices can use to
interact with virtio-pci. The purpose of the patch is to separate the
common virtio-pci code out instead of replicating it in every virtio
device.

The new layer provides a callback interface through which devices are
notified about virtio events.

This allows us to share the entire functionality of virtio-pci across
all virtio devices; for example, we don't need to implement MSI-X for
each device and can just do it once in virtio-pci.
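
As a sketch of how a device is expected to sit on top of this layer
(only struct virtio_pci_ops and virtio_pci__init() below come from this
patch; the hypothetical device struct, callback bodies and ID constants
are illustrative):

struct my_dev {
	struct virtio_pci	vpci;
	struct virt_queue	vqs[1];
};

static int my_notify_vq(struct kvm *kvm, void *dev, u32 vq)
{
	struct my_dev *mdev = dev;

	/* The guest kicked virtqueue 'vq': do the device's work here,
	 * then let the layer inject the interrupt (INTx or MSI-X). */
	return virtio_pci__signal_vq(kvm, &mdev->vpci, vq);
}

static int my_dev__init(struct kvm *kvm, struct my_dev *mdev)
{
	mdev->vpci.ops = (struct virtio_pci_ops) {
		.notify_vq	= my_notify_vq,
		/* .set_config, .get_config, .get_host_features,
		 * .set_guest_features, .init_vq, .get_pfn_vq and
		 * .get_size_vq are wired up the same way. */
	};

	return virtio_pci__init(kvm, &mdev->vpci, mdev,
				PCI_DEVICE_ID_VIRTIO_RNG, VIRTIO_ID_RNG);
}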

Signed-off-by: Sasha Levin 
---
 tools/kvm/Makefile |1 +
 tools/kvm/include/kvm/ioport.h |1 +
 tools/kvm/include/kvm/virtio-pci.h |   51 
 tools/kvm/virtio/pci.c |  231 
 4 files changed, 284 insertions(+), 0 deletions(-)
 create mode 100644 tools/kvm/include/kvm/virtio-pci.h
 create mode 100644 tools/kvm/virtio/pci.c

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index 316c2c9..669386f 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -52,6 +52,7 @@ OBJS  += virtio/core.o
 OBJS   += virtio/net.o
 OBJS   += virtio/rng.o
 OBJS+= virtio/balloon.o
+OBJS   += virtio/pci.o
 OBJS   += disk/blk.o
 OBJS   += disk/qcow.o
 OBJS   += disk/raw.o
diff --git a/tools/kvm/include/kvm/ioport.h b/tools/kvm/include/kvm/ioport.h
index 45c3856..5b857dd 100644
--- a/tools/kvm/include/kvm/ioport.h
+++ b/tools/kvm/include/kvm/ioport.h
@@ -4,6 +4,7 @@
 #include "kvm/rbtree-interval.h"
 
 #include 
+#include 
 #include 
 #include 
 
diff --git a/tools/kvm/include/kvm/virtio-pci.h b/tools/kvm/include/kvm/virtio-pci.h
new file mode 100644
index 000..4524a7f
--- /dev/null
+++ b/tools/kvm/include/kvm/virtio-pci.h
@@ -0,0 +1,51 @@
+#ifndef KVM__VIRTIO_PCI_H
+#define KVM__VIRTIO_PCI_H
+
+#include "kvm/pci.h"
+
+#include 
+
+#define VIRTIO_PCI_MAX_VQ 3
+
+struct kvm;
+
+struct virtio_pci_ops {
+   void (*set_config)(struct kvm *kvm, void *dev, u8 data, u32 offset);
+   u8 (*get_config)(struct kvm *kvm, void *dev, u32 offset);
+
+   u32 (*get_host_features)(struct kvm *kvm, void *dev);
+   void (*set_guest_features)(struct kvm *kvm, void *dev, u32 features);
+
+   int (*init_vq)(struct kvm *kvm, void *dev, u32 vq, u32 pfn);
+   int (*notify_vq)(struct kvm *kvm, void *dev, u32 vq);
+   int (*get_pfn_vq)(struct kvm *kvm, void *dev, u32 vq);
+   int (*get_size_vq)(struct kvm *kvm, void *dev, u32 vq);
+};
+
+struct virtio_pci {
+   struct pci_device_header pci_hdr;
+   struct virtio_pci_ops   ops;
+   void*dev;
+
+   u16 base_addr;
+   u8  status;
+   u8  isr;
+
+   /* MSI-X */
+   u16 config_vector;
+   u32 config_gsi;
+   u32 vq_vector[VIRTIO_PCI_MAX_VQ];
+   u32 gsis[VIRTIO_PCI_MAX_VQ];
+   u32 msix_io_block;
+   int msix_enabled;
+
+   /* virtio queue */
+   u16 queue_selector;
+};
+
+int virtio_pci__init(struct kvm *kvm, struct virtio_pci *vpci, void *dev,
+   int device_id, int subsys_id);
+int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_pci *vpci, u32 vq);
+int virtio_pci__signal_config(struct kvm *kvm, struct virtio_pci *vpci);
+
+#endif
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
new file mode 100644
index 000..6d086fa
--- /dev/null
+++ b/tools/kvm/virtio/pci.c
@@ -0,0 +1,231 @@
+#include "kvm/virtio-pci.h"
+
+#include "kvm/ioport.h"
+#include "kvm/kvm.h"
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/irq.h"
+#include "kvm/virtio.h"
+
+#include 
+#include 
+
+static bool virtio_pci__specific_io_in(struct kvm *kvm, struct virtio_pci *vpci, u16 port,
+   void *data, int size, int offset)
+{
+   u32 config_offset;
+   int type = virtio__get_dev_specific_field(offset - 20, vpci->msix_enabled,
+   0, &config_offset);
+   if (type == VIRTIO_PCI_O_MSIX) {
+   switch (offset) {
+   case VIRTIO_MSI_CONFIG_VECTOR:
+   ioport__write16(data, vpci->config_vector);
+   break;
+   case VIRTIO_MSI_QUEUE_VECTOR:
+   ioport__write16(data, vpci->vq_vector[vpci->queue_selector]);
+   break;
+   };
+
+   return true;
+   } else if (type == VIRTIO_PCI_O_CONFIG) {
+   u8 cfg;
+
+   cfg = vpci->ops.get_config(kvm, vpci->dev, config_offset);
+   ioport__write8(data, cfg);
+   return true;
+   }
+
+   return false;
+}
+
+static bool virtio_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+   unsigned long offset;
+   bool ret = true;
+   struct v

Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
> > On the AMD IOMMU side this information is stored in the IVRS ACPI table.
> > Not sure about the VT-d side, though.
> 
> I see.  There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
the AMD IOMMU driver in the past and I needed to fix that. There I know
that from :)

> I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 03:36 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
>  On 08/22/2011 01:46 PM, Joerg Roedel wrote:
>  >  That does not work. The bridge in question may not even be visible as a
>  >  PCI device, so you can't link to it. This is the case on a few PCIe
>  >  cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
>  >  the PCIe interface (yes, I have seen those cards).
>
>  How does the kernel detect that devices behind the invisible bridge must
>  be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.



I see.  There is no sysfs node representing it?

I'd rather not add another meaningless identifier.

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> > That does not work. The bridge in question may not even be visible as a
> > PCI device, so you can't link to it. This is the case on a few PCIe
> > cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> > the PCIe interface (yes, I have seen those cards).
> 
> How does the kernel detect that devices behind the invisible bridge must 
> be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: A non-responsive guest problem

2011-08-22 Thread Stefan Hajnoczi
On Mon, Aug 22, 2011 at 10:37 AM, Paul  wrote:
> Hi,
> I found that the guest OS clock had changed. For example, today was Aug
> 22, but the guest's time showed Mar 22 on the VNC desktop. The guest's
> clock source was kvm-clock. Could this be related to a kvm-clock bug?
> What if I changed the clocksource to tsc?

If the guest is using 100% CPU but the kernel is still responsive at
some level you can use SysRq to gather information:
http://en.wikipedia.org/wiki/Magic_SysRq_key

Especially Alt+SysRQ+t is interesting because it prints the current
tasks to the console.

If you are able to get interactive access again then top(1) would be
interesting.

Stefan


Re: DMI BIOS String

2011-08-22 Thread Andrew Cathrow


- Original Message -
> From: "Derek" 
> To: kvm@vger.kernel.org
> Sent: Sunday, August 21, 2011 11:52:19 PM
> Subject: DMI BIOS String
> Hi Folks,
> 
> I could not track down any solid info on modifying the DMI BIOS
> string.

qemu-kvm -help | grep bios

-smbios file=binary
-smbios type=0[,vendor=str][,version=str][,date=str][,release=%d.%d]
-smbios type=1[,manufacturer=str][,product=str][,version=str][,serial=str]

or if you're using libvirt
http://libvirt.org/formatdomain.html#elementsSysinfo


> 
> For example, in VirtualBox you can use 'VBoxManage setextradata' to
> set the BIOS product and vendor string per VM.
> 
> Any ideas if this is possible with KVM?
> 
> Thanks,
> Derek


Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 01:46 PM, Joerg Roedel wrote:

>  $ readlink /sys/devices/pci:00/:00:19.0/iommu_group
>  ../../../path/to/device/which/represents/the/resource/constraint
>
>  (the pci-to-pci bridge on x86, or whatever node represents partitionable
>  endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).



How does the kernel detect that devices behind the invisible bridge must 
be assigned as a unit?


--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread Joerg Roedel
On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci:00/:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci:00/:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).
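
For what it's worth, userspace consumption of the flat-number attribute
Alex proposed stays trivial either way. A sketch (path and attribute
name taken from his example; both are still under discussion here):

#include <stdio.h>

static int iommu_group_of(const char *pci_addr, long *group)
{
	char path[256];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", pci_addr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", group) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

Devices that report the same number would then have to be bound and
assigned as one unit.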

Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [Qemu-devel] [PATCH 0/2] Fix wide ioport access cracking

2011-08-22 Thread Avi Kivity

On 08/11/2011 10:01 PM, Gerhard Wiesinger wrote:

On Thu, 11 Aug 2011, Avi Kivity wrote:
Or maybe it's just -O2 screwing up debug information.  Please change ./configure to set -O1 and redo.


Please print *r.memory as well.


./configure --target-list=x86_64-softmmu,i386-softmmu --enable-debug
Rest below.



Please run again using

   gdb -x memory.gdb --args qemu-system-x86_64 ...

and the attached memory.gdb.  Please post the entire log generated.

--
error compiling committee.c: too many arguments to function

handle SIGUSR2 pass noprint
handle SIG38 pass noprint
def dump_mr
  set $mr = $arg0
  printf "%p/%p addr %x parent %p name %s\n", $mr, $mr.ops, $mr.addr, 
$mr.parent, $mr.name
end
break memory_region_add_subregion
commands 1
  dump_mr subregion
  cont
end
break memory_region_del_subregion
commands 2
  dump_mr subregion
  printf "parent %p\n", mr
  cont
end
break memory_region_destroy
commands 3
  dump_mr mr
  cont
end
break memory_region_init
commands 4
  cont
end
break memory_region_init_io
commands 5
  cont
end
break memory_region_init_ram_ptr
commands 6
  cont
end
break memory_region_init_alias
commands 7
  cont
end
run

Re: [PATCH 1/3] Avoid the use of deprecated gnutls gnutls_*_set_priority functions.

2011-08-22 Thread Gerd Hoffmann

On 08/22/11 10:26, Stefan Hajnoczi wrote:

On Mon, Jul 4, 2011 at 11:00 PM, Raghavendra D Prabhu
  wrote:

The gnutls_*_set_priority family of functions has been marked deprecated
in 2.12.x. These functions have been superseded by
gnutls_priority_set_direct().

Signed-off-by: Raghavendra D Prabhu
---
  ui/vnc-tls.c |   20 +---
  1 files changed, 1 insertions(+), 19 deletions(-)


Gerd,
You reported these F16 build failures.  I'm getting them now too and
Raghavendra's patch would be useful to fix them.


Indeed.  Patch looks good to me.

cheers,
  Gerd



Re: A non-responsive guest problem

2011-08-22 Thread Paul
Hi,

I found that the guest OS clock had changed. For example, today was Aug
22, but the guest's time showed Mar 22 on the VNC desktop. The guest's
clock source was kvm-clock. Could this be related to a kvm-clock bug?
What if I changed the clocksource to tsc?

Thanks,
Paul

On Fri, Aug 19, 2011 at 9:18 PM, Stefan Hajnoczi  wrote:
>
> On Thu, Aug 18, 2011 at 8:42 AM, Paul  wrote:
> > Today I saw the guest OS hang and become unresponsive. In the host, I
> > found the guest was running via virsh command. But I couldn't use ssh
> > to connect the guest, and even couldn't ping it. I could use VNC saw
> > the desktop of VNC, but I couldn't move the mouse pointer. In the
> > host, the qemu-kvm process occupied almost 100% CPU.
> >
> > The host was Redhat Enterprise Linux 6 64bit (not SP1).  The CPU was
> > Intel quad-core Q9550S. The guest was SUSE Linux Enterprise Server 11
> > SP1 64bit. The guest had been running for two weeks before it hung.
>
> Any interesting messages in dmesg on the host?
>
> Stefan


Re: DMI BIOS String

2011-08-22 Thread Daniel P. Berrange
On Mon, Aug 22, 2011 at 03:52:19PM +1200, Derek wrote:
> Hi Folks,
> 
> I could not track down any solid info on modifying the DMI BIOS string.
> 
> For example, in VirtualBox you can use 'VBoxManage setextradata' to
> set the BIOS product and vendor string per VM.
> 
> Any ideas if this is possible with KVM?

If using QEMU directly you can use '-smbios' args. eg

-smbios "type=0,vendor=LENOVO,version=6FET82WW (3.12 )"
-smbios "type=1,manufacturer=Fedora,product=Virt-Manager,version=0.8.2-3.fc14,serial=32dfcb37-5af1-552b-357c-be8c3aa38310,uuid=c7a5fdbd-edaf-9455-926a-d65c16db1809,sku=1234567890,family=Red Hat"

If using QEMU via libvirt you can use the following:

  http://libvirt.org/formatdomain.html#elementsSysinfo


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-08-22 Thread Avi Kivity

On 08/17/2011 07:19 AM, Liu, Jinsong wrote:

 From a9670ddff84080c56183e2d678189e100f891174 Mon Sep 17 00:00:00 2001
From: Liu, Jinsong
Date: Wed, 17 Aug 2011 11:36:28 +0800
Subject: [PATCH] KVM: emulate lapic tsc deadline timer for hvm


kvm doesn't have hvm.


This patch emulates the lapic tsc deadline timer for hvm:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by LAPIC MMIO;
Start tsc deadline timer by MSR;



diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 4258aac..28bcf48 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -120,6 +120,7 @@
  #define X86_FEATURE_X2APIC (4*32+21) /* x2APIC */
  #define X86_FEATURE_MOVBE (4*32+22) /* MOVBE instruction */
  #define X86_FEATURE_POPCNT  (4*32+23) /* POPCNT instruction */
+#define X86_FEATURE_TSC_DEADLINE_TIMER (4*32+24) /* Tsc deadline timer */
  #define X86_FEATURE_AES   (4*32+25) /* AES instructions */
  #define X86_FEATURE_XSAVE (4*32+26) /* XSAVE/XRSTOR/XSETBV/XGETBV */
  #define X86_FEATURE_OSXSAVE   (4*32+27) /* "" XSAVE enabled in the OS */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 307e3cf..28f7128 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -635,6 +635,7 @@ struct kvm_x86_ops {
int (*check_intercept)(struct kvm_vcpu *vcpu,
   struct x86_instruction_info *info,
   enum x86_intercept_stage stage);
+   u64 (*guest_to_host_tsc)(u64 guest_tsc);
  };


Please put this near the other tsc functions.  Add a comment explaining 
what value is returned under nested virtualization.


Please add the svm callback implementation.



--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -229,6 +229,8 @@
  #define MSR_IA32_APICBASE_ENABLE  (1<<11)
  #define MSR_IA32_APICBASE_BASE (0xf<<12)

+#define MSR_IA32_TSCDEADLINE   0x06e0
+
  #define MSR_IA32_UCODE_WRITE  0x0079
  #define MSR_IA32_UCODE_REV 0x008b
  


Need to add to msrs_to_save so live migration works.
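
Something like the following (a sketch; the neighbouring entries in
msrs_to_save vary by tree -- the point is only that the new MSR gets
listed):

static u32 msrs_to_save[] = {
	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
	MSR_STAR,
	MSR_IA32_TSCDEADLINE,	/* new: carried across live migration */
	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
};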



@@ -665,28 +682,30 @@ static void update_divide_count(struct kvm_lapic *apic)
  static void start_apic_timer(struct kvm_lapic *apic)
  {
ktime_t now = apic->lapic_timer.timer.base->get_time();
-
-   apic->lapic_timer.period = (u64)apic_get_reg(apic, APIC_TMICT) *
-   APIC_BUS_CYCLE_NS * apic->divide_count;
atomic_set(&apic->lapic_timer.pending, 0);

-   if (!apic->lapic_timer.period)
-   return;
-   /*
-* Do not allow the guest to program periodic timers with small
-* interval, since the hrtimers are not throttled by the host
-* scheduler.
-*/
-   if (apic_lvtt_period(apic)) {
-   if (apic->lapic_timer.period < NSEC_PER_MSEC/2)
-   apic->lapic_timer.period = NSEC_PER_MSEC/2;
-   }
+   /* lapic timer in oneshot or periodic mode */
+   if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) {
+   apic->lapic_timer.period = (u64)apic_get_reg(apic, APIC_TMICT)
+   * APIC_BUS_CYCLE_NS * apic->divide_count;

-   hrtimer_start(&apic->lapic_timer.timer,
- ktime_add_ns(now, apic->lapic_timer.period),
- HRTIMER_MODE_ABS);
+   if (!apic->lapic_timer.period)
+   return;
+   /*
+* Do not allow the guest to program periodic timers with small
+* interval, since the hrtimers are not throttled by the host
+* scheduler.
+*/
+   if (apic_lvtt_period(apic)) {
+   if (apic->lapic_timer.period < NSEC_PER_MSEC/2)
+   apic->lapic_timer.period = NSEC_PER_MSEC/2;
+   }

-   apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
+   hrtimer_start(&apic->lapic_timer.timer,
+ ktime_add_ns(now, apic->lapic_timer.period),
+ HRTIMER_MODE_ABS);
+
+   apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
   PRIx64 ", "
   "timer initial count 0x%x, period %lldns, "
   "expire @ 0x%016" PRIx64 ".\n", __func__,
@@ -695,6 +714,26 @@ static void start_apic_timer(struct kvm_lapic *apic)
   apic->lapic_timer.period,
   ktime_to_ns(ktime_add_ns(now,
apic->lapic_timer.period)));
+   }
+
+   /* lapic timer in tsc deadline mode */
+   if (apic_lvtt_tscdeadline(apic)) {


'else if' is slightly better, since it shows the reader the options are 
mutually exclusive.



+   u64 tsc_now, tsc_target, tsc_delta, nsec;
+
+ 

Re: [PATCH] virtio-net: Read MAC only after initializing MSI-X

2011-08-22 Thread Michael S. Tsirkin
On Mon, Aug 22, 2011 at 09:54:50AM +0930, Rusty Russell wrote:
> On Fri, 19 Aug 2011 18:23:35 +0300, "Michael S. Tsirkin"  
> wrote:
> > On Sat, Aug 13, 2011 at 11:51:01AM +0300, Sasha Levin wrote:
> > > The MAC of a virtio-net device is located at the first field of the device
> > > specific header. This header is located at offset 20 if the device doesn't
> > > support MSI-X or offset 24 if it does.
> > > 
> > > Current code in virtnet_probe() used to probe the MAC before checking for
> > > MSI-X, which means that the read was always made from offset 20 regardless
> > > of whether MSI-X in enabled or not.
> > > 
> > > This patch moves the MAC probe to after the detection of whether MSI-X is
> > > enabled. This way the MAC will be read from offset 24 if the device indeed
> > > supports MSI-X.
> > > 
> > > Cc: Rusty Russell 
> > > Cc: Michael S. Tsirkin 
> > > Cc: virtualizat...@lists.linux-foundation.org
> > > Cc: net...@vger.kernel.org
> > > Cc: kvm@vger.kernel.org
> > > Signed-off-by: Sasha Levin 
> > 
> > I am not sure I see a bug in virtio: the config pace layout simply
> > changes as msix is enabled and disabled (and if you look at the latest
> > draft, also on whether 64 bit features are enabled).
> > It doesn't depend on msix capability being present in device.
> > 
> > The spec seems to be explicit enough:
> > If MSI-X is enabled for the device, two additional fields immediately
> > follow this header.
> > 
> > So I'm guessing the bug is in kvm tools which assume
> > same layout for when msix is enabled and disabled.
> > qemu-kvm seems to do the right thing so the device
> > seems to get the correct mac.
> 
> So, the config space moves once MSI-X is enabled?  In which case, it
> should say "ONCE MSI-X is enabled..."
> 
> Thanks,
> Rusty.

Yes. Or maybe 'WHEN' - since if MSI-X is disabled again, it moves back.
Let's update the spec to make it clearer?
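
In code, the rule quoted above boils down to something like this (a
sketch only; the macro name is mine, the offsets are the ones under
discussion):

#include <stdint.h>

/* Common virtio PCI header: 20 bytes, plus the two 16-bit MSI-X
 * vector fields (config + queue) when MSI-X is *enabled* on the
 * device -- not merely present as a capability. */
#define VIRTIO_PCI_CONFIG_OFF(msix_enabled)	((msix_enabled) ? 24 : 20)

/* The MAC is the first field of virtio-net's device-specific header,
 * so its offset moves with the MSI-X enable state: */
static inline uint32_t virtio_net_mac_offset(int msix_enabled)
{
	return VIRTIO_PCI_CONFIG_OFF(msix_enabled);
}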

-- 
MST


Re: [PATCH 1/3] Avoid the use of deprecated gnutls gnutls_*_set_priority functions.

2011-08-22 Thread Stefan Hajnoczi
On Mon, Jul 4, 2011 at 11:00 PM, Raghavendra D Prabhu
 wrote:
> The gnutls_*_set_priority family of functions has been marked deprecated
> in 2.12.x. These functions have been superseded by
> gnutls_priority_set_direct().
>
> Signed-off-by: Raghavendra D Prabhu 
> ---
>  ui/vnc-tls.c |   20 +---
>  1 files changed, 1 insertions(+), 19 deletions(-)

Gerd,
You reported these F16 build failures.  I'm getting them now too and
Raghavendra's patch would be useful to fix them.

Stefan


Re: Host where KSM appears to save a negative amount of memory

2011-08-22 Thread Chris Webb
Hugh Dickins  writes:

> KSM chooses to show the numbers pages_shared and pages_sharing as
> exclusive counts: pages_sharing indicates the saving being made.  So it
> would be perfectly reasonable to add those two numbers together to get
> the "total" number of pages sharing, the number you expected it to show;
> but it doesn't make sense to subtract shared from sharing.

Hi. Many thanks for your helpful and detailed explanation. I've fixed our
monitoring to correctly use just pages_sharing to measure the savings. I
think I just assumed the meanings of pages_shared and pages_sharing from
their names. This means that ksm has been saving even more memory than we
thought on our hosts in the past!
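
For reference, a quick sketch of the corrected accounting against the
standard /sys/kernel/mm/ksm/ attributes (error handling kept minimal):

#include <stdio.h>
#include <unistd.h>

static long read_ksm(const char *name)
{
	char path[128];
	long val;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long shared = read_ksm("pages_shared");
	long sharing = read_ksm("pages_sharing");
	long page = sysconf(_SC_PAGESIZE);

	/* pages_sharing alone is the saving; shared + sharing is the
	 * total number of pages participating in sharing. */
	printf("saving: %ld KiB\n", sharing * page / 1024);
	printf("pages in sharing: %ld\n", shared + sharing);
	return 0;
}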

Best wishes,

Chris.


Re: [PATCH v2 00/24] Memory API batch 4: more conversions

2011-08-22 Thread Avi Kivity

On 08/15/2011 05:17 PM, Avi Kivity wrote:

This patchset converts a few more devices (and boards) to the memory API.
Nothing is really interesting except for the last patch, which brings
proper PAM support (and better SMRAM emulation).  It should also fix the
regressions that were reported in some non-default pc configurations.



Ping?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
