RE: Bug? 100% load on core after physically removing USB storage from host

2012-06-18 Thread Veruca Salt




> Date: Sat, 16 Jun 2012 13:19:53 +0800
> Subject: Re: Bug? 100% load on core after physically removing USB storage 
> from host
> From: centos.ad...@gmail.com
> To: verucasal...@hotmail.co.uk
> CC: stefa...@gmail.com; kvm@vger.kernel.org; kra...@redhat.com
>
> On 6/14/12, Veruca Salt  wrote:
> >> >>> qemu-kvm-0.12.1.2-2.209.el6_2.4.x86_64
> >
> > We had the same problem with 0.13
> > We were using it on Sandy Bridge motherboards when it happened. It was an
> > issue then, but we hanged to 1.0 a long time ago.
> > Why are you using 0.12 years after it was replaced?
>
> It's the default on the EL6 based distributions I think, and I don't
> really want to change from the defaults unless I really have and 0.12
> worked fine so far. This bug is a minor inconvenience but nothing
> show-stopping yet.
>
> Plus I'm not a full time server admin and diverging from the standard
> stack tends to backfire on me later especially if I overlook noting
> down some minor but crucial detail :D

Ah, I see. I don't know how to help; we are using custom hosts, so upgrading is 
the first thing we try rather than the last.
  --
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

2012-06-18 Thread YOSHIDA Masanori

Hi, Avi, Yanfei

I'm YOSHIDA Masanori from Hitachi, a developer of livedump.

>> And here is a new case from the LinuxCon Japan:
>>
>> Developers from Hitach are now developing a new livedump mechanism for the
>> same reason as ours. They have come to the situation *many times* that guest
>> machines crashed due to host's failures, in particular, under development.
>
> This has happened to me as well, possible even more times .  I don't
> use crash dumps for debugging but different people may use different
> techniques.

As Yanfei's introduction, I'm developing livedump for cases where
guests crash due to host's failures.

Especially in very important systems, it is strongly requested to
identify the root cause of any failure even if it is very rare.
For this purpose, crash dumps must be obtained.
Therefore, we think livedump technique must be applied to the
virtualization system on that kind of area.

>>> After the buggy situation is reproduced, we panic the host *manually*.
>>> Then we could use userland tools to get guest machine's crash dump from 
host machine's
>>> with the feature provided by this patch set. Finally we could analyse them 
separately

>>> to find which side causes the problem.
>>>
>>
>> Could you please tell me your attitude towards this patch?
>
> I still dislike it conceptually.  But let me do a technical review of
> the latest version.

Actually, current implementation of livedump is just a core part,
which dumps only the image of kernel space. And I'd like to expand
it to obtain guest image at the same time too.
Also for this situation, VMCSINFO seems necessary to be exported.

Thanks,
YOSHIDA Masanori

Yokohama Research Laboratory,
Hitachi, Ltd.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Rusty Russell
On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:
> This patch introduces bio-based IO path for virtio-blk.

Why make it optional?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Asias He

On 06/18/2012 03:46 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:

This patch introduces bio-based IO path for virtio-blk.


Why make it optional?


request-based IO path is useful for users who do not want to bypass the 
IO scheduler in guest kernel, e.g. users using spinning disk. For users 
using fast disk device, e.g. SSD device, they can use bio-based IO path.


--
Asias


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/16/2012 07:34 PM, Alex Williamson wrote:
> I'm looking for opinions on this approach.  For vfio device assignment
> we minimally need a way to get EOIs from the in-kernel irqchip out to
> userspace.  Getting that out via an eventfd would allow us to bounce
> all level interrupts out to userspace, where we would de-assert the
> device interrupt in qemu and unmask the physical device.  Ideally we
> could deassert the interrupt in KVM, which allows us to send the EOI
> directly to vfio.  To do that, we need to use a new IRQ source ID so
> the guest sees the logical OR of qemu requested state and external
> device state.  This allows external devices to share interrupts with
> emulated devices, just like KVM assignment.  That means the means we
> also need to use a new source ID when injecting the interrupt via
> irqfd.
> 
> Rather than creating a source ID allocation interface, extending irqfd
> to support a user supplied source ID, and creating another new
> interface to get the EOI out, I think it works out better to bundle
> these all together as just a level irqfd interface.  This way we don't
> allow users to create unbalanced states where a level interrupt is
> asserted, but has no way of being deasserted.  I believe the below
> does this, but needs testing and validation with an implementation in
> qemu.
> 

Missing documentation, which helps at least one reviewer review.  It's
not just for a commit.

>  
> +static void
> +irqfd_inject_level(struct work_struct *work)
> +{
> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> +
> + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> +}
> +
> +static void
> +irqfd_ack_level(struct kvm_irq_ack_notifier *notifier)
> +{
> + struct _irqfd *irqfd  = container_of(notifier, struct _irqfd, notifier);
> +
> + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 0);
> + eventfd_signal(irqfd->eoi_eventfd, 1);
> +}
> +

I don't understand how this works.  A level IRQ isn't de-asserted by the
EOI, it's de-asserted by its source.

Consider the following sequence:

deviceguest

  event
   assert
  interrupt
   interrupt handler
handle event
clear ISR bit
   deassert
  event
assert
   EOI

What should happen is that the interrupt will be redelivered
immmediately after the EOI, but that won't happen with your API since
the EOI ack notifier will deassert the interrupt and nothing will
re-assert it.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bug? 100% load on core after physically removing USB storage from host

2012-06-18 Thread Stefan Hajnoczi
On Sat, Jun 16, 2012 at 6:16 AM, Emmanuel Noobadmin
 wrote:
> On 6/13/12, Stefan Hajnoczi  wrote:
>>Since system time is a large chunk you could use strace -f -p $(pgrep
>>qemu-kvm) or other system call tracing tools to see what the qemu-kvm
>>process is doing.
>
> The command you gave didn't work so I replace $(pgrep) with PID of the
> process running the VM after checking that -p was the PID option.
>
> strace -f -p 19424 produces the following repeating lines
>
> [pid 19424] ioctl(0, USBDEVFS_REAPURBNDELAY, 0x7fff8fc43d48) = -1
> ENOTTY (Inappropriate ioctl for device)

This looks suspicious: QEMU is calling ioctl() on standard input (fd
0) with a USB operation code.  The USB layer probably set the
USBHostDevice->fd field to 0 or freed USBHostDevice but left the
async_complete() select(2) handler registered on file descriptor 16.

I believe the call is coming from hw/usb/host-linux.c:async_complete()
but am not using the same source tree as your qemu-kvm so I could be
off.  The code suggests that QEMU also logs an error message
("USBDEVFS_REAPURBNDELAY: Inappropriate ioctl for device") when this
happens.  If you want, check the libvirt log file for this guest - it
probably has tons of these messages in it.

Gerd: Does this give you enough information to recognize whether this
has already been fixed in qemu.git or how to fix it?

Emmanuel: You mentioned that upgrading to a newer version might not be
worth it, but if you're willing to test qemu.git/master just to see if
the problem has already been fixed then that would be helpful.
http://wiki.qemu.org/Download

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Biweekly KVM Test report, kernel 25e531a9... qemu 0a948cbb...

2012-06-18 Thread Ren, Yongjie
Hi All,

This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
master branch.
kvm.git next branch: 25e531a988ea5a64bd97a72dc9d3c65ad5850120 based on 
kernel 3.5.0-rc1
qemu-kvm.git master branch: 0a948cbb1835e5f36990e173966d30bc4c8cc038

We found 1 new bug and verified 1 fixed bug in the past two weeks. 

New issue (1):
1. network in bridge mode doesn't work in SMP Linux guest 
  https://bugs.launchpad.net/qemu/+bug/1013467
  -- UP Linux guest or SMP Win guest doesn't have this issue.

Fixed issue (1):
1.disk error when guest boot up via qcow2 image
  https://bugs.launchpad.net/qemu/+bug/1002121
  --Jan Kiszka's patch fixed this issue.

Old issues (3):
--
1. (Nested-virt)L1 (kvm on kvm)guest panic with parameter "-cpu host" in qemu 
command line.
  https://bugs.launchpad.net/qemu/+bug/994378
2. Can't install or boot up 32bit win8 guest.
  https://bugs.launchpad.net/qemu/+bug/1007269
3. VT-d/SR-IOV doesn't work in the guest
  https://bugzilla.kernel.org/show_bug.cgi?id=43328

Test environment:
==
  Platform   Westmere-EPSandybridge-EP
  CPU Cores   2432
  Memory size 24G   32G


Best Regards,
 Yongjie Ren  (Jay)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts

2012-06-18 Thread Ren, Yongjie
> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Monday, June 11, 2012 6:22 PM
> To: Avi Kivity
> Cc: Jan Kiszka; Thomas Gleixner; Alex Williamson; kvm@vger.kernel.org;
> mtosa...@redhat.com; linux-ker...@vger.kernel.org; Ren, Yongjie
> Subject: Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI
> interrupts
> 
> On Mon, Jun 11, 2012 at 01:01:41PM +0300, Avi Kivity wrote:
> > On 06/08/2012 05:50 PM, Jan Kiszka wrote:
> > >
> > >>
> > >> Pls correct me if I'm wrong.
> > >
> > > Well, IIRC, the "don't loop over all vcpus with IRQs or preemption
> > > disabled" was one argument against direct legacy interrupt injection as
> > > well. That's what I kept in mind from those discussions. Maybe Avi can
> > > comment on the current position.
> >
> > It's still my position.
> >
> > IMO we need something like struct gfn_to_hva_cache for interrupts.  If
> > it's in the cache, we fast-path it from the interrupt handler.  If not,
> > fall back to a workqueue and let it refill the cache.
> 
> And you class the irqfd behaviour of injecting multicast
> with interrupts disabled a bug then?
> 
Hi Avi & Michael,
Any more news on this issue ?

> > --
> > error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Jan Kiszka
On 2012-06-18 10:02, Avi Kivity wrote:
> On 06/16/2012 07:34 PM, Alex Williamson wrote:
>> I'm looking for opinions on this approach.  For vfio device assignment
>> we minimally need a way to get EOIs from the in-kernel irqchip out to
>> userspace.  Getting that out via an eventfd would allow us to bounce
>> all level interrupts out to userspace, where we would de-assert the
>> device interrupt in qemu and unmask the physical device.  Ideally we
>> could deassert the interrupt in KVM, which allows us to send the EOI
>> directly to vfio.  To do that, we need to use a new IRQ source ID so
>> the guest sees the logical OR of qemu requested state and external
>> device state.  This allows external devices to share interrupts with
>> emulated devices, just like KVM assignment.  That means the means we
>> also need to use a new source ID when injecting the interrupt via
>> irqfd.
>>
>> Rather than creating a source ID allocation interface, extending irqfd
>> to support a user supplied source ID, and creating another new
>> interface to get the EOI out, I think it works out better to bundle
>> these all together as just a level irqfd interface.  This way we don't
>> allow users to create unbalanced states where a level interrupt is
>> asserted, but has no way of being deasserted.  I believe the below
>> does this, but needs testing and validation with an implementation in
>> qemu.
>>
> 
> Missing documentation, which helps at least one reviewer review.  It's
> not just for a commit.
> 
>>  
>> +static void
>> +irqfd_inject_level(struct work_struct *work)
>> +{
>> +struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>> +
>> +kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
>> +}
>> +
>> +static void
>> +irqfd_ack_level(struct kvm_irq_ack_notifier *notifier)
>> +{
>> +struct _irqfd *irqfd  = container_of(notifier, struct _irqfd, notifier);
>> +
>> +kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 0);
>> +eventfd_signal(irqfd->eoi_eventfd, 1);
>> +}
>> +
> 
> I don't understand how this works.  A level IRQ isn't de-asserted by the
> EOI, it's de-asserted by its source.
> 
> Consider the following sequence:
> 
> deviceguest
> 
>   event
>assert
>   interrupt
>interrupt handler
> handle event
> clear ISR bit
>deassert
>   event
> assert
>EOI
> 
> What should happen is that the interrupt will be redelivered
> immmediately after the EOI, but that won't happen with your API since
> the EOI ack notifier will deassert the interrupt and nothing will
> re-assert it.

As it's level triggered and we unmask the physical source, another
host-side interrupt will be triggered and then reported to the guest.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] Improve virtio-blk performance

2012-06-18 Thread Stefan Hajnoczi
On Mon, Jun 18, 2012 at 7:53 AM, Asias He  wrote:
> Fio test shows it gives, 28%, 24%, 21%, 16% IOPS boost and 32%, 17%, 21%, 16%
> latency improvement for sequential read/write, random read/write respectively.

Sounds great.  What storage configuration did you use (single spinning
disk, SSD, storage array) and are these numbers for parallel I/O or
sequential I/O?

What changed since Minchan worked on this?  I remember he wasn't
satisfied that this was a clear win.  Your numbers are strong so
either you fixed something important or you are looking at different
benchmark configurations.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/18/2012 11:52 AM, Jan Kiszka wrote:
>> 
>> I don't understand how this works.  A level IRQ isn't de-asserted by the
>> EOI, it's de-asserted by its source.
>> 
>> Consider the following sequence:
>> 
>> deviceguest
>> 
>>   event
>>assert
>>   interrupt
>>interrupt handler
>> handle event
>> clear ISR bit
>>deassert
>>   event
>> assert
>>EOI
>> 
>> What should happen is that the interrupt will be redelivered
>> immmediately after the EOI, but that won't happen with your API since
>> the EOI ack notifier will deassert the interrupt and nothing will
>> re-assert it.
> 
> As it's level triggered and we unmask the physical source, another
> host-side interrupt will be triggered and then reported to the guest.

That works for real devices.  What about emulated devices (vhost,
msi-less ivshmem clone)?

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Stefan Hajnoczi
On Mon, Jun 18, 2012 at 7:53 AM, Asias He  wrote:
> +static void virtblk_add_buf_wait(struct virtio_blk *vblk,
> +                                struct virtblk_req *vbr,
> +                                unsigned long out,
> +                                unsigned long in)
> +{
> +       DEFINE_WAIT(wait);
> +
> +       for (;;) {
> +               prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
> +                                         TASK_UNINTERRUPTIBLE);
> +
> +               spin_lock_irq(vblk->disk->queue->queue_lock);
> +               if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> +                                     GFP_ATOMIC) < 0) {
> +                       spin_unlock_irq(vblk->disk->queue->queue_lock);
> +                       io_schedule();
> +               } else {
> +                       virtqueue_kick(vblk->vq);
> +                       spin_unlock_irq(vblk->disk->queue->queue_lock);
> +                       break;
> +               }
> +
> +       }

Is there a meaningful condition where we would cancel this request
(e.g. signal)?  If the vring fills up for some reason and we get stuck
here the user might wish to kill hung processes.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] Improve virtio-blk performance

2012-06-18 Thread Asias He

On 06/18/2012 05:14 PM, Stefan Hajnoczi wrote:

On Mon, Jun 18, 2012 at 7:53 AM, Asias He  wrote:

Fio test shows it gives, 28%, 24%, 21%, 16% IOPS boost and 32%, 17%, 21%, 16%
latency improvement for sequential read/write, random read/write respectively.


Sounds great.  What storage configuration did you use (single spinning
disk, SSD, storage array) and are these numbers for parallel I/O or
sequential I/O?


I used ramdisk as the backend storage.


What changed since Minchan worked on this?  I remember he wasn't
satisfied that this was a clear win.  Your numbers are strong so
either you fixed something important or you are looking at different
benchmark configurations.


I am using kvm tool instead qemu. He wasn't satisfied the poor 
sequential performance. I removed the plug and unplug operation and bio 
completion batching. You can grab Michan's patch and make a diff to see 
the details.


Here is the fio's config file.

[global]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
group_reporting
norandommap
ioscheduler=noop
thread
bs=512
size=4MB
direct=1
filename=/dev/vdb
numjobs=256
ioengine=aio
iodepth=64
loops=3

[seq-read]
stonewall
rw=read

[seq-write]
stonewall
rw=write

[rnd-read]
stonewall
rw=randread

[rnd-write]
stonewall
rw=randwrite



--
Asias


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-18 Thread Avi Kivity
On 06/16/2012 05:11 AM, Marcelo Tosatti wrote:
> 
> Can you introduce a separate exit reason, say KVM_EXIT_READ_FAULT, with
> information about the fault?

I think you mean WRITE_FAULT.  But what's wrong with the normal mmio exit?

> 
> Then perform this exit only if userspace allows it by explicit enable, 
> and by default have the exit_read_fault handler jump to the mmio
> handler. 


I don't get this.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-18 Thread Avi Kivity
On 06/12/2012 05:49 AM, Xiao Guangrong wrote:
> In current code, if we map a readonly memory space from host to guest
> and the page is not currently mapped in the host, we will get a fault-pfn
> and async is not allowed, then the vm will crash
> 
> Address Avi's idea, we introduce readonly memory region to map ROM/ROMD
> to the guest
> 
> index 4b96bc2..b551db1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -688,7 +688,7 @@ void update_memslots(struct kvm_memslots *slots, struct 
> kvm_memory_slot *new)
> 
>  static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
>  {
> - if (mem->flags & ~KVM_MEM_LOG_DIRTY_PAGES)
> + if (mem->flags & ~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))
>   return -EINVAL;

Only x86 supports readonly so far.

> 
> -static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
> -  gfn_t *nr_pages)
> +static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t 
> gfn,
> +  gfn_t *nr_pages, bool write)
>  {
> - if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> + if (!slot || slot->flags & KVM_MEMSLOT_INVALID ||
> +   ((slot->flags & KVM_MEM_READONLY) && write))
>   return bad_hva();
> 
>   if (nr_pages)
> @@ -1045,6 +1046,12 @@ static unsigned long gfn_to_hva_many(struct 
> kvm_memory_slot *slot, gfn_t gfn,
>   return gfn_to_hva_memslot(slot, gfn);
>  }
> 
> +static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
> +  gfn_t *nr_pages)
> +{
> + return __gfn_to_hva_many(slot, gfn, nr_pages, true);
> +}

We have dozens of translation functions: read/write guest virtual, guest
physical (nested and non nested), host virtual, host physical,
atomic/nonatomic, sync/async, with/without slot lookup, and probably a
few more I forgot.

I think we should refactor this into a series of on-step translations:

   /*
* Translate gva/len write access to a number of tlb entries
* (due to cross-page splits) or a fault
*/
   gva_to_tlb(gva, len, ACCESS_WRITE, &translation);
   /*
* Translate tlb entries to callbacks that do I/O (either directly
* or through KVM_EXIT_MMIO, provided there is no exception pending
*/
   tlb_to_io(&translation, &iolist, IO_ATOMIC);
   /*
* Initiate I/O (if no exception)
*/
   run_iolist(&iolist, data);

   struct gpa_scatterlist {
   unsigned nr_entries;
   struct {
   gpa_t gpa;
   unsigned len;
   } entry[2];
   struct x86_exception exception;
   };

   struct kvm_iolist {
   unsigned nr_entries;
   struct kvm_ioentry {
   struct kvm_memslot *slot;  /* NULL for mmio */
   struct something *kernel_iodevice;
   gfn_t page_in_slot;
   unsigned offset_in_page;
   unsigned len;
   void (*iofunc)(struct kvm_ioentry *entry, void *data);
   } entry[2];
   struct x86_exception execption;
   };

This is of course outside the scope of this patchset, just something to
think about (and write opinions on).



-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 12:33:01PM +0300, Avi Kivity wrote:
> On 06/18/2012 11:52 AM, Jan Kiszka wrote:
> >> 
> >> I don't understand how this works.  A level IRQ isn't de-asserted by the
> >> EOI, it's de-asserted by its source.
> >> 
> >> Consider the following sequence:
> >> 
> >> deviceguest
> >> 
> >>   event
> >>assert
> >>   interrupt
> >>interrupt handler
> >> handle event
> >> clear ISR bit
> >>deassert
> >>   event
> >> assert
> >>EOI
> >> 
> >> What should happen is that the interrupt will be redelivered
> >> immmediately after the EOI, but that won't happen with your API since
> >> the EOI ack notifier will deassert the interrupt and nothing will
> >> re-assert it.
> > 
> > As it's level triggered and we unmask the physical source, another
> > host-side interrupt will be triggered and then reported to the guest.
> 
> That works for real devices.  What about emulated devices

It's broken for userspace too.  I guess it
should track the logical state of the device per source id.  On ack, it
should clear it for assigned devices only, do the
logical OR over all source IDs and set level to that.

> (vhost,
> msi-less ivshmem clone)?

I guess vhost can poll eventfd and reinject an interrupt.
Of course to bypass qemu completely we also need to support reads over
ioeventfd somehow.


> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 04:03:23PM +0800, Asias He wrote:
> On 06/18/2012 03:46 PM, Rusty Russell wrote:
> >On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:
> >>This patch introduces bio-based IO path for virtio-blk.
> >
> >Why make it optional?
> 
> request-based IO path is useful for users who do not want to bypass
> the IO scheduler in guest kernel, e.g. users using spinning disk.
> For users using fast disk device, e.g. SSD device, they can use
> bio-based IO path.

OK I guess but then it should be per-device. There could be
a mix of slow and fast disks :)

> -- 
> Asias
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:

>> (vhost,
>> msi-less ivshmem clone)?
> 
> I guess vhost can poll eventfd and reinject an interrupt.
> Of course to bypass qemu completely we also need to support reads over
> ioeventfd somehow.
> 

eventfd is not suitable for level triggered interrupts as far as I can
tell.  It's about passing edges, and level interrupts aren't.  We need
something like a pipe for communicating state (or just use KVM_IRQ_LINE).


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 4/6] KVM: pass slot to hva_to_pfn

2012-06-18 Thread Avi Kivity
On 06/12/2012 05:48 AM, Xiao Guangrong wrote:
> This parameter will be used in the later patch
> 
> 
> - return hva_to_pfn(kvm, addr, atomic, async, write_fault, writable);
> + return hva_to_pfn(kvm, slot, addr, atomic, async, write_fault,
> +   writable);
>  }
> 
>  pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
> @@ -1205,7 +1210,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
>struct kvm_memory_slot *slot, gfn_t gfn)
>  {
>   unsigned long addr = gfn_to_hva_memslot(slot, gfn);
> - return hva_to_pfn(kvm, addr, false, NULL, true, NULL);
> + return hva_to_pfn(kvm, slot, addr, false, NULL, true, NULL);
>  }
> 
>  pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
> @@ -1213,7 +1218,7 @@ pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
>  {
>   unsigned long addr = gfn_to_hva_memslot(slot, gfn);
> 
> - return hva_to_pfn(kvm, addr, true, NULL, true, NULL);
> + return hva_to_pfn(kvm, slot, addr, true, NULL, true, NULL);
>  }

It was unreadable before, but now it's even more unreadable.



-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/6] KVM: introduce gfn_to_hva_read/kvm_read_hva/kvm_read_hva_atomic

2012-06-18 Thread Avi Kivity
On 06/12/2012 05:48 AM, Xiao Guangrong wrote:
> This set of functions is only used to read data from host space, read is
> a special case in the later patch
> 
> 
> +/*
> + * The hva returned by this function is only allowed to be read.
> + * It should pair with kvm_read_hva() or kvm_read_hva_atomic().
> + */
> +static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn)
> +{
> + return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
> +}
> +
> +static int kvm_read_hva(void *data, void *hva, int len)
> +{
> + return __copy_from_user(data, (void __user *)hva, len);
> +
> +}
> +
> +static int kvm_read_hva_atomic(void *data, void *hva, int len)
> +{
> + return __copy_from_user_inatomic(data, (void __user *)hva, len);
> +
> +}

Why cast to __user?  Make it __user in the first place.

Also these are just simple wrappers, why to them at all?



-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Rusty Russell
On Mon, 18 Jun 2012 16:03:23 +0800, Asias He  wrote:
> On 06/18/2012 03:46 PM, Rusty Russell wrote:
> > On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:
> >> This patch introduces bio-based IO path for virtio-blk.
> >
> > Why make it optional?
> 
> request-based IO path is useful for users who do not want to bypass the 
> IO scheduler in guest kernel, e.g. users using spinning disk. For users 
> using fast disk device, e.g. SSD device, they can use bio-based IO path.

Users using a spinning disk still get IO scheduling in the host though.
What benefit is there in doing it in the guest as well?

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 02:53:10PM +0800, Asias He wrote:
> +static void virtblk_make_request(struct request_queue *q, struct bio *bio)
> +{
> + struct virtio_blk *vblk = q->queuedata;
> + unsigned int num, out = 0, in = 0;
> + struct virtblk_req *vbr;
> +
> + BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
> + BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> +
> + vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> + if (!vbr) {
> + bio_endio(bio, -ENOMEM);
> + return;
> + }
> +
> + vbr->bio = bio;
> + vbr->req = NULL;
> + vbr->out_hdr.type = 0;
> + vbr->out_hdr.sector = bio->bi_sector;
> + vbr->out_hdr.ioprio = bio_prio(bio);
> +
> + sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
> +
> + num = blk_bio_map_sg(q, bio, vbr->sg + out);
> +
> + sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
> +sizeof(vbr->status));
> +
> + if (num) {
> + if (bio->bi_rw & REQ_WRITE) {
> + vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
> + out += num;
> + } else {
> + vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
> + in += num;
> + }
> + }
> +
> + spin_lock_irq(vblk->disk->queue->queue_lock);
> + if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> +   GFP_ATOMIC) < 0) {
> + spin_unlock_irq(vblk->disk->queue->queue_lock);

Any implications of dropping lock like that?
E.g. for suspend. like we are still discussing with
unlocked kick?

> + virtblk_add_buf_wait(vblk, vbr, out, in);
> + } else {
> + virtqueue_kick(vblk->vq);

Why special case the first call? task state manipulation so expensive?

> + spin_unlock_irq(vblk->disk->queue->queue_lock);
> + }
> +}
> +
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Stefan Hajnoczi
On Mon, Jun 18, 2012 at 11:21 AM, Michael S. Tsirkin  wrote:
> On Mon, Jun 18, 2012 at 02:53:10PM +0800, Asias He wrote:
>> +static void virtblk_make_request(struct request_queue *q, struct bio *bio)
>> +{
>> +     struct virtio_blk *vblk = q->queuedata;
>> +     unsigned int num, out = 0, in = 0;
>> +     struct virtblk_req *vbr;
>> +
>> +     BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
>> +     BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
>> +
>> +     vbr = virtblk_alloc_req(vblk, GFP_NOIO);
>> +     if (!vbr) {
>> +             bio_endio(bio, -ENOMEM);
>> +             return;
>> +     }
>> +
>> +     vbr->bio = bio;
>> +     vbr->req = NULL;
>> +     vbr->out_hdr.type = 0;
>> +     vbr->out_hdr.sector = bio->bi_sector;
>> +     vbr->out_hdr.ioprio = bio_prio(bio);
>> +
>> +     sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
>> +
>> +     num = blk_bio_map_sg(q, bio, vbr->sg + out);
>> +
>> +     sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
>> +                sizeof(vbr->status));
>> +
>> +     if (num) {
>> +             if (bio->bi_rw & REQ_WRITE) {
>> +                     vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
>> +                     out += num;
>> +             } else {
>> +                     vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
>> +                     in += num;
>> +             }
>> +     }
>> +
>> +     spin_lock_irq(vblk->disk->queue->queue_lock);
>> +     if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
>> +                           GFP_ATOMIC) < 0) {
>> +             spin_unlock_irq(vblk->disk->queue->queue_lock);
>
> Any implications of dropping lock like that?
> E.g. for suspend. like we are still discussing with
> unlocked kick?

Since we aquired the lock in this function there should be no problem.
 Whatever protects against vblk or vblk->disk disappearing upon
entering this function also protects after unlocking queue_lock.
Otherwise all .make_request_fn() functions would be broken.

I'd still like to understand the details though.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 01:14:13PM +0300, Avi Kivity wrote:
> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
> 
> >> (vhost,
> >> msi-less ivshmem clone)?
> > 
> > I guess vhost can poll eventfd and reinject an interrupt.
> > Of course to bypass qemu completely we also need to support reads over
> > ioeventfd somehow.
> > 
> 
> eventfd is not suitable for level triggered interrupts as far as I can
> tell.  It's about passing edges, and level interrupts aren't.  We need
> something like a pipe for communicating state (or just use KVM_IRQ_LINE).

Just using KVM_IRQ_LINE won't be enough.
There's no software event when device deasserts the
line, and no stadard way for drivers to do this.
So kvm needs to trigger on some event to poll the real interrupt
state, to check whether it's ok to clear for the guest,
and forward it to userspace.


> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] Improve virtio-blk performance

2012-06-18 Thread Stefan Hajnoczi
On Mon, Jun 18, 2012 at 10:39 AM, Asias He  wrote:
> On 06/18/2012 05:14 PM, Stefan Hajnoczi wrote:
>>
>> On Mon, Jun 18, 2012 at 7:53 AM, Asias He  wrote:
>>>
>>> Fio test shows it gives, 28%, 24%, 21%, 16% IOPS boost and 32%, 17%, 21%,
>>> 16%
>>> latency improvement for sequential read/write, random read/write
>>> respectively.
>>
>>
>> Sounds great.  What storage configuration did you use (single spinning
>> disk, SSD, storage array) and are these numbers for parallel I/O or
>> sequential I/O?
>
>
> I used ramdisk as the backend storage.

As long as the latency is decreasing that's good.  But It's worth
keeping in mind that these percentages are probably wildly different
on real storage devices and/or qemu-kvm.  What we don't know here is
whether this bottleneck matters in real environments - results with
real storage and with qemu-kvm would be interesting.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts

2012-06-18 Thread Avi Kivity
On 06/11/2012 01:21 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 11, 2012 at 01:01:41PM +0300, Avi Kivity wrote:
>> On 06/08/2012 05:50 PM, Jan Kiszka wrote:
>> > 
>> >> 
>> >> Pls correct me if I'm wrong.
>> > 
>> > Well, IIRC, the "don't loop over all vcpus with IRQs or preemption
>> > disabled" was one argument against direct legacy interrupt injection as
>> > well. That's what I kept in mind from those discussions. Maybe Avi can
>> > comment on the current position.
>> 
>> It's still my position.
>> 
>> IMO we need something like struct gfn_to_hva_cache for interrupts.  If
>> it's in the cache, we fast-path it from the interrupt handler.  If not,
>> fall back to a workqueue and let it refill the cache.
> 
> And you class the irqfd behaviour of injecting multicast
> with interrupts disabled a bug then?

Yes (a minor one).


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/18/2012 01:55 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 18, 2012 at 01:14:13PM +0300, Avi Kivity wrote:
>> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
>> 
>> >> (vhost,
>> >> msi-less ivshmem clone)?
>> > 
>> > I guess vhost can poll eventfd and reinject an interrupt.
>> > Of course to bypass qemu completely we also need to support reads over
>> > ioeventfd somehow.
>> > 
>> 
>> eventfd is not suitable for level triggered interrupts as far as I can
>> tell.  It's about passing edges, and level interrupts aren't.  We need
>> something like a pipe for communicating state (or just use KVM_IRQ_LINE).
> 
> Just using KVM_IRQ_LINE won't be enough.
> There's no software event when device deasserts the
> line, and no stadard way for drivers to do this.
> So kvm needs to trigger on some event to poll the real interrupt
> state, to check whether it's ok to clear for the guest,
> and forward it to userspace.

This in turn only applies to real devices.  Emulated devices of course
know the state of the emulated interrupt line.

Maybe we need different interfaces then.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Dor Laor

On 06/18/2012 01:05 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 16:03:23 +0800, Asias He  wrote:

On 06/18/2012 03:46 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:

This patch introduces bio-based IO path for virtio-blk.


Why make it optional?


request-based IO path is useful for users who do not want to bypass the
IO scheduler in guest kernel, e.g. users using spinning disk. For users
using fast disk device, e.g. SSD device, they can use bio-based IO path.


Users using a spinning disk still get IO scheduling in the host though.
What benefit is there in doing it in the guest as well?


The io scheduler waits for requests to merge and thus batch IOs 
together. It's not important w.r.t spinning disks since the host can do 
it but it causes much less vmexits which is the key issue for VMs.




Cheers,
Rusty.
___
Virtualization mailing list
virtualizat...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 02:03:40PM +0300, Avi Kivity wrote:
> On 06/18/2012 01:55 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 18, 2012 at 01:14:13PM +0300, Avi Kivity wrote:
> >> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
> >> 
> >> >> (vhost,
> >> >> msi-less ivshmem clone)?
> >> > 
> >> > I guess vhost can poll eventfd and reinject an interrupt.
> >> > Of course to bypass qemu completely we also need to support reads over
> >> > ioeventfd somehow.
> >> > 
> >> 
> >> eventfd is not suitable for level triggered interrupts as far as I can
> >> tell.  It's about passing edges, and level interrupts aren't.  We need
> >> something like a pipe for communicating state (or just use KVM_IRQ_LINE).
> > 
> > Just using KVM_IRQ_LINE won't be enough.
> > There's no software event when device deasserts the
> > line, and no stadard way for drivers to do this.
> > So kvm needs to trigger on some event to poll the real interrupt
> > state, to check whether it's ok to clear for the guest,
> > and forward it to userspace.
> 
> This in turn only applies to real devices.  Emulated devices of course
> know the state of the emulated interrupt line.

Or to be more exact, real devices know the state of the line too.
It's the hyprvisor that doesn't know the state, because
of the abovementioned lack of standards, and because
the state control can be mapped directly into guest.

For example the hypervisor knows about virtio ISR
register so there's no problem.

> Maybe we need different interfaces then.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 43501] all 32bit binaries produce "Illegal Instruction" after KVM migration from AMD -> Intel host

2012-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43501





--- Comment #5 from Avi Kivity   2012-06-18 11:22:21 ---
I'm not sure that the vendor string is passed correctly.

Please verify that when starting the guest on an Intel host, /proc/cpuinfo
shows AuthenticAMD for vendor_id.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Sasha Levin
On Mon, 2012-06-18 at 14:14 +0300, Dor Laor wrote:
> On 06/18/2012 01:05 PM, Rusty Russell wrote:
> > On Mon, 18 Jun 2012 16:03:23 +0800, Asias He  wrote:
> >> On 06/18/2012 03:46 PM, Rusty Russell wrote:
> >>> On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:
>  This patch introduces bio-based IO path for virtio-blk.
> >>>
> >>> Why make it optional?
> >>
> >> request-based IO path is useful for users who do not want to bypass the
> >> IO scheduler in guest kernel, e.g. users using spinning disk. For users
> >> using fast disk device, e.g. SSD device, they can use bio-based IO path.
> >
> > Users using a spinning disk still get IO scheduling in the host though.
> > What benefit is there in doing it in the guest as well?
> 
> The io scheduler waits for requests to merge and thus batch IOs 
> together. It's not important w.r.t spinning disks since the host can do 
> it but it causes much less vmexits which is the key issue for VMs. 

Is the amount of exits caused by virtio-blk significant at all with
EVENT_IDX?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-spec: flexible configuration layout

2012-06-18 Thread Michael S. Tsirkin
On Wed, Nov 09, 2011 at 02:36:43PM +0200, Pekka Enberg wrote:
> On Wed, 9 Nov 2011, Sasha Levin wrote:
> >They don't exist in kernel code either, for same reason as above.
> >
> >Nothing will break if we remove it since no one really used it, we were
> >probably the first and only implementation of the spec which considered
> >them :)
> 
> As long as we are able to run older versions of the KVM tool with
> newer kernels and vice versa, I see no reason why we can't drop
> 64-bit features from the KVM tool.
> 
>   Pekka

So what happened? Did you guys do this?  Need to know what to do to make
progress.  IIUC Rusty removed the new fields in 0.9.3.
Does your tool still use them? Did any version of the tool released by
distros do so?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] KVM: Introduce hva_to_gfn() for kvm_handle_hva()

2012-06-18 Thread Avi Kivity
On 06/15/2012 02:31 PM, Takuya Yoshikawa wrote:
> This restricts hva handling in mmu code and makes it easier to extend
> kvm_handle_hva() so that it can treat a range of addresses later in this
> patch series.
> 
> 
>  
>   kvm_for_each_memslot(memslot, slots) {
> - unsigned long start = memslot->userspace_addr;
> - unsigned long end;
> -
> - end = start + (memslot->npages << PAGE_SHIFT);
> - if (hva >= start && hva < end) {
> - gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
> - gfn_t gfn = memslot->base_gfn + gfn_offset;
> + gfn_t gfn = hva_to_gfn(hva, memslot);
>  
> + if (gfn >= memslot->base_gfn &&
> + gfn < memslot->base_gfn + memslot->npages) {

First you convert it, then you check if the conversion worked?  Let's
make is a straightforward check-then-convert (or check-and-convert).

>   ret = 0;
>  
>   for (j = PT_PAGE_TABLE_LEVEL;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27ac8a4..92b2029 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -740,6 +740,13 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
>   (base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
>  }
>  
> +static inline gfn_t hva_to_gfn(unsigned long hva, struct kvm_memory_slot 
> *slot)
> +{
> + gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
> +
> + return slot->base_gfn + gfn_offset;
> +}

Should be named hva_to_gfn_memslot(), like the below, to emphasise it
isn't generic.

> +
>  static inline unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot,
>  gfn_t gfn)
>  {
> 


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-spec: flexible configuration layout

2012-06-18 Thread Sasha Levin
On Mon, 2012-06-18 at 14:54 +0300, Michael S. Tsirkin wrote:
> On Wed, Nov 09, 2011 at 02:36:43PM +0200, Pekka Enberg wrote:
> > On Wed, 9 Nov 2011, Sasha Levin wrote:
> > >They don't exist in kernel code either, for same reason as above.
> > >
> > >Nothing will break if we remove it since no one really used it, we were
> > >probably the first and only implementation of the spec which considered
> > >them :)
> > 
> > As long as we are able to run older versions of the KVM tool with
> > newer kernels and vice versa, I see no reason why we can't drop
> > 64-bit features from the KVM tool.
> > 
> > Pekka
> 
> So what happened? Did you guys do this?  Need to know what to do to make
> progress.  IIUC Rusty removed the new fields in 0.9.3.
> Does your tool still use them? Did any version of the tool released by
> distros do so?

Yup, they were removed quite a while ago.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH RFC] virtio-pci: add MMIO property

2012-06-18 Thread Michael S. Tsirkin
On Tue, Mar 20, 2012 at 10:22:42AM +1030, Rusty Russell wrote:
> On Mon, 19 Mar 2012 17:13:06 -0500, Anthony Liguori  
> wrote:
> > > Maybe just make this a hidden option like x-miio?
> > 
> > x-violate-the-virtio-spec-to-trick-old-linux-drivers-into-working-on-power?
> 
> "To configure the device, we use the first I/O region of the PCI
> device."
> 
> Meh, it does sound a little like we are specifying that it's an PCI I/O
> bar.
> 
> Let's resurrect the PCI-v2 idea, which is ready to implement now, and a
> nice cleanup?  Detach it from the change-of-ring-format idea which is
> turning out to be a tarpit.
> 
> Thanks,
> Rusty.

Yes. But it seems silly to even write code to play with device config in
memory when we agreed the right thing to do is to use a config vq
everywhere.

Now a question: does a oconfig vq look like a PCI specific
feature to you, a work-around for lack of multibyte atomic
accesses? If yes it's sane to make it a PCI capability.
Or is it something most transports would need? If yes we
need a feature bit and this is a chicken and egg problem ...

> -- 
>   How could I marry someone with more hair than me?  http://baldalex.org
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-spec: flexible configuration layout

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 02:05:17PM +0200, Sasha Levin wrote:
> On Mon, 2012-06-18 at 14:54 +0300, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 02:36:43PM +0200, Pekka Enberg wrote:
> > > On Wed, 9 Nov 2011, Sasha Levin wrote:
> > > >They don't exist in kernel code either, for same reason as above.
> > > >
> > > >Nothing will break if we remove it since no one really used it, we were
> > > >probably the first and only implementation of the spec which considered
> > > >them :)
> > > 
> > > As long as we are able to run older versions of the KVM tool with
> > > newer kernels and vice versa, I see no reason why we can't drop
> > > 64-bit features from the KVM tool.
> > > 
> > >   Pekka
> > 
> > So what happened? Did you guys do this?  Need to know what to do to make
> > progress.  IIUC Rusty removed the new fields in 0.9.3.
> > Does your tool still use them? Did any version of the tool released by
> > distros do so?
> 
> Yup, they were removed quite a while ago.

Thanks, that's nice. So we can reuse that bit if we need it
and it won't break your code.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: MMU: Make kvm_handle_hva() handle range of addresses

2012-06-18 Thread Avi Kivity
On 06/15/2012 02:32 PM, Takuya Yoshikawa wrote:
> When guest's memory is backed by THP pages, MMU notifier needs to call
> kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
> invalidate a range of pages which constitute one huge page:
> 
>   for each guest page
> for each memslot
>   if page is in memslot
> unmap using rmap
> 
> This means although every page in that range is expected to be found in
> the same memslot, we are forced to check unrelated memslots many times.
> If the guest has more memslots, the situation will become worse.
> 
> This patch, together with the following patch, solves this problem by
> introducing kvm_handle_hva_range() which makes the loop look like this:
> 
>   for each memslot
> for each guest page in memslot
>   unmap using rmap
> 
> In this new processing, the actual work is converted to the loop over
> rmap array which is much more cache friendly than before.


Moreover, if the pages are in no slot (munmap of some non-guest memory),
then we're iterating over all those pages for no purpose.

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ba57b3b..3629f9b 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1185,10 +1185,13 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, 
> unsigned long *rmapp,
>   return 0;
>  }
>  
> -static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> -   unsigned long data,
> -   int (*handler)(struct kvm *kvm, unsigned long *rmapp,
> -  unsigned long data))
> +static int kvm_handle_hva_range(struct kvm *kvm,
> + unsigned long start_hva,
> + unsigned long end_hva,
> + unsigned long data,
> + int (*handler)(struct kvm *kvm,
> +unsigned long *rmapp,
> +unsigned long data))
>  {
>   int j;
>   int ret;
> @@ -1199,10 +1202,13 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned 
> long hva,
>   slots = kvm_memslots(kvm);
>  
>   kvm_for_each_memslot(memslot, slots) {
> - gfn_t gfn = hva_to_gfn(hva, memslot);
> + gfn_t gfn = hva_to_gfn(start_hva, memslot);
> + gfn_t end_gfn = hva_to_gfn(end_hva, memslot);

These will return random results which you then use in min/max later, no?

> +
> + gfn = max(gfn, memslot->base_gfn);
> + end_gfn = min(end_gfn, memslot->base_gfn + memslot->npages);
>  
> - if (gfn >= memslot->base_gfn &&
> - gfn < memslot->base_gfn + memslot->npages) {
> + for (; gfn < end_gfn; gfn++) {
>   ret = 0;
>  
>   for (j = PT_PAGE_TABLE_LEVEL;
> @@ -1212,7 +1218,9 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned 
> long hva,
>   rmapp = __gfn_to_rmap(gfn, j, memslot);
>   ret |= handler(kvm, rmapp, data);

Potential for improvement: don't do 512 iterations on same large page.

Something like

if ((gfn ^ prev_gfn) & mask(level))
ret |= handler(...)

with clever selection of the first prev_gfn so it always matches (~gfn
maybe).


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 43501] all 32bit binaries produce "Illegal Instruction" after KVM migration from AMD -> Intel host

2012-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43501





--- Comment #6 from Avi Kivity   2012-06-18 12:44:18 ---
(In reply to comment #4)
> (In reply to comment #1)
> > What host kernel are you using?
> 
> 2.6.32-220.17.1.el6.x86_64 (SL 6.2)

This bugzilla is for upstream kernels.  Use the vendor bugzilla for vendor
kernels.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 02/15] KVM: use KVM_CAP_IRQ_ROUTING to protect the routing related code

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:07 PM, Christoffer Dall wrote:
> From: Marc Zyngier 
> 
> The KVM code sometimes uses CONFIG_HAVE_KVM_IRQCHIP to protect
> code that is related to IRQ routing, which not all in-kernel
> irqchips may support.
> 
> Use KVM_CAP_IRQ_ROUTING instead.

Thanks, applied.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 03/15] KVM: Introduce __KVM_HAVE_IRQ_LINE

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:07 PM, Christoffer Dall wrote:
> This is a preparatory patch for the KVM/ARM implementation. KVM/ARM will use
> the KVM_IRQ_LINE ioctl, which is currently conditional on
> __KVM_HAVE_IOAPIC, but ARM obviously doesn't have any IOAPIC support and we
> need a separate define.
> 

Thanks, applied.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 04/15] KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:07 PM, Christoffer Dall wrote:
> From: Marc Zyngier 
> 
> In order to avoid compilation failure when KVM is not compiled in,
> guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
> and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
> the KVM code.
> 
>  
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>   struct mmu_notifier mmu_notifier;
>   unsigned long mmu_notifier_seq;
>   long mmu_notifier_count;
> @@ -780,7 +780,7 @@ struct kvm_stats_debugfs_item {
>  extern struct kvm_stats_debugfs_item debugfs_entries[];
>  extern struct dentry *kvm_debugfs_dir;
>  
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long 
> mmu_seq)
>  {

Why not have Kconfig select CONFIG_MMU_NOTIFIER?


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 06/15] ARM: KVM: Hypervisor identity mapping

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:07 PM, Christoffer Dall wrote:
> Adds support in the identity mapping feature that allows KVM to setup
> identity mapping for the Hyp mode with the AP[1] bit set as required by
> the specification and also supports freeing created sub pmd's after
> finished use.
> 
> These two functions:
>  - hyp_idmap_add(pgd, addr, end);
>  - hyp_idmap_del(pgd, addr, end);
> are essentially calls to the same function as the non-hyp versions but
> with a different argument value. KVM calls these functions to setup
> and teardown the identity mapping used to initialize the hypervisor.
> 
> Note, the hyp-version of the _del function actually frees the pmd's
> pointed to by the pgd as opposed to the non-hyp version which just
> clears them.


I asked previously what happens if two data structures share a page, and
one of them is removed.  Is that handled now?  How?

Why not just identity map all memory?  You can use large pages so it's
fast and doesn't consume a lot of page table memory.--
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: MMU: Make kvm_handle_hva() handle range of addresses

2012-06-18 Thread Takuya Yoshikawa
On Mon, 18 Jun 2012 15:11:42 +0300
Avi Kivity  wrote:


> > kvm_for_each_memslot(memslot, slots) {
> > -   gfn_t gfn = hva_to_gfn(hva, memslot);
> > +   gfn_t gfn = hva_to_gfn(start_hva, memslot);
> > +   gfn_t end_gfn = hva_to_gfn(end_hva, memslot);
> 
> These will return random results which you then use in min/max later, no?

Yes, I will follow your advice: check-then-convert (or check-and-convert).

> > @@ -1212,7 +1218,9 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned 
> > long hva,
> > rmapp = __gfn_to_rmap(gfn, j, memslot);
> > ret |= handler(kvm, rmapp, data);
> 
> Potential for improvement: don't do 512 iterations on same large page.
> 
> Something like
> 
> if ((gfn ^ prev_gfn) & mask(level))
> ret |= handler(...)
> 
> with clever selection of the first prev_gfn so it always matches (~gfn
> maybe).

Really nice.
I'm sure that will make this much faster!

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 14836] memory allocation bug. The used memory of host is not freed after close of virtual machine(s).

2012-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=14836


Alan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 CC||a...@lxorguk.ukuu.org.uk
 Resolution||OBSOLETE




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 14836] memory allocation bug. The used memory of host is not freed after close of virtual machine(s).

2012-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=14836


Alan  changed:

   What|Removed |Added

 Status|RESOLVED|CLOSED




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 10/15] ARM: KVM: Inject IRQs and FIQs from userspace

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:08 PM, Christoffer Dall wrote:
> From: Christoffer Dall 
> 
> Userspace can inject IRQs and FIQs through the KVM_IRQ_LINE VM ioctl.
> This ioctl is used since the sematics are in fact two lines that can be
> either raised or lowered on the VCPU - the IRQ and FIQ lines.
> 
> KVM needs to know which VCPU it must operate on and whether the FIQ or
> IRQ line is raised/lowered. Hence both pieces of information is packed
> in the kvm_irq_level->irq field. The irq fild value will be:
>   IRQ: vcpu_index << 1
>   FIQ: (vcpu_index << 1) | 1
> 
> This is documented in Documentation/kvm/api.txt.
> 
> The effect of the ioctl is simply to simply raise/lower the
> corresponding irq_line field on the VCPU struct, which will cause the
> world-switch code to raise/lower virtual interrupts when running the
> guest on next switch. The wait_for_interrupt flag is also cleared for
> raised IRQs or FIQs causing an idle VCPU to become active again. CPUs
> in guest mode are kicked to make sure they refresh their interrupt status.

>  
> +static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
> +   struct kvm_irq_level *irq_level)
> +{
> + int mask;
> + unsigned int vcpu_idx;
> + struct kvm_vcpu *vcpu;
> + unsigned long old, new, *ptr;
> +
> + vcpu_idx = irq_level->irq >> 1;
> + if (vcpu_idx >= KVM_MAX_VCPUS)
> + return -EINVAL;
> +
> + vcpu = kvm_get_vcpu(kvm, vcpu_idx);
> + if (!vcpu)
> + return -EINVAL;
> +
> + if ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE)
> + mask = HCR_VI;
> + else /* KVM_ARM_FIQ_LINE */
> + mask = HCR_VF;
> +
> + trace_kvm_set_irq(irq_level->irq, irq_level->level, 0);
> +
> + ptr = (unsigned long *)&vcpu->arch.irq_lines;
> + do {
> + old = ACCESS_ONCE(*ptr);
> + if (irq_level->level)
> + new = old | mask;
> + else
> + new = old & ~mask;
> +
> + if (new == old)
> + return 0; /* no change */
> + } while (cmpxchg(ptr, old, new) != old);

Isn't this a complicated


   if (level)
   set_bit()
   else
   clear_bit()

?

> +
> + /*
> +  * The vcpu irq_lines field was updated, wake up sleeping VCPUs and
> +  * trigger a world-switch round on the running physical CPU to set the
> +  * virtual IRQ/FIQ fields in the HCR appropriately.
> +  */
> + kvm_vcpu_kick(vcpu);

No need to wake when the line is asserted so you can make this
conditional on level.

> +
> + return 0;
> +}
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>unsigned int ioctl, unsigned long arg)
>  {
> @@ -298,7 +345,20 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
> kvm_dirty_log *log)
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> - return -EINVAL;
> + struct kvm *kvm = filp->private_data;
> + void __user *argp = (void __user *)arg;
> +
> + switch (ioctl) {
> + case KVM_IRQ_LINE: {
> + struct kvm_irq_level irq_event;
> +
> + if (copy_from_user(&irq_event, argp, sizeof irq_event))
> + return -EFAULT;
> + return kvm_arch_vm_ioctl_irq_line(kvm, &irq_event);
> + }
> + default:
> + return -EINVAL;
> + }
>  }

Should be in common code guarded by the define introduced previously.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 11/15] ARM: KVM: World-switch implementation

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:08 PM, Christoffer Dall wrote:
> Provides complete world-switch implementation to switch to other guests
> running in non-secure modes. Includes Hyp exception handlers that
> capture necessary exception information and stores the information on
> the VCPU and KVM structures.
> 
> Switching to Hyp mode is done through a simple HVC instructions. The
> exception vector code will check that the HVC comes from VMID==0 and if
> so will store the necessary state on the Hyp stack, which will look like
> this (growing downwards, see the hyp_hvc handler):
>   ...
>   Hyp_Sp + 4: spsr (Host-SVC cpsr)
>   Hyp_Sp: lr_usr
> 
> When returning from Hyp mode to SVC mode, another HVC instruction is
> executed from Hyp mode, which is taken in the Hyp_Svc handler. The Hyp
> stack pointer should be where it was left from the above initial call,
> since the values on the stack will be used to restore state (see
> hyp_svc).
> 
> Otherwise, the world-switch is pretty straight-forward. All state that
> can be modified by the guest is first backed up on the Hyp stack and the
> VCPU values is loaded onto the hardware. State, which is not loaded, but
> theoretically modifiable by the guest is protected through the
> virtualiation features to generate a trap and cause software emulation.
> Upon guest returns, all state is restored from hardware onto the VCPU
> struct and the original state is restored from the Hyp-stack onto the
> hardware.
> 
> One controversy may be the back-door call to __irq_svc (the host
> kernel's own physical IRQ handler) which is called when a physical IRQ
> exception is taken in Hyp mode while running in the guest.
> 
> SMP support using the VMPIDR calculated on the basis of the host MPIDR
> and overriding the low bits with KVM vcpu_id contributed by Marc Zyngier.
> 
> Reuse of VMIDs has been implemented by Antonios Motakis and adapated from
> a separate patch into the appropriate patches introducing the
> functionality. Note that the VMIDs are stored per VM as required by the ARM
> architecture reference manual.
> 
> +
> +/**
> + * update_vttbr - Update the VTTBR with a valid VMID before the guest runs
> + * @kvm  The guest that we are about to run
> + *
> + * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure 
> the
> + * VM has a valid VMID, otherwise assigns a new one and flushes corresponding
> + * caches and TLBs.
> + */
> +static void update_vttbr(struct kvm *kvm)
> +{
> + phys_addr_t pgd_phys;
> +
> + spin_lock(&kvm_vmid_lock);

Premature, but this is sad.  I suggest you split vmid generation from
next available vmid.  This allows you to make the generation counter
atomic so it may be read outside the lock.

You can do

if (likely(kvm->arch.vmd_gen) == atomic_read(&kvm_vmid_gen)))
   return;

spin_lock(...

> +
> + /*
> +  *  Check that the VMID is still valid.
> +  *  (The hardware supports only 256 values with the value zero
> +  *   reserved for the host, so we check if an assigned value has rolled
> +  *   over a sequence, which requires us to assign a new value and flush
> +  *   necessary caches and TLBs on all CPUs.)
> +  */
> + if (unlikely((kvm->arch.vmid ^ next_vmid) >> VMID_BITS)) {
> + /* Check for a new VMID generation */
> + if (unlikely((next_vmid & VMID_MASK) == 0)) {
> + /* Check for the (very unlikely) 64-bit wrap around */

Unlikely?  it's impossible.

> +
> +/**
> + * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
> + * @vcpu:The VCPU pointer
> + * @run: The kvm_run structure pointer used for userspace state exchange
> + *
> + * This function is called through the VCPU_RUN ioctl called from user 
> space. It
> + * will execute VM code in a loop until the time slice for the process is 
> used
> + * or some emulation is needed from user space in which case the function 
> will
> + * return with return value 0 and with the kvm_run structure filled in with 
> the
> + * required data for the requested emulation.
> + */
>  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> - return -EINVAL;
> + int ret = 0;
> + sigset_t sigsaved;
> +
> + if (vcpu->sigset_active)
> + sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
> +
> + run->exit_reason = KVM_EXIT_UNKNOWN;
> + while (run->exit_reason == KVM_EXIT_UNKNOWN) {

It's not a good idea to read stuff from run unless it's part of the ABI,
since userspace can play with it while you're reading it.  It's harmless
here but in other places it can lead to a vulnerability.

> + /*
> +  * Check conditions before entering the guest
> +  */
> + if (need_resched())
> + kvm_resched(vcpu);

I think cond_resched() is all that's needed these days (kvm_resched()
predates preempt notifiers).

> +
> + if (signal_pending(current)) {
> + 

Re: [PATCH v8 13/15] ARM: KVM: Handle guest faults in KVM

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:09 PM, Christoffer Dall wrote:
> From: Christoffer Dall 
> 
> Handles the guest faults in KVM by mapping in corresponding user pages
> in the 2nd stage page tables.
> 
> Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
> pgprot_guest variables used to map 2nd stage memory for KVM guests.
> 
> Leverages MMU notifiers on KVM/ARM by supporting the kvm_unmap_hva() 
> operation,
> where we remove the HVA from the 2nd stage translation. All other KVM MMU
> notifierhooks are NOPs.

I think you must at least support change_pte (possibly by unmapping).
Andrea?



-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 14/15] ARM: KVM: Handle I/O aborts

2012-06-18 Thread Avi Kivity
On 06/15/2012 10:09 PM, Christoffer Dall wrote:
> When the guest accesses I/O memory this will create data abort
> exceptions and they are handled by decoding the HSR information
> (physical address, read/write, length, register) and forwarding reads
> and writes to QEMU which performs the device emulation.
> 
> Certain classes of load/store operations do not support the syndrome
> information provided in the HSR and we therefore must be able to fetch
> the offending instruction from guest memory and decode it manually.
> 
> This requires changing the general flow somewhat since new calls to run
> the VCPU must check if there's a pending MMIO load and perform the write
> after userspace has made the data available.
> 
>  
>   memslot = gfn_to_memslot(vcpu->kvm, gfn);
> diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
> index e474a0a..bd3a6cc 100644
> --- a/arch/arm/kvm/trace.h
> +++ b/arch/arm/kvm/trace.h
> @@ -39,6 +39,27 @@ TRACE_EVENT(kvm_exit,
>   TP_printk("PC: 0x%08lx", __entry->vcpu_pc)
>  );
>  
> +TRACE_EVENT(kvm_mmio_emulate,
> + TP_PROTO(unsigned long vcpu_pc, unsigned long instr,
> +  unsigned long cpsr),
> + TP_ARGS(vcpu_pc, instr, cpsr),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long,  vcpu_pc )
> + __field(unsigned long,  instr   )
> + __field(unsigned long,  cpsr)
> + ),
> +
> + TP_fast_assign(
> + __entry->vcpu_pc= vcpu_pc;
> + __entry->vcpu_pc= instr;
> + __entry->vcpu_pc= cpsr;

-ECUTANDPASTE

> + ),
> +
> + TP_printk("Emulate MMIO at: 0x%08lx (instr: %08lx, cpsr: %08lx)",
> +   __entry->vcpu_pc, __entry->instr, __entry->cpsr)
> +);
> +


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 09:00 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 17, 2012 at 04:15:44PM -0600, Alex Williamson wrote:
> > On Sun, Jun 17, 2012 at 3:38 PM, Alex Williamson
> >  wrote:
> > > On Sun, 2012-06-17 at 21:44 +0300, Michael S. Tsirkin wrote:
> > >> On Sat, Jun 16, 2012 at 10:34:39AM -0600, Alex Williamson wrote:
> > >> > I'm looking for opinions on this approach.  For vfio device assignment
> > >> > we minimally need a way to get EOIs from the in-kernel irqchip out to
> > >> > userspace.  Getting that out via an eventfd would allow us to bounce
> > >> > all level interrupts out to userspace, where we would de-assert the
> > >> > device interrupt in qemu and unmask the physical device.  Ideally we
> > >> > could deassert the interrupt in KVM, which allows us to send the EOI
> > >> > directly to vfio.  To do that, we need to use a new IRQ source ID so
> > >> > the guest sees the logical OR of qemu requested state and external
> > >> > device state.
> > >>
> > >> Given that yopu want to involve userspace anyway, why insist on irqfd
> > >> for this?  You can simply use KVM_IRQ_LINE_STATUS from qemu, no?
> > >
> > > Well, actually I'd like to have a way to bypass userspace, which the
> > > combination of an irqfd + eventfd w/ deassert does.
> 
> 
> Hmm but above you say
>   > >> > Getting that out via an eventfd would allow us to bounce
>   > >> > all level interrupts out to userspace, where we would de-assert 
> the
>   > >> > device interrupt in qemu and unmask the physical device.
> so what is the plan?

Sorry if this wasn't clear, I was attempting to state the minimal
required support vs a more ideal scenario.  The patch here ignores the
minimal support and implements the higher performance route.

> >  I'm not quite sure
> > > I understand how KVM_IRQ_LINE_STATUS would work for this.  AIUI, that
> > > effectively gives us a way to post an interrupt AND let us know whether
> > > it was masked, coalesced, or delivered.  So I'd have to poll by posting
> > > a potentially spurious interrupt and if it was spurious unmask the
> > > physical device and wait for a real interrupt?  What am I missing,
> > > because that seems barely functional?  Thanks,
> > 
> > Just to clarify, setting the interrupt from qemu isn't a problem.  We
> > can do that just like any other device.  The unique aspect is that we
> > need to know when the guest has issued an EOI so that we can unmask
> > the physical device interrupt and wait for it to fire again.  This is
> > where I don't understand how KVM_IRQ_LINE_STATUS helps us.
> > The minimal support I mention above just requires informing userspace
> > about the EOI, then we can deassert and unmask from qemu.  That means
> > we issue two more ioctl before we're enabled for the next interrupt.
> 
> Exactly.
> 
> > Rather than invent a new interface for a sub-optimal implementation,
> > fixing irqfd to support level triggered interrupts is potentially more
> > useful and I think this implementation is not specific to device
> > assignment.  BTW, what happens with vhost use of irqfd when the guest
> > runs out of MSI vectors?  Could it use this interface for that?
> > Thanks,
> > 
> > Alex
> 
> 
> Sure. OTOH this never was a real issue - if it was
> we could teach Linux to share MSI interrupt.


Or optimize the route without changing the guest.  Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 08:51 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 17, 2012 at 03:38:57PM -0600, Alex Williamson wrote:
> > On Sun, 2012-06-17 at 21:44 +0300, Michael S. Tsirkin wrote:
> > > On Sat, Jun 16, 2012 at 10:34:39AM -0600, Alex Williamson wrote:
> > > > I'm looking for opinions on this approach.  For vfio device assignment
> > > > we minimally need a way to get EOIs from the in-kernel irqchip out to
> > > > userspace.  Getting that out via an eventfd would allow us to bounce
> > > > all level interrupts out to userspace, where we would de-assert the
> > > > device interrupt in qemu and unmask the physical device.  Ideally we
> > > > could deassert the interrupt in KVM, which allows us to send the EOI
> > > > directly to vfio.  To do that, we need to use a new IRQ source ID so
> > > > the guest sees the logical OR of qemu requested state and external
> > > > device state.
> > > 
> > > Given that yopu want to involve userspace anyway, why insist on irqfd
> > > for this?  You can simply use KVM_IRQ_LINE_STATUS from qemu, no?
> > 
> > Well, actually I'd like to have a way to bypass userspace, which the
> > combination of an irqfd + eventfd w/ deassert does.
> 
> Above you said "bounce all level interrupts out to userspace"?

And then go on to discuss what I'd rather implement instead.

> >  I'm not quite sure
> > I understand how KVM_IRQ_LINE_STATUS would work for this.  AIUI, that
> > effectively gives us a way to post an interrupt AND let us know whether
> > it was masked, coalesced, or delivered.  So I'd have to poll by posting
> > a potentially spurious interrupt and if it was spurious unmask the
> > physical device and wait for a real interrupt?  What am I missing,
> > because that seems barely functional?  Thanks,
> > 
> > Alex
> 
> Sorry if I did not make this clear.  You still need to add a way to pass
> EOI writes out to userspace. But that's all you need: once
> userspace gets EOI events for specific GSIs, it can control IRQ level from 
> userspace
> in the same way it does it for emulated devices:
> wake up on EOI, poll all devices, if no interrupt asserted -
> clear IRQ line and unmask IRQ in assigned device.

Right, this is all we need for a completely mediocre implementation.

> Above is with plain exit to userspace rather than eventfd,
> though eventfd is also possible but then you need to clear
> irq level in kernel and re-assert from userspace later -
> I think it's ok but need to think about possible races.

And this is more along the lines of what I'm attempting to do, using a
new irq source id so that the irqchip handles taking the OR of inputs.
We can always reassert the interrupt if the de-assert was unnecessary.
I thought it more beneficial to optimize for the interrupt being
handled.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 1/2] virtio-ring: Use threshold for switching to indirect descriptors

2012-06-18 Thread Sasha Levin
Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
descriptors even if we have plenty of space in the ring. This means that
we take a performance hit at all times due to the overhead of creating
indirect descriptors.

Instead, use it only after we're below a configurable offset.

Signed-off-by: Sasha Levin 
---
 drivers/block/virtio_blk.c  |4 
 drivers/char/hw_random/virtio-rng.c |4 
 drivers/char/virtio_console.c   |4 
 drivers/net/virtio_net.c|4 
 drivers/virtio/virtio_balloon.c |4 
 drivers/virtio/virtio_ring.c|   21 +++--
 include/linux/virtio.h  |1 +
 net/9p/trans_virtio.c   |4 
 8 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 693187d..a2c8d97 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -19,6 +19,9 @@ static DEFINE_IDA(vd_index_ida);
 
 struct workqueue_struct *virtblk_wq;
 
+static unsigned int indirect_thresh = 0;
+module_param(indirect_thresh, uint, S_IRUGO);
+
 struct virtio_blk
 {
spinlock_t lock;
@@ -438,6 +441,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
mutex_init(&vblk->config_lock);
INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
vblk->config_enable = true;
+   vdev->indirect_thresh = indirect_thresh;
 
err = init_vq(vblk);
if (err)
diff --git a/drivers/char/hw_random/virtio-rng.c 
b/drivers/char/hw_random/virtio-rng.c
index 723725b..bcaddb9 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -25,6 +25,9 @@
 #include 
 #include 
 
+static unsigned int indirect_thresh = 0;
+module_param(indirect_thresh, uint, S_IRUGO);
+
 static struct virtqueue *vq;
 static unsigned int data_avail;
 static DECLARE_COMPLETION(have_data);
@@ -90,6 +93,7 @@ static int virtrng_probe(struct virtio_device *vdev)
int err;
 
/* We expect a single virtqueue. */
+   vdev->indirect_thresh = indirect_thresh;
vq = virtio_find_single_vq(vdev, random_recv_done, "input");
if (IS_ERR(vq))
return PTR_ERR(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index cdf2f54..60397a4 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -37,6 +37,9 @@
 #include 
 #include "../tty/hvc/hvc_console.h"
 
+static unsigned int indirect_thresh = 0;
+module_param(indirect_thresh, uint, S_IRUGO);
+
 /*
  * This is a global struct for storing common data for all the devices
  * this driver handles.
@@ -1729,6 +1732,7 @@ static int __devinit virtcons_probe(struct virtio_device 
*vdev)
   max_nr_ports),
  &portdev->config.max_nr_ports) == 0)
multiport = true;
+   vdev->indirect_thresh = indirect_thresh;
 
err = init_vqs(portdev);
if (err < 0) {
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f18149a..5c1be92 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -34,6 +34,9 @@ static bool csum = true, gso = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
+static unsigned int indirect_thresh = 0;
+module_param(indirect_thresh, uint, S_IRUGO);
+
 /* FIXME: MTU in config. */
 #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN  128
@@ -1128,6 +1131,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
vi->mergeable_rx_bufs = true;
+   vdev->indirect_thresh = indirect_thresh;
 
err = init_vqs(vi);
if (err)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bfbc15c..4fc11ba 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -35,6 +35,9 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 
+static unsigned int indirect_thresh = 0;
+module_param(indirect_thresh, uint, S_IRUGO);
+
 struct virtio_balloon
 {
struct virtio_device *vdev;
@@ -360,6 +363,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
init_waitqueue_head(&vb->config_change);
vb->vdev = vdev;
vb->need_stats_update = 0;
+   vdev->indirect_thresh = indirect_thresh;
 
err = init_vqs(vb);
if (err)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..99a64a7 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -87,8 +87,11 @@ struct vring_virtqueue
/* Other side has made a mess, don't try any more. */
bool broken;
 
-   /* Host supports indirect buffers */
-   bool indirect;
+   /* 
+* Min. number of free space in the ring to trigger direct
+

[RFC 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-06-18 Thread Sasha Levin
Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will
use indirect descriptors and allocate them using a simple
kmalloc().

This patch adds a cache which will allow indirect buffers under
a configurable size to be allocated from that cache instead.

Signed-off-by: Sasha Levin 
---
 drivers/block/virtio_blk.c  |4 
 drivers/char/hw_random/virtio-rng.c |4 
 drivers/char/virtio_console.c   |4 
 drivers/net/virtio_net.c|4 
 drivers/virtio/virtio_balloon.c |4 
 drivers/virtio/virtio_ring.c|   28 
 include/linux/virtio.h  |1 +
 net/9p/trans_virtio.c   |5 +
 8 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index a2c8d97..3c6d1bd 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -22,6 +22,9 @@ struct workqueue_struct *virtblk_wq;
 static unsigned int indirect_thresh = 0;
 module_param(indirect_thresh, uint, S_IRUGO);
 
+static unsigned int indirect_alloc_thresh = 0;
+module_param(indirect_alloc_thresh, uint, S_IRUGO);
+
 struct virtio_blk
 {
spinlock_t lock;
@@ -442,6 +445,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
vblk->config_enable = true;
vdev->indirect_thresh = indirect_thresh;
+   vdev->indirect_alloc_thresh = indirect_alloc_thresh;
 
err = init_vq(vblk);
if (err)
diff --git a/drivers/char/hw_random/virtio-rng.c 
b/drivers/char/hw_random/virtio-rng.c
index bcaddb9..6a32f76 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -28,6 +28,9 @@
 static unsigned int indirect_thresh = 0;
 module_param(indirect_thresh, uint, S_IRUGO);
 
+static unsigned int indirect_alloc_thresh = 0;
+module_param(indirect_alloc_thresh, uint, S_IRUGO);
+
 static struct virtqueue *vq;
 static unsigned int data_avail;
 static DECLARE_COMPLETION(have_data);
@@ -94,6 +97,7 @@ static int virtrng_probe(struct virtio_device *vdev)
 
/* We expect a single virtqueue. */
vdev->indirect_thresh = indirect_thresh;
+   vdev->indirect_alloc_thresh = indirect_alloc_thresh;
vq = virtio_find_single_vq(vdev, random_recv_done, "input");
if (IS_ERR(vq))
return PTR_ERR(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 60397a4..7c33714 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -40,6 +40,9 @@
 static unsigned int indirect_thresh = 0;
 module_param(indirect_thresh, uint, S_IRUGO);
 
+static unsigned int indirect_alloc_thresh = 0;
+module_param(indirect_alloc_thresh, uint, S_IRUGO);
+
 /*
  * This is a global struct for storing common data for all the devices
  * this driver handles.
@@ -1733,6 +1736,7 @@ static int __devinit virtcons_probe(struct virtio_device 
*vdev)
  &portdev->config.max_nr_ports) == 0)
multiport = true;
vdev->indirect_thresh = indirect_thresh;
+   vdev->indirect_alloc_thresh = indirect_alloc_thresh;
 
err = init_vqs(portdev);
if (err < 0) {
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5c1be92..441bd2f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -37,6 +37,9 @@ module_param(gso, bool, 0444);
 static unsigned int indirect_thresh = 0;
 module_param(indirect_thresh, uint, S_IRUGO);
 
+static unsigned int indirect_alloc_thresh = 0;
+module_param(indirect_alloc_thresh, uint, S_IRUGO);
+
 /* FIXME: MTU in config. */
 #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN  128
@@ -1132,6 +1135,7 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
vi->mergeable_rx_bufs = true;
vdev->indirect_thresh = indirect_thresh;
+   vdev->indirect_alloc_thresh = indirect_alloc_thresh;
 
err = init_vqs(vi);
if (err)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4fc11ba..6e6313b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -38,6 +38,9 @@
 static unsigned int indirect_thresh = 0;
 module_param(indirect_thresh, uint, S_IRUGO);
 
+static unsigned int indirect_alloc_thresh = 0;
+module_param(indirect_alloc_thresh, uint, S_IRUGO);
+
 struct virtio_balloon
 {
struct virtio_device *vdev;
@@ -364,6 +367,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->vdev = vdev;
vb->need_stats_update = 0;
vdev->indirect_thresh = indirect_thresh;
+   vdev->indirect_alloc_thresh = indirect_alloc_thresh;
 
err = init_vqs(vb);
if (err)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 99a64a7..f0b0ce3 100644
--- a

Re: [PATCH 05/21] KVM: x86 emulator: allow loading null SS in long mode

2012-06-18 Thread Kevin Wolf
Am 12.06.2012 19:34, schrieb Avi Kivity:
> Null SS is valid in long mode; allow loading it.
> 
> Signed-off-by: Avi Kivity 

The documentation suggests that trying to load it in CPL 3 should still
fail in long mode.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 3/8] kvm_para: guest side for eoi avoidance

2012-06-18 Thread Avi Kivity
On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> The idea is simple: there's a bit, per APIC, in guest memory,
> that tells the guest that it does not need EOI.
> Guest tests it using a single est and clear operation - this is
> necessary so that host can detect interrupt nesting - and if set, it can
> skip the EOI MSR.
> 
> I run a simple microbenchmark to show exit reduction
> (note: for testing, need to apply follow-up patch
> 'kvm: host side for eoi optimization' + a qemu patch
>  I posted separately, on host):
> 
> 
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index a6983b2..47f9eff 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -28,11 +28,13 @@
>  #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
>  /* Technically wrong, but this avoids compilation errors on some gcc
> versions. */
> -#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
> +#define BITOP_ADDR_CONSTRAINT "=m"
>  #else
> -#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
> +#define BITOP_ADDR_CONSTRAINT "+m"
>  #endif
>  
> +#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
> +
>  #define ADDR BITOP_ADDR(addr)

What's this doing here?

>  
> +/* size alignment is implied but just to make it explicit. */
> +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) =
> + KVM_PV_EOI_DISABLED;

You're actually breaking the alignment.  ulong has 8 byte alignment
sometimes and you can make it cross cache boundary this way.

> +
>  void __cpuinit kvm_guest_cpu_init(void)
>  {
>   if (!kvm_para_available())
> @@ -300,11 +320,17 @@ void __cpuinit kvm_guest_cpu_init(void)
>  smp_processor_id());
>   }
>  
> + if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
> + __get_cpu_var(kvm_apic_eoi) = 0;
> + wrmsrl(MSR_KVM_PV_EOI_EN, __pa(&__get_cpu_var(kvm_apic_eoi)) |
> +KVM_MSR_ENABLED);

Bad formatting.

> + }
> +
>   if (has_steal_clock)
>   kvm_register_steal_time();
>  }
>  


Please check that the kexec path also disables pveoi.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 11:02 +0300, Avi Kivity wrote:
> On 06/16/2012 07:34 PM, Alex Williamson wrote:
> > I'm looking for opinions on this approach.  For vfio device assignment
> > we minimally need a way to get EOIs from the in-kernel irqchip out to
> > userspace.  Getting that out via an eventfd would allow us to bounce
> > all level interrupts out to userspace, where we would de-assert the
> > device interrupt in qemu and unmask the physical device.  Ideally we
> > could deassert the interrupt in KVM, which allows us to send the EOI
> > directly to vfio.  To do that, we need to use a new IRQ source ID so
> > the guest sees the logical OR of qemu requested state and external
> > device state.  This allows external devices to share interrupts with
> > emulated devices, just like KVM assignment.  That means the means we
> > also need to use a new source ID when injecting the interrupt via
> > irqfd.
> > 
> > Rather than creating a source ID allocation interface, extending irqfd
> > to support a user supplied source ID, and creating another new
> > interface to get the EOI out, I think it works out better to bundle
> > these all together as just a level irqfd interface.  This way we don't
> > allow users to create unbalanced states where a level interrupt is
> > asserted, but has no way of being deasserted.  I believe the below
> > does this, but needs testing and validation with an implementation in
> > qemu.
> > 
> 
> Missing documentation, which helps at least one reviewer review.  It's
> not just for a commit.

Yeah, sorry.  There's no irqfd documentation and I didn't a chance to
write it so I could add my blurb to it.
  
> > +static void
> > +irqfd_inject_level(struct work_struct *work)
> > +{
> > +   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > +
> > +   kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> > +}
> > +
> > +static void
> > +irqfd_ack_level(struct kvm_irq_ack_notifier *notifier)
> > +{
> > +   struct _irqfd *irqfd  = container_of(notifier, struct _irqfd, notifier);
> > +
> > +   kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 0);
> > +   eventfd_signal(irqfd->eoi_eventfd, 1);
> > +}
> > +
> 
> I don't understand how this works.  A level IRQ isn't de-asserted by the
> EOI, it's de-asserted by its source.
> 
> Consider the following sequence:
> 
> deviceguest
> 
>   event
>assert
>   interrupt
>interrupt handler
> handle event
> clear ISR bit
>deassert
>   event
> assert
>EOI
> 
> What should happen is that the interrupt will be redelivered
> immmediately after the EOI, but that won't happen with your API since
> the EOI ack notifier will deassert the interrupt and nothing will
> re-assert it.

A device that can de-assert it's own interrupt based on register/config
access probably isn't going to subscribe to this interface.  Such a
device already has much greater visibility of it's service needs.  In
the case where we have external devices, they'll reassert the interrupt,
which is part of this api that will need to be documented.

Comparing this to real hardware, an eoi causes the ioapic to reassert
the interrupt if any of it's inputs are still active.  Here we
preemptively de-assert our interrupt and require the user of the
interface to re-assert.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 5/8] kvm: eoi msi documentation

2012-06-18 Thread Avi Kivity
On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> Document the new EOI MSR. Couldn't decide whether this change belongs
> conceptually on guest or host side, so a separate patch.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  Documentation/virtual/kvm/msr.txt | 32 
>  1 file changed, 32 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/msr.txt 
> b/Documentation/virtual/kvm/msr.txt
> index 96b41bd..6ae5a85 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>   steal: the amount of time in which this vCPU did not run, in
>   nanoseconds. Time during which the vcpu is idle, will not be
>   reported as steal time.
> +
> +MSR_KVM_EOI_EN: 0x4b564d04
> + data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
> + when disabled.  When enabled, bits 63-1 hold 2-byte aligned physical 
> address
> + of a 2 byte memory area which must be in guest RAM and must be zeroed.

2 byte aligned means we must never access it on the host with a >2 byte
instruction, or we risk touching unmapped memory.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 13:11 +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 18, 2012 at 12:33:01PM +0300, Avi Kivity wrote:
> > On 06/18/2012 11:52 AM, Jan Kiszka wrote:
> > >> 
> > >> I don't understand how this works.  A level IRQ isn't de-asserted by the
> > >> EOI, it's de-asserted by its source.
> > >> 
> > >> Consider the following sequence:
> > >> 
> > >> deviceguest
> > >> 
> > >>   event
> > >>assert
> > >>   interrupt
> > >>interrupt handler
> > >> handle event
> > >> clear ISR bit
> > >>deassert
> > >>   event
> > >> assert
> > >>EOI
> > >> 
> > >> What should happen is that the interrupt will be redelivered
> > >> immmediately after the EOI, but that won't happen with your API since
> > >> the EOI ack notifier will deassert the interrupt and nothing will
> > >> re-assert it.
> > > 
> > > As it's level triggered and we unmask the physical source, another
> > > host-side interrupt will be triggered and then reported to the guest.
> > 
> > That works for real devices.  What about emulated devices
> 
> It's broken for userspace too.  I guess it
> should track the logical state of the device per source id.  On ack, it
> should clear it for assigned devices only, do the
> logical OR over all source IDs and set level to that.

That's basically what it's doing.  Maybe assigned devices are the only
user of this interface, but I'm trying not to make it device assignment
specific.  I'm grabbing a new source id for each user of this interface
specifically so the irqchip just sees it as one of many inputs.

> > (vhost,
> > msi-less ivshmem clone)?
> 
> I guess vhost can poll eventfd and reinject an interrupt.
> Of course to bypass qemu completely we also need to support reads over
> ioeventfd somehow.

If you're going to reinject an interrupt anyway, wouldn't you rather be
told when to do it via an eventfd like this instead of poll?  Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/21] KVM: x86 emulator: allow loading null SS in long mode

2012-06-18 Thread Avi Kivity
On 06/18/2012 05:14 PM, Kevin Wolf wrote:
> Am 12.06.2012 19:34, schrieb Avi Kivity:
>> Null SS is valid in long mode; allow loading it.
>> 
>> Signed-off-by: Avi Kivity 
> 
> The documentation suggests that trying to load it in CPL 3 should still
> fail in long mode.

Right, good catch.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 13:14 +0300, Avi Kivity wrote:
> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
> 
> >> (vhost,
> >> msi-less ivshmem clone)?
> > 
> > I guess vhost can poll eventfd and reinject an interrupt.
> > Of course to bypass qemu completely we also need to support reads over
> > ioeventfd somehow.
> > 
> 
> eventfd is not suitable for level triggered interrupts as far as I can
> tell.  It's about passing edges, and level interrupts aren't.  We need
> something like a pipe for communicating state (or just use KVM_IRQ_LINE).

Isn't level just two edges with a state in between?  Sure we may be
unnecessarily de-asserting for brief periods, but for an external
assigned device, we don't know whether it's unnecessary.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 14:17 +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 18, 2012 at 02:03:40PM +0300, Avi Kivity wrote:
> > On 06/18/2012 01:55 PM, Michael S. Tsirkin wrote:
> > > On Mon, Jun 18, 2012 at 01:14:13PM +0300, Avi Kivity wrote:
> > >> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
> > >> 
> > >> >> (vhost,
> > >> >> msi-less ivshmem clone)?
> > >> > 
> > >> > I guess vhost can poll eventfd and reinject an interrupt.
> > >> > Of course to bypass qemu completely we also need to support reads over
> > >> > ioeventfd somehow.
> > >> > 
> > >> 
> > >> eventfd is not suitable for level triggered interrupts as far as I can
> > >> tell.  It's about passing edges, and level interrupts aren't.  We need
> > >> something like a pipe for communicating state (or just use KVM_IRQ_LINE).
> > > 
> > > Just using KVM_IRQ_LINE won't be enough.
> > > There's no software event when device deasserts the
> > > line, and no stadard way for drivers to do this.
> > > So kvm needs to trigger on some event to poll the real interrupt
> > > state, to check whether it's ok to clear for the guest,
> > > and forward it to userspace.
> > 
> > This in turn only applies to real devices.  Emulated devices of course
> > know the state of the emulated interrupt line.
> 
> Or to be more exact, real devices know the state of the line too.
> It's the hyprvisor that doesn't know the state, because
> of the abovementioned lack of standards, and because
> the state control can be mapped directly into guest.

Yes, exactly.

> For example the hypervisor knows about virtio ISR
> register so there's no problem.
> 
> > Maybe we need different interfaces then.

This interface is perhaps not as useful for anyone else as I was hoping,
but is it the wrong interface?  It still seems like a logical extension
of irqfd, even if the use case is small.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/18/2012 05:27 PM, Alex Williamson wrote:
> On Mon, 2012-06-18 at 13:14 +0300, Avi Kivity wrote:
>> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
>> 
>> >> (vhost,
>> >> msi-less ivshmem clone)?
>> > 
>> > I guess vhost can poll eventfd and reinject an interrupt.
>> > Of course to bypass qemu completely we also need to support reads over
>> > ioeventfd somehow.
>> > 
>> 
>> eventfd is not suitable for level triggered interrupts as far as I can
>> tell.  It's about passing edges, and level interrupts aren't.  We need
>> something like a pipe for communicating state (or just use KVM_IRQ_LINE).
> 
> Isn't level just two edges with a state in between?  Sure we may be
> unnecessarily de-asserting for brief periods, but for an external
> assigned device, we don't know whether it's unnecessary.  Thanks,

For an assigned device you have someone maintaining the state and keep
pushing you edges as long as it's up.  So it's okay for you to keep
deasserting the interrupt as the device will fix it up for you.  For an
emulated device, you don't, so it isn't.

btw, technically we're in the wrong here.  An interrupt handler may read
the irr and see an incorrect value.  Of course no one does that so we
can get away with it.


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Avi Kivity
On 06/18/2012 05:18 PM, Alex Williamson wrote:
>> 
>> I don't understand how this works.  A level IRQ isn't de-asserted by the
>> EOI, it's de-asserted by its source.
>> 
>> Consider the following sequence:
>> 
>> deviceguest
>> 
>>   event
>>assert
>>   interrupt
>>interrupt handler
>> handle event
>> clear ISR bit
>>deassert
>>   event
>> assert
>>EOI
>> 
>> What should happen is that the interrupt will be redelivered
>> immmediately after the EOI, but that won't happen with your API since
>> the EOI ack notifier will deassert the interrupt and nothing will
>> re-assert it.
> 
> A device that can de-assert it's own interrupt based on register/config
> access probably isn't going to subscribe to this interface.  Such a
> device already has much greater visibility of it's service needs.  In
> the case where we have external devices, they'll reassert the interrupt,
> which is part of this api that will need to be documented.
> 
> Comparing this to real hardware, an eoi causes the ioapic to reassert
> the interrupt if any of it's inputs are still active.  Here we
> preemptively de-assert our interrupt and require the user of the
> interface to re-assert.  Thanks,

Okay.  I guess this limits it assigned devices (which is fine, it just
needs to be documented clearly).

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 3/8] kvm_para: guest side for eoi avoidance

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 05:17:24PM +0300, Avi Kivity wrote:
> On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> > The idea is simple: there's a bit, per APIC, in guest memory,
> > that tells the guest that it does not need EOI.
> > Guest tests it using a single est and clear operation - this is
> > necessary so that host can detect interrupt nesting - and if set, it can
> > skip the EOI MSR.
> > 
> > I run a simple microbenchmark to show exit reduction
> > (note: for testing, need to apply follow-up patch
> > 'kvm: host side for eoi optimization' + a qemu patch
> >  I posted separately, on host):
> > 
> > 
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index a6983b2..47f9eff 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -28,11 +28,13 @@
> >  #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
> >  /* Technically wrong, but this avoids compilation errors on some gcc
> > versions. */
> > -#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
> > +#define BITOP_ADDR_CONSTRAINT "=m"
> >  #else
> > -#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
> > +#define BITOP_ADDR_CONSTRAINT "+m"
> >  #endif
> >  
> > +#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
> > +
> >  #define ADDR   BITOP_ADDR(addr)
> 
> What's this doing here?

Ugh. More leftovers from when I had inline asm here.
Will remove.

> >  
> > +/* size alignment is implied but just to make it explicit. */
> > +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) =
> > +   KVM_PV_EOI_DISABLED;
> 
> You're actually breaking the alignment.  ulong has 8 byte alignment
> sometimes and you can make it cross cache boundary this way.

No, if you look at the definition of __aligned
you will see that it limits the alignment from below.
Compiler still applies the natural size alignment.
You are not the first to get confused. So I wonder: is it better
to add a comment or simply remove __aligned here.

> > +
> >  void __cpuinit kvm_guest_cpu_init(void)
> >  {
> > if (!kvm_para_available())
> > @@ -300,11 +320,17 @@ void __cpuinit kvm_guest_cpu_init(void)
> >smp_processor_id());
> > }
> >  
> > +   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
> > +   __get_cpu_var(kvm_apic_eoi) = 0;
> > +   wrmsrl(MSR_KVM_PV_EOI_EN, __pa(&__get_cpu_var(kvm_apic_eoi)) |
> > +  KVM_MSR_ENABLED);
> 
> Bad formatting.

I guess temporary will make it prettier.
unsigned long pa;
__get_cpu_var(kvm_apic_eoi) = 0;
pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);

or did I miss the point?

> > +   }
> > +
> > if (has_steal_clock)
> > kvm_register_steal_time();
> >  }
> >  
> 
> 
> Please check that the kexec path also disables pveoi.

The chunk in kvm_pv_guest_cpu_reboot does this, doesn't it?

> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 5/8] kvm: eoi msi documentation

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 05:20:01PM +0300, Avi Kivity wrote:
> On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> > Document the new EOI MSR. Couldn't decide whether this change belongs
> > conceptually on guest or host side, so a separate patch.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  Documentation/virtual/kvm/msr.txt | 32 
> >  1 file changed, 32 insertions(+)
> > 
> > diff --git a/Documentation/virtual/kvm/msr.txt 
> > b/Documentation/virtual/kvm/msr.txt
> > index 96b41bd..6ae5a85 100644
> > --- a/Documentation/virtual/kvm/msr.txt
> > +++ b/Documentation/virtual/kvm/msr.txt
> > @@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
> > steal: the amount of time in which this vCPU did not run, in
> > nanoseconds. Time during which the vcpu is idle, will not be
> > reported as steal time.
> > +
> > +MSR_KVM_EOI_EN: 0x4b564d04
> > +   data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
> > +   when disabled.  When enabled, bits 63-1 hold 2-byte aligned physical 
> > address
> > +   of a 2 byte memory area which must be in guest RAM and must be zeroed.
> 
> 2 byte aligned means we must never access it on the host with a >2 byte
> instruction, or we risk touching unmapped memory.

Yes. So ... that's correct for any length.
The patch actually accesses a single byte only.

Could you clarify what you are saying here please?

> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 5/8] kvm: eoi msi documentation

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 05:20:01PM +0300, Avi Kivity wrote:
> On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> > Document the new EOI MSR. Couldn't decide whether this change belongs
> > conceptually on guest or host side, so a separate patch.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  Documentation/virtual/kvm/msr.txt | 32 
> >  1 file changed, 32 insertions(+)
> > 
> > diff --git a/Documentation/virtual/kvm/msr.txt 
> > b/Documentation/virtual/kvm/msr.txt
> > index 96b41bd..6ae5a85 100644
> > --- a/Documentation/virtual/kvm/msr.txt
> > +++ b/Documentation/virtual/kvm/msr.txt
> > @@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
> > steal: the amount of time in which this vCPU did not run, in
> > nanoseconds. Time during which the vcpu is idle, will not be
> > reported as steal time.
> > +
> > +MSR_KVM_EOI_EN: 0x4b564d04
> > +   data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
> > +   when disabled.  When enabled, bits 63-1 hold 2-byte aligned physical 
> > address
> > +   of a 2 byte memory area which must be in guest RAM and must be zeroed.
> 
> 2 byte aligned means we must never access it on the host with a >2 byte
> instruction, or we risk touching unmapped memory.

Wait, maybe you mean the text above should go into documentation?

> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 3/8] kvm_para: guest side for eoi avoidance

2012-06-18 Thread Avi Kivity
On 06/18/2012 05:50 PM, Michael S. Tsirkin wrote:
> 
>> >  
>> > +/* size alignment is implied but just to make it explicit. */
>> > +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) =
>> > +  KVM_PV_EOI_DISABLED;
>> 
>> You're actually breaking the alignment.  ulong has 8 byte alignment
>> sometimes and you can make it cross cache boundary this way.
> 
> No, if you look at the definition of __aligned
> you will see that it limits the alignment from below.
> Compiler still applies the natural size alignment.
> You are not the first to get confused. So I wonder: is it better
> to add a comment or simply remove __aligned here.

Both.

>> >  
>> > +  if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
>> > +  __get_cpu_var(kvm_apic_eoi) = 0;
>> > +  wrmsrl(MSR_KVM_PV_EOI_EN, __pa(&__get_cpu_var(kvm_apic_eoi)) |
>> > + KVM_MSR_ENABLED);
>> 
>> Bad formatting.
> 
> I guess temporary will make it prettier.
>   unsigned long pa;
>   __get_cpu_var(kvm_apic_eoi) = 0;
>   pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
>   wrmsrl(MSR_KVM_PV_EOI_EN, pa);

That, or

+   wrmsrl(MSR_KVM_PV_EOI_EN,
+  __pa(&__get_cpu_var(kvm_apic_eoi)) | _ENABLED);

You have an argument split over two lines with no helpful indentation to
show this.

>> 
>> 
>> Please check that the kexec path also disables pveoi.
> 
> The chunk in kvm_pv_guest_cpu_reboot does this, doesn't it?

Dunno, does it?



-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 5/8] kvm: eoi msi documentation

2012-06-18 Thread Avi Kivity
On 06/18/2012 05:56 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 18, 2012 at 05:20:01PM +0300, Avi Kivity wrote:
>> On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
>> > Document the new EOI MSR. Couldn't decide whether this change belongs
>> > conceptually on guest or host side, so a separate patch.
>> > 
>> > Signed-off-by: Michael S. Tsirkin 
>> > ---
>> >  Documentation/virtual/kvm/msr.txt | 32 
>> >  1 file changed, 32 insertions(+)
>> > 
>> > diff --git a/Documentation/virtual/kvm/msr.txt 
>> > b/Documentation/virtual/kvm/msr.txt
>> > index 96b41bd..6ae5a85 100644
>> > --- a/Documentation/virtual/kvm/msr.txt
>> > +++ b/Documentation/virtual/kvm/msr.txt
>> > @@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>> >steal: the amount of time in which this vCPU did not run, in
>> >nanoseconds. Time during which the vcpu is idle, will not be
>> >reported as steal time.
>> > +
>> > +MSR_KVM_EOI_EN: 0x4b564d04
>> > +  data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
>> > +  when disabled.  When enabled, bits 63-1 hold 2-byte aligned physical 
>> > address
>> > +  of a 2 byte memory area which must be in guest RAM and must be zeroed.
>> 
>> 2 byte aligned means we must never access it on the host with a >2 byte
>> instruction, or we risk touching unmapped memory.
> 
> Yes. So ... that's correct for any length.
> The patch actually accesses a single byte only.
> 
> Could you clarify what you are saying here please?
> 

It means, if we want to use __set_bit() (and if __set_bit wants to
access this as a long) then we can get an exception.

The spec shouldn't limit us in this way, even if the implementation is okay.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call agenda for Tuesday, June 19th

2012-06-18 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Anthony suggested for last week:
- multithreading vhost (and general vhost improvements)

I suggest:
- status of migration: post-copy, IDL, XBRLE, huge memory, ...
  Will send an email with an status before tomorrow call.

Thanks, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 5/8] kvm: eoi msi documentation

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 06:03:42PM +0300, Avi Kivity wrote:
> On 06/18/2012 05:56 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 18, 2012 at 05:20:01PM +0300, Avi Kivity wrote:
> >> On 06/14/2012 04:53 PM, Michael S. Tsirkin wrote:
> >> > Document the new EOI MSR. Couldn't decide whether this change belongs
> >> > conceptually on guest or host side, so a separate patch.
> >> > 
> >> > Signed-off-by: Michael S. Tsirkin 
> >> > ---
> >> >  Documentation/virtual/kvm/msr.txt | 32 
> >> >  1 file changed, 32 insertions(+)
> >> > 
> >> > diff --git a/Documentation/virtual/kvm/msr.txt 
> >> > b/Documentation/virtual/kvm/msr.txt
> >> > index 96b41bd..6ae5a85 100644
> >> > --- a/Documentation/virtual/kvm/msr.txt
> >> > +++ b/Documentation/virtual/kvm/msr.txt
> >> > @@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
> >> >  steal: the amount of time in which this vCPU did not 
> >> > run, in
> >> >  nanoseconds. Time during which the vcpu is idle, will 
> >> > not be
> >> >  reported as steal time.
> >> > +
> >> > +MSR_KVM_EOI_EN: 0x4b564d04
> >> > +data: Bit 0 is 1 when PV end of interrupt is enabled on the 
> >> > vcpu; 0
> >> > +when disabled.  When enabled, bits 63-1 hold 2-byte aligned 
> >> > physical address
> >> > +of a 2 byte memory area which must be in guest RAM and must be 
> >> > zeroed.
> >> 
> >> 2 byte aligned means we must never access it on the host with a >2 byte
> >> instruction, or we risk touching unmapped memory.
> > 
> > Yes. So ... that's correct for any length.
> > The patch actually accesses a single byte only.
> > 
> > Could you clarify what you are saying here please?
> > 
> 
> It means, if we want to use __set_bit() (and if __set_bit wants to
> access this as a long) then we can get an exception.

We can't use __set_bit. It's in guest memory.

> The spec shouldn't limit us in this way, even if the implementation is okay.

This specific spec seems to create the smallest possible amount
of work for the hypervisor. You have an idea for an interface
that will make the hypervisor side even more compact?
Can you clarify?

> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git access via http to kvm-kmod

2012-06-18 Thread Juan Lorenzo del Castillo

Thanks, Jan

I changed the URL in .git/config and 'git submodule update --init' 
worked and populated the ./linux directory. Then, I ran


./configure
make sync
make

but I got a build error:

$ make
make -C /lib/modules/2.6.32-71.el6.x86_64/build M=`pwd` \
LINUXINCLUDE="-I`pwd`/include -Iinclude \
-Iinclude2 -I/lib/modules/2.6.32-71.el6.x86_64/source/include 
-I/lib/modules/2.6.32-71.el6.x86_64/source/arch/x86/include \

-Iarch/x86/include/generated \
-I`pwd`/include-compat -I`pwd`/x86 \
-include include/linux/autoconf.h \
-include `pwd`/x86/external-module-compat.h" \
"$@"
make[1]: Entering directory `/usr/src/kernels/2.6.32-71.el6.x86_64'
CC [M] /mnt/data/compilations/kvm-kmod/x86/svm.o
In file included from 
/mnt/data/compilations/kvm-kmod/x86/external-module-compat.h:26,

from :0:
/mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:936: error: 
redefinition of ‘PageTransCompound’
include/linux/huge_mm.h:108: note: previous definition of 
‘PageTransCompound’ was here
In file included from 
/mnt/data/compilations/kvm-kmod/include/asm/kvm_host.h:58,

from /mnt/data/compilations/kvm-kmod/include/linux/kvm_host.h:65,
from /mnt/data/compilations/kvm-kmod/x86/svm.c:51:
include/linux/perf_event.h:466: error: redefinition of ‘struct 
perf_guest_info_callbacks’
In file included from 
/mnt/data/compilations/kvm-kmod/include/asm/kvm_host.h:58,

from /mnt/data/compilations/kvm-kmod/include/linux/kvm_host.h:65,
from /mnt/data/compilations/kvm-kmod/x86/svm.c:51:
include/linux/perf_event.h:959: error: conflicting types for 
‘perf_register_guest_info_callbacks’
/mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:752: note: 
previous definition of ‘perf_register_guest_info_callbacks’ was here
include/linux/perf_event.h:960: error: conflicting types for 
‘perf_unregister_guest_info_callbacks’
/mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:758: note: 
previous definition of ‘perf_unregister_guest_info_callbacks’ was here

make[3]: *** [/mnt/data/compilations/kvm-kmod/x86/svm.o] Error 1
make[2]: *** [/mnt/data/compilations/kvm-kmod/x86] Error 2
make[1]: *** [_module_/mnt/data/compilations/kvm-kmod] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.32-71.el6.x86_64'


I am compiling against a RHEL6 2.6.32-71.el6.x86_64 kernel. I am not 
sure if this is feasible. I've read the section "building an external 
module with older kernels" in http://www.linux-kvm.org/page/Code, but I 
am not sure if that applies to my x86_64 architecture.


Thanks and regards,
Juan

On 06/15/2012 06:02 PM, Jan Kiszka wrote:

On 2012-06-15 17:54, Juan wrote:

Hi,

I cloned the KVM external module kit from http://git.kiszka.org/?p=kvm-kmod.git.
Following the README, when I try to build it by running 'git submodule update',
I get an error:

$ git submodule initSubmodule 'linux' (http://git.kiszka.org/kvm.git) registered
for path 'linux'
$ git submodule update
Cloning into 'linux'...
fatal: http://git.kiszka.org/kvm.git/info/refs not found: did you run git
update-server-info on the server?
Clone of 'http://git.kiszka.org/kvm.git' into submodule path 'linux' failed

The default submodule path assumes that kvm-kmod is still hosted on
kernel.org. I will fix this.

Meanwhile you need to set

[submodule "linux"]
 url = http://git.kernel.org/pub/scm/virt/kvm/kvm.git

in .git/config of your kvm-kmod clone.

Jan



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git access via http to kvm-kmod

2012-06-18 Thread Jan Kiszka
On 2012-06-18 18:01, Juan Lorenzo del Castillo wrote:
> Thanks, Jan
> 
> I changed the URL in .git/config and 'git submodule update --init' worked and 
> populated the ./linux directory. Then, I ran
> 
> ./configure
> make sync
> make
> 
> but I got a build error:
> 
> $ make
> make -C /lib/modules/2.6.32-71.el6.x86_64/build M=`pwd` \
> LINUXINCLUDE="-I`pwd`/include -Iinclude \
>  -Iinclude2 -I/lib/modules/2.6.32-71.el6.x86_64/source/include 
> -I/lib/modules/2.6.32-71.el6.x86_64/source/arch/x86/include \
> -Iarch/x86/include/generated \
> -I`pwd`/include-compat -I`pwd`/x86 \
> -include  include/linux/autoconf.h \
> -include `pwd`/x86/external-module-compat.h" \
> "$@"
> make[1]: Entering directory `/usr/src/kernels/2.6.32-71.el6.x86_64'
>   CC [M]  /mnt/data/compilations/kvm-kmod/x86/svm.o
> In file included from 
> /mnt/data/compilations/kvm-kmod/x86/external-module-compat.h:26,
>  from :0:
> /mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:936: 
> error: redefinition of ‘PageTransCompound’
> include/linux/huge_mm.h:108: note: previous definition of ‘PageTransCompound’ 
> was here
> In file included from 
> /mnt/data/compilations/kvm-kmod/include/asm/kvm_host.h:58,
>  from 
> /mnt/data/compilations/kvm-kmod/include/linux/kvm_host.h:65,
>  from /mnt/data/compilations/kvm-kmod/x86/svm.c:51:
> include/linux/perf_event.h:466: error: redefinition of ‘struct 
> perf_guest_info_callbacks’
> In file included from 
> /mnt/data/compilations/kvm-kmod/include/asm/kvm_host.h:58,
>  from 
> /mnt/data/compilations/kvm-kmod/include/linux/kvm_host.h:65,
>  from /mnt/data/compilations/kvm-kmod/x86/svm.c:51:
> include/linux/perf_event.h:959: error: conflicting types for 
> ‘perf_register_guest_info_callbacks’
> /mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:752: 
> note: previous definition of ‘perf_register_guest_info_callbacks’ was here
> include/linux/perf_event.h:960: error: conflicting types for 
> ‘perf_unregister_guest_info_callbacks’
> /mnt/data/compilations/kvm-kmod/x86/../external-module-compat-comm.h:758: 
> note: previous definition of ‘perf_unregister_guest_info_callbacks’ was here
> make[3]: *** [/mnt/data/compilations/kvm-kmod/x86/svm.o] Error 1
> make[2]: *** [/mnt/data/compilations/kvm-kmod/x86] Error 2
> make[1]: *** [_module_/mnt/data/compilations/kvm-kmod] Error 2
> make[1]: Leaving directory `/usr/src/kernels/2.6.32-71.el6.x86_64'
> 
> 
> I am compiling against a RHEL6 2.6.32-71.el6.x86_64 kernel. I am not sure if 
> this is feasible. I've read the section "building an external module with 
> older kernels" in http://www.linux-kvm.org/page/Code, but I am not sure if 
> that applies to my x86_64 architecture.

RHEL kernels differ heavily from the Linux upstream kernels that have
the same version numbers. kvm-kmod likely assumes that some interfaces
are missing, but the RHEL kernel actually have them.

You will have to identify these difference and adjust the wrappings of
kvm-kmod accordingly. I would accept patches.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] kvm: Extend irqfd to support level interrupts

2012-06-18 Thread Alex Williamson
On Mon, 2012-06-18 at 17:33 +0300, Avi Kivity wrote:
> On 06/18/2012 05:27 PM, Alex Williamson wrote:
> > On Mon, 2012-06-18 at 13:14 +0300, Avi Kivity wrote:
> >> On 06/18/2012 01:11 PM, Michael S. Tsirkin wrote:
> >> 
> >> >> (vhost,
> >> >> msi-less ivshmem clone)?
> >> > 
> >> > I guess vhost can poll eventfd and reinject an interrupt.
> >> > Of course to bypass qemu completely we also need to support reads over
> >> > ioeventfd somehow.
> >> > 
> >> 
> >> eventfd is not suitable for level triggered interrupts as far as I can
> >> tell.  It's about passing edges, and level interrupts aren't.  We need
> >> something like a pipe for communicating state (or just use KVM_IRQ_LINE).
> > 
> > Isn't level just two edges with a state in between?  Sure we may be
> > unnecessarily de-asserting for brief periods, but for an external
> > assigned device, we don't know whether it's unnecessary.  Thanks,
> 
> For an assigned device you have someone maintaining the state and keep
> pushing you edges as long as it's up.  So it's okay for you to keep
> deasserting the interrupt as the device will fix it up for you.  For an
> emulated device, you don't, so it isn't.

An emulated device doesn't know enough about it's own state to figure
out whether it should re-assert an interrupt?  Somehow I'm doubtful of
that.  What would be on the other end of that pipe if an emulated device
doesn't know when it's interrupting and when it's done?  Anyway, they're
probably not using an IRQFD to start with.

What I'm proposing emulates level signaling using something like this:

: "Hey, push the button (assert interrupt)!"
: "Bob (the guest) told me to let go of the button (deassert interrupt), 
let me know if I should push it again"

Another valid emulation of that is:

: "Hey, push the button!"
: "Let go of the button"

Nobody currently needs the latter, but the latter doesn't replace the
former for device assignment.  I can rename variables and flags to make
it more compatible with some future implementation of the latter if
that's helpful.

> btw, technically we're in the wrong here.  An interrupt handler may read
> the irr and see an incorrect value.  Of course no one does that so we
> can get away with it.

Hmm, you're speaking of existing irr emulation?  I don't see how this
makes anything less accurate or prevents the irr from being emulated
accurately through this patch.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7 3/8] kvm_para: guest side for eoi avoidance

2012-06-18 Thread Michael S. Tsirkin
On Mon, Jun 18, 2012 at 06:01:59PM +0300, Avi Kivity wrote:
> On 06/18/2012 05:50 PM, Michael S. Tsirkin wrote:
> > 
> >> >  
> >> > +/* size alignment is implied but just to make it explicit. */
> >> > +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) =
> >> > +KVM_PV_EOI_DISABLED;
> >> 
> >> You're actually breaking the alignment.  ulong has 8 byte alignment
> >> sometimes and you can make it cross cache boundary this way.
> > 
> > No, if you look at the definition of __aligned
> > you will see that it limits the alignment from below.
> > Compiler still applies the natural size alignment.
> > You are not the first to get confused. So I wonder: is it better
> > to add a comment or simply remove __aligned here.
> 
> Both.

Will do.

> >> >  
> >> > +if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
> >> > +__get_cpu_var(kvm_apic_eoi) = 0;
> >> > +wrmsrl(MSR_KVM_PV_EOI_EN, 
> >> > __pa(&__get_cpu_var(kvm_apic_eoi)) |
> >> > +   KVM_MSR_ENABLED);
> >> 
> >> Bad formatting.
> > 
> > I guess temporary will make it prettier.
> > unsigned long pa;
> > __get_cpu_var(kvm_apic_eoi) = 0;
> > pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
> > wrmsrl(MSR_KVM_PV_EOI_EN, pa);
> 
> That, or
> 
> + wrmsrl(MSR_KVM_PV_EOI_EN,
> +  __pa(&__get_cpu_var(kvm_apic_eoi)) | _ENABLED);
> 
> You have an argument split over two lines with no helpful indentation to
> show this.
> 
> >> 
> >> 
> >> Please check that the kexec path also disables pveoi.
> > 
> > The chunk in kvm_pv_guest_cpu_reboot does this, doesn't it?
> 
> Dunno, does it?

I thought it absolutely does but now I noticed this:

Without CONFIG_KEXEC_JUMP, it calls kernel_restart_prepare
which invokes notifiers. So fine.
But with CONFIG_KEXEC_JUMP it does suspend which is way more complex -
it stops all other cpus so we are fine but still not sure about the last
one.

Any idea?

How does it work for e.g. ASYNC_PF?

> 
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 04/15] KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:08 AM, Avi Kivity  wrote:
> On 06/15/2012 10:07 PM, Christoffer Dall wrote:
>> From: Marc Zyngier 
>>
>> In order to avoid compilation failure when KVM is not compiled in,
>> guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
>> and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
>> the KVM code.
>>
>>
>> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>       struct mmu_notifier mmu_notifier;
>>       unsigned long mmu_notifier_seq;
>>       long mmu_notifier_count;
>> @@ -780,7 +780,7 @@ struct kvm_stats_debugfs_item {
>>  extern struct kvm_stats_debugfs_item debugfs_entries[];
>>  extern struct dentry *kvm_debugfs_dir;
>>
>> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>  static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long 
>> mmu_seq)
>>  {
>
> Why not have Kconfig select CONFIG_MMU_NOTIFIER?
>
>
Not sure I understand. Where would you select this option?

We do select this option when choosing to compile KVM on, but when we
do _not_, then other includes of kvm_host.h fails.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 06/15] ARM: KVM: Hypervisor identity mapping

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:12 AM, Avi Kivity  wrote:
> On 06/15/2012 10:07 PM, Christoffer Dall wrote:
>> Adds support in the identity mapping feature that allows KVM to setup
>> identity mapping for the Hyp mode with the AP[1] bit set as required by
>> the specification and also supports freeing created sub pmd's after
>> finished use.
>>
>> These two functions:
>>  - hyp_idmap_add(pgd, addr, end);
>>  - hyp_idmap_del(pgd, addr, end);
>> are essentially calls to the same function as the non-hyp versions but
>> with a different argument value. KVM calls these functions to setup
>> and teardown the identity mapping used to initialize the hypervisor.
>>
>> Note, the hyp-version of the _del function actually frees the pmd's
>> pointed to by the pgd as opposed to the non-hyp version which just
>> clears them.
>
>
> I asked previously what happens if two data structures share a page, and
> one of them is removed.  Is that handled now?  How?
>

I think you asked previously for the general hyp-mode mappings, not
the identity mappings. For the general hyp-mode mappings we simply
don't unmap the data structures, potentially leaking a few pages for
the page tables themselves.

This is only for initialization, so there are not really any data
structures mapped, only one/two pages to initialize the hypervisor
mode.

> Why not just identity map all memory?  You can use large pages so it's
> fast and doesn't consume a lot of page table memory.

That's an option, but it still seems like an awful waste since it's
only used once (unless you unload and re-load the module) and there's
really no problem with data structures here.

The truth is that this is going to go away, and the code will be put
in a section that's idmapped from kernel start. There's a patch under
way from Marc taking care of this which I assmue we'll merge for v9.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] PCI: Mark INTx masking support of Chelsio T310 10GbE NIC as broken

2012-06-18 Thread Bjorn Helgaas
On Thu, Jun 7, 2012 at 2:30 AM, Jan Kiszka  wrote:
> From: Jan Kiszka 
>
> According to
>
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/91388
>
> the T310 does not properly support INTx masking as it fails to keep the
> PCI_STATUS_INTERRUPT bit updated once the interrupt is masked. Mark this
> adapter as broken so that pci_intx_mask_supported won't report it as
> compatible.
>
> Tested-by: Alexey Kardashevskiy 
> Signed-off-by: Jan Kiszka 
> ---
>
> Changes in v2:
>  - Fixed device ID to Alexey's report
>  - Added reference to the original report
>
>  drivers/pci/pci.c    |    3 +++
>  drivers/pci/quirks.c |   12 
>  include/linux/pci.h  |    2 ++
>  3 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 447e834..9ae517a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2876,6 +2876,9 @@ bool pci_intx_mask_supported(struct pci_dev *dev)
>        bool mask_supported = false;
>        u16 orig, new;
>
> +       if (dev->broken_intx_masking)
> +               return false;
> +
>        pci_cfg_access_lock(dev);
>
>        pci_read_config_word(dev, PCI_COMMAND, &orig);
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 2a75216..28804f4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2929,6 +2929,18 @@ static void __devinit disable_igfx_irq(struct pci_dev 
> *dev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
>
> +/*
> + * Some devices may pass our check in pci_intx_mask_supported if
> + * PCI_COMMAND_INTX_DISABLE works though they actually do not properly
> + * support this feature.
> + */
> +static void __devinit quirk_broken_intx_masking(struct pci_dev *dev)
> +{
> +       dev->broken_intx_masking = 1;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x0030,
> +                       quirk_broken_intx_masking);
> +
>  static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
>                          struct pci_fixup *end)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index d8c379d..7ea6cf8 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -324,6 +324,8 @@ struct pci_dev {
>        unsigned int    is_hotplug_bridge:1;
>        unsigned int    __aer_firmware_first_valid:1;
>        unsigned int    __aer_firmware_first:1;
> +       unsigned int    broken_intx_masking:1;  /* device's INTx masking
> +                                                  support is not working */
>        pci_dev_flags_t dev_flags;
>        atomic_t        enable_cnt;     /* pci_enable_device has been called */
>
> --
> 1.7.3.4

I applied this to my "next" branch with minor tweaks (split the
infrastructure part from the device-specific quirk, and changed from a
FINAL quirk to a HEADER quirk because FINAL quirks currently aren't
called for hot-added devices).  Let me know if you see anything wrong.

Thanks!

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] PCI: Add Ralink RT2800 broken INTx masking quirk

2012-06-18 Thread Bjorn Helgaas
On Thu, Jun 7, 2012 at 11:01 AM, Alex Williamson
 wrote:
> Passes pci_intx_mask_supported test but continues to send interrupts
> as discovered through VFIO-based device assignment.
>
> http://www.spinics.net/lists/kvm/msg73738.html
>
> Signed-off-by: Alex Williamson 
> Tested-by: Andreas Hartmann 
> ---
>
> v2: context-only respin for Jan's v2 patch
>
> Depends on Jan's patch:
> http://www.spinics.net/lists/linux-pci/msg15720.html
>
>  drivers/pci/quirks.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 87b9edd..3681678 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2940,6 +2940,8 @@ static void __devinit quirk_broken_intx_masking(struct 
> pci_dev *dev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x0030,
>                        quirk_broken_intx_masking);
> +DECLARE_PCI_FIXUP_FINAL(0x1814, 0x0601, /* Ralink RT2800 802.11n PCI */
> +                       quirk_broken_intx_masking);
>
>  static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
>                          struct pci_fixup *end)
>

Applied to my "next" branch (as HEADER quirk, not FINAL).  Thanks!

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 6/9] KVM: MMU: fast path of handling guest page fault

2012-06-18 Thread Marcelo Tosatti
On Thu, Jun 14, 2012 at 10:22:14AM +0900, Takuya Yoshikawa wrote:
> On Wed, 13 Jun 2012 19:40:02 -0300
> Marcelo Tosatti  wrote:
> 
> > mmu_spte_update is handling several different cases. Please rewrite
> > it, add a comment on top of it (or spread comments on top of each
> > significant code line) with all cases it is handling (also recheck it
> > regarding new EPT accessed/dirty bits code).
> 
> [not about this patch]
> 
> EPT accessed/dirty bits will be used for more things in the future.
> Are there any rules for using these bits?
> 
> Same as other bits?

Do you mean hardware rules or KVM rules?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian

2012-06-18 Thread Jimmy Crossley
> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
> Of Jimmy Crossley
> Sent: Friday, June 08, 2012 17:29
> To: kvm@vger.kernel.org
> Subject: Garbled audio - Windows 7 64 bit guest on Debian
> 
> I am experiencing garbled sound on a Windows 7 64 bit guest running under 64 
> bit Debian.  I have
> searched many discussion groups, etc. on the net and could find nothing 
> useful, so I thought I would
> post this here, hoping someone with a deeper understanding could help out.
> 
> I am running 64 bit Debian GNU/Linux "testing" on a Lenovo Thinkpad T420s, 
> 8GB RAM, Crucial m4 512GB
> disk.  Using ALSA, the sound works perfectly when playing all kinds of files. 
>  I am running
> everything from the standard Debian repositories.  This is using kernel 
> version 3.0.0-1.  kvm --
> version says:
> 
>   QEMU emulator version 1.0 (qemu-kvm-1.0+dfsg-11, Debian), Copyright (c) 
> 2003-2008 Fabrice Bellard
> 
> I am running Windows 7 Enterprise service pack 1 as a virtual machine.  The 
> machine is started with
> the following command:
> 
>   /usr/bin/kvm \
>   -M pc \
>   -m 2048 \
>   -drive file="csilt033-6.qcow",if=virtio \
>   -soundhw hda \
>   -display sdl &
> 
> I believe I must use "-soundhw hda" because that is the only sound device 
> that kvm emulates which
> has Windows 7 64 bit drivers.  Windows 7 automatically detects a "High 
> Definition Audio Device" and
> uses the driver HDAudio.sys.  But the sounds it makes sound like they are 
> very highly compressed, or
> a lot of bits are being lost or something.  Any application used to play 
> sound has the same problem.
> Speech is barely recognizable.  Using the built in speakers or external 
> speakers through the audio
> jack give the same result.
> 
> I have tried several QEMU_AUDIO* settings, but it seems like these make no 
> difference.
> 
> If I connect to the machine using remote desktop (mstsc.exe, rdesktop, 
> xfreerdp), the sound gets
> redirected to the local machine and sounds perfect.  The sound is only 
> garbled when using SDL.  The
> same audio problems exist if I start up the machine and connect to it with 
> vnc.
> 
> 
> I want to use SDL because I need to use Microsoft Lync, which requires me to 
> log in on the console.
> Using voice with Lync is not supported in a remote desktop environment.
> 
> Here is lspci on the audio device on the host machine.
> 
> $ sudo lspci -vvv -s 00:1b.0
> 00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family 
> High Definition Audio
> Controller (rev 04)
> Subsystem: Lenovo Device 21d2
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B-
> DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> SERR-
>  Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 51
> Region 0: Memory at f162 (64-bit, non-prefetchable) [size=16K]
> Capabilities: [50] Power Management version 2
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: fee004b8  Data: 
> Capabilities: [70] Express (v1) Root Complex Integrated Endpoint, MSI 
> 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, 
> L1 <1us
> ExtTag- RBE- FLReset+
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 128 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
> TransPend-
> LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, 
> Latency L0 <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- 
> DLActive- BWMgmt- ABWMgmt-
> Capabilities: [100 v1] Virtual Channel
> Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb:Fixed- WRR32- WRR64- WRR128-
> Ctrl:   ArbSelect=Fixed
> Status: InProgress-
> VC0:Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> Status: NegoPending- InProgress-
> VC1:Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>

Re: [PATCH v6 6/9] KVM: MMU: fast path of handling guest page fault

2012-06-18 Thread Marcelo Tosatti
On Thu, Jun 14, 2012 at 11:00:14AM +0800, Xiao Guangrong wrote:
> On 06/14/2012 06:40 AM, Marcelo Tosatti wrote:
> 
> > On Tue, May 29, 2012 at 02:50:32PM +0800, Xiao Guangrong wrote:
> >> If the the present bit of page fault error code is set, it indicates
> >> the shadow page is populated on all levels, it means what we do is
> >> only modify the access bit which can be done out of mmu-lock
> >>
> >> Currently, in order to simplify the code, we only fix the page fault
> >> caused by write-protect on the fast path
> >>
> >> Signed-off-by: Xiao Guangrong 
> >> ---
> >>  arch/x86/kvm/mmu.c |  126 
> >> +++-
> >>  1 files changed, 114 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 150c5ad..d6101a8 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -445,6 +445,11 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
> >>  }
> >>  #endif
> >>
> >> +static bool spte_can_be_writable(u64 spte)
> >> +{
> >> +  return !(~spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE));
> >> +}
> >> +
> > 
> > spte_is_locklessly_modifiable(). Its easy to confuse
> > "spte_can_be_writable" with different things.
> > 
> 
> 
> Yes. Will update it.
> 
> >>  static bool spte_has_volatile_bits(u64 spte)
> >>  {
> >>if (!shadow_accessed_mask)
> >> @@ -454,7 +459,7 @@ static bool spte_has_volatile_bits(u64 spte)
> >>return false;
> >>
> >>if ((spte & shadow_accessed_mask) &&
> >> -(!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> >> +(!spte_can_be_writable(spte) || (spte & shadow_dirty_mask)))
> >>return false;
> > 
> > mmu_spte_update is handling several different cases. Please rewrite
> > it, add a comment on top of it (or spread comments on top of each
> > significant code line) with all cases it is handling (also recheck it
> > regarding new EPT accessed/dirty bits code).
> > 
> 
> 
> Okay.
> 
> > For one thing, if spte can be updated locklessly the update must be
> > atomic:
> > 
> > if spte can be locklessly updated
> > read-and-modify must be atomic.
> 
> 
> Actually, i did it in the v5, Avi has some comments on that. Please
> see https://lkml.org/lkml/2012/5/24/55
> 
> What the reason we should locklessly update spte here? So far i know
> is for volatile bit lost and getting a stable is_writable_spte()?

Yes.

> But this two cases can be avoided by using spte_can_be_writable(spte)
> instead of is_writable_pte(spte), right?

Well, yes, but it becomes confusing: this optimization is always going
to consider sptes that can be locklessly updated as dirty, even though 
they are read-only. Is that what is wanted?

Ok, if you/Avi want to avoid an atomic read-and-update, please
introduce it later an as optimization patch.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-18 Thread Marcelo Tosatti
On Mon, Jun 18, 2012 at 12:50:10PM +0300, Avi Kivity wrote:
> On 06/16/2012 05:11 AM, Marcelo Tosatti wrote:
> > 
> > Can you introduce a separate exit reason, say KVM_EXIT_READ_FAULT, with
> > information about the fault?
> 
> I think you mean WRITE_FAULT.  

Yes.

> But what's wrong with the normal mmio exit?

It is necessary to perform an address->mmio region lookup, to verify
whether the mmio exit is due to an actual mmio (no memory slot) or from
a write access to a write protected slot. That information is readily
available in the kernel but is lost if the mmio exit is used to transmit 
the information.

Moreover, i'd argue the uses are different: one is an mmio emulation
exit, the other is more like handling a pagefault in qemu.

> > Then perform this exit only if userspace allows it by explicit enable, 
> > and by default have the exit_read_fault handler jump to the mmio
> > handler. 
> 
> 
> I don't get this.


CAN USERSPACE HANDLE WRITE FAULT EXITS?
YES: WRITE FAULT EXIT.
NO: MMIO EXIT.

But then again userspace won't set read-only slots if it does not know
about them. So it is not necessary.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Eduardo Habkost
On Sun, Jun 17, 2012 at 01:12:31PM -0700, Chegu Vinod wrote:
> The -numa option to qemu is used to create [fake] numa nodes
> and expose them to the guest OS instance.
> 
> There are a couple of issues with the -numa option:
> 
> a) Max VCPU's that can be specified for a guest while using
>the qemu's -numa option is 64. Due to a typecasting issue
>when the number of VCPUs is > 32 the VCPUs don't show up
>under the specified [fake] numa nodes.
> 
> b) KVM currently has support for 160VCPUs per guest. The
>qemu's -numa option has only support for upto 64VCPUs
>per guest.
> 
> This patch addresses these two issues. [ Note: This
> patch has been verified by Eduardo Habkost ].

I was going to add a Tested-by line, but this patch breaks the automatic
round-robin assignment when no CPU is added to any node on the
command-line.

Additional questions below:

[...]
> diff --git a/cpus.c b/cpus.c
> index b182b3d..f9cee60 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1145,7 +1145,7 @@ void set_numa_modes(void)
>  
>  for (env = first_cpu; env != NULL; env = env->next_cpu) {
>  for (i = 0; i < nb_numa_nodes; i++) {
> -if (node_cpumask[i] & (1 << env->cpu_index)) {
> +if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
>  env->numa_node = i;
>  }
>  }
> diff --git a/hw/pc.c b/hw/pc.c
> index 8368701..f0c3665 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
>  numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
>  for (i = 0; i < max_cpus; i++) {
>  for (j = 0; j < nb_numa_nodes; j++) {
> -if (node_cpumask[j] & (1 << i)) {
> +if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
>  numa_fw_cfg[i + 1] = cpu_to_le64(j);
>  break;
>  }
> diff --git a/sysemu.h b/sysemu.h
> index bc2c788..6e4d342 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -9,6 +9,7 @@
>  #include "qapi-types.h"
>  #include "notify.h"
>  #include "main-loop.h"
> +#include 
>  
>  /* vl.c */
>  
> @@ -133,9 +134,11 @@ extern uint8_t qemu_extra_params_fw[2];
>  extern QEMUClock *rtc_clock;
>  
>  #define MAX_NODES 64
> +#define KVM_MAX_VCPUS 254

Do we really need this constant? Why not just use max_cpus when
allocating the CPU sets, instead?


>  extern int nb_numa_nodes;
>  extern uint64_t node_mem[MAX_NODES];
> -extern uint64_t node_cpumask[MAX_NODES];
> +extern cpu_set_t *node_cpumask[MAX_NODES];
> +extern size_t cpumask_size;
>  
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
> diff --git a/vl.c b/vl.c
> index 204d85b..1906412 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Needed early for CONFIG_BSD etc. */
>  #include "config-host.h"
> @@ -240,7 +241,8 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
> QTAILQ_HEAD_INITIALIZER(fw_boot_order
>  
>  int nb_numa_nodes;
>  uint64_t node_mem[MAX_NODES];
> -uint64_t node_cpumask[MAX_NODES];
> +cpu_set_t *node_cpumask[MAX_NODES];
> +size_t cpumask_size;
>  
>  uint8_t qemu_uuid[16];
>  
> @@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
>  char *endptr;
>  unsigned long long value, endvalue;
>  int nodenr;
> +int i;
> +
> +value = endvalue = 0;
>  
>  optarg = get_opt_name(option, 128, optarg, ',') + 1;
>  if (!strcmp(option, "node")) {
> @@ -970,27 +975,17 @@ static void numa_add(const char *optarg)
>  }
>  node_mem[nodenr] = sval;
>  }
> -if (get_param_value(option, 128, "cpus", optarg) == 0) {
> -node_cpumask[nodenr] = 0;
> -} else {
> +if (get_param_value(option, 128, "cpus", optarg) != 0) {
>  value = strtoull(option, &endptr, 10);
> -if (value >= 64) {
> -value = 63;
> -fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
> +if (*endptr == '-') {
> +endvalue = strtoull(endptr+1, &endptr, 10);
>  } else {
> -if (*endptr == '-') {
> -endvalue = strtoull(endptr+1, &endptr, 10);
> -if (endvalue >= 63) {
> -endvalue = 62;
> -fprintf(stderr,
> -"only 63 CPUs in NUMA mode supported.\n");
> -}
> -value = (2ULL << endvalue) - (1ULL << value);
> -} else {
> -value = 1ULL << value;
> -}
> +endvalue = value;
> +}
> +
> +for (i = value; i <= endvalue; ++i) {
> +CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);

What if endvalue is larger than the cpu mask size we allocated?


>  }
> -node_cpumask[nodenr] = value;
>  }
>  nb_numa_nodes++;
>  }
> @@ -2331,7 +2326,9 @@ int main(int argc, char 

Re: [PATCH v8 10/15] ARM: KVM: Inject IRQs and FIQs from userspace

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:32 AM, Avi Kivity  wrote:
> On 06/15/2012 10:08 PM, Christoffer Dall wrote:
>> From: Christoffer Dall 
>>
>> Userspace can inject IRQs and FIQs through the KVM_IRQ_LINE VM ioctl.
>> This ioctl is used since the sematics are in fact two lines that can be
>> either raised or lowered on the VCPU - the IRQ and FIQ lines.
>>
>> KVM needs to know which VCPU it must operate on and whether the FIQ or
>> IRQ line is raised/lowered. Hence both pieces of information is packed
>> in the kvm_irq_level->irq field. The irq fild value will be:
>>   IRQ: vcpu_index << 1
>>   FIQ: (vcpu_index << 1) | 1
>>
>> This is documented in Documentation/kvm/api.txt.
>>
>> The effect of the ioctl is simply to simply raise/lower the
>> corresponding irq_line field on the VCPU struct, which will cause the
>> world-switch code to raise/lower virtual interrupts when running the
>> guest on next switch. The wait_for_interrupt flag is also cleared for
>> raised IRQs or FIQs causing an idle VCPU to become active again. CPUs
>> in guest mode are kicked to make sure they refresh their interrupt status.
>
>>
>> +static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
>> +                                   struct kvm_irq_level *irq_level)
>> +{
>> +     int mask;
>> +     unsigned int vcpu_idx;
>> +     struct kvm_vcpu *vcpu;
>> +     unsigned long old, new, *ptr;
>> +
>> +     vcpu_idx = irq_level->irq >> 1;
>> +     if (vcpu_idx >= KVM_MAX_VCPUS)
>> +             return -EINVAL;
>> +
>> +     vcpu = kvm_get_vcpu(kvm, vcpu_idx);
>> +     if (!vcpu)
>> +             return -EINVAL;
>> +
>> +     if ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE)
>> +             mask = HCR_VI;
>> +     else /* KVM_ARM_FIQ_LINE */
>> +             mask = HCR_VF;
>> +
>> +     trace_kvm_set_irq(irq_level->irq, irq_level->level, 0);
>> +
>> +     ptr = (unsigned long *)&vcpu->arch.irq_lines;
>> +     do {
>> +             old = ACCESS_ONCE(*ptr);
>> +             if (irq_level->level)
>> +                     new = old | mask;
>> +             else
>> +                     new = old & ~mask;
>> +
>> +             if (new == old)
>> +                     return 0; /* no change */
>> +     } while (cmpxchg(ptr, old, new) != old);
>
> Isn't this a complicated
>
>
>   if (level)
>       set_bit()
>   else
>       clear_bit()
>
> ?
>

we need to atomically know if we changed either the FIQ/IRQ fields and
atomically update without locking. (I think you suggested this
approach, in fact).

>> +
>> +     /*
>> +      * The vcpu irq_lines field was updated, wake up sleeping VCPUs and
>> +      * trigger a world-switch round on the running physical CPU to set the
>> +      * virtual IRQ/FIQ fields in the HCR appropriately.
>> +      */
>> +     kvm_vcpu_kick(vcpu);
>
> No need to wake when the line is asserted so you can make this
> conditional on level.
>

we need to trigger a world switch to update the virtualized register
from the actual running physical CPU if we changed any of the IRQ/FIQ
fields. We return in the loop above if we didn't change anything.

>> +
>> +     return 0;
>> +}
>> +
>>  long kvm_arch_vcpu_ioctl(struct file *filp,
>>                        unsigned int ioctl, unsigned long arg)
>>  {
>> @@ -298,7 +345,20 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
>> kvm_dirty_log *log)
>>  long kvm_arch_vm_ioctl(struct file *filp,
>>                      unsigned int ioctl, unsigned long arg)
>>  {
>> -     return -EINVAL;
>> +     struct kvm *kvm = filp->private_data;
>> +     void __user *argp = (void __user *)arg;
>> +
>> +     switch (ioctl) {
>> +     case KVM_IRQ_LINE: {
>> +             struct kvm_irq_level irq_event;
>> +
>> +             if (copy_from_user(&irq_event, argp, sizeof irq_event))
>> +                     return -EFAULT;
>> +             return kvm_arch_vm_ioctl_irq_line(kvm, &irq_event);
>> +     }
>> +     default:
>> +             return -EINVAL;
>> +     }
>>  }
>
> Should be in common code guarded by the define introduced previously.
>
>

you mean like this: ?


diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a9f209b..5bf2193 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -535,8 +535,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu,
struct kvm_run *run)
return ret;
 }

-static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
- struct kvm_irq_level *irq_level)
+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
 {
int mask;
unsigned int vcpu_idx;
@@ -596,20 +595,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log)
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
-   struct kvm *kvm = filp->private_data;
-   void __user *argp = (void __user *)arg;
-
-   switch (ioctl) {
-   case KVM_IRQ_LINE: {
-   struct kvm_irq_level irq_event;
-
-   if (copy_from_user(&irq_event, argp,

Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Tejun Heo
Hello, guys.

On Mon, Jun 18, 2012 at 07:35:22PM +0930, Rusty Russell wrote:
> On Mon, 18 Jun 2012 16:03:23 +0800, Asias He  wrote:
> > On 06/18/2012 03:46 PM, Rusty Russell wrote:
> > > On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:
> > >> This patch introduces bio-based IO path for virtio-blk.
> > >
> > > Why make it optional?
> > 
> > request-based IO path is useful for users who do not want to bypass the 
> > IO scheduler in guest kernel, e.g. users using spinning disk. For users 
> > using fast disk device, e.g. SSD device, they can use bio-based IO path.
> 
> Users using a spinning disk still get IO scheduling in the host though.
> What benefit is there in doing it in the guest as well?

Another thing is that some of cgroup features are impelmented in IO
scheduler (cfq).  e.g. bio based driver will be able to use cgroup
based throttling but IO weights won't work.  Not sure how meaningful
this is tho.

With that said, I think this sort of feature switching is quite ugly.
The pros and cons of each choice aren't obvious unless one is familiar
with implementation details.  IMHO, if the benefits of ioscheds aren't
critical, it would be better to just go with bio based implementation.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] block: Introduce __blk_segment_map_sg() helper

2012-06-18 Thread Tejun Heo
Hello, Asias.

On Mon, Jun 18, 2012 at 02:53:08PM +0800, Asias He wrote:
> Split the mapping code in blk_rq_map_sg() to a helper
> __blk_segment_map_sg(), so that other mapping function, e.g.
> blk_bio_map_sg(), can share the code.
> 
> Cc: Jens Axboe 
> Cc: Tejun Heo 
> Cc: Shaohua Li 
> Cc: linux-ker...@vger.kernel.org
> Suggested-by: Tejun Heo 
> Suggested-by: Jens Axboe 
> Signed-off-by: Asias He 
> ---
>  block/blk-merge.c |   80 
> ++---
>  1 file changed, 45 insertions(+), 35 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 160035f..576b68e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -110,6 +110,49 @@ static int blk_phys_contig_segment(struct request_queue 
> *q, struct bio *bio,
>   return 0;
>  }
>  
> +static void
> +__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
> +  struct scatterlist *sglist, struct bio_vec **bvprv,
> +  struct scatterlist **sg, int *nsegs, int *cluster)
> +{
> +
> + int nbytes = bvec->bv_len;
> +
> + if (*bvprv && *cluster) {
> + if ((*sg)->length + nbytes > queue_max_segment_size(q))
> + goto new_segment;
> +
> + if (!BIOVEC_PHYS_MERGEABLE(*bvprv, bvec))
> + goto new_segment;
> + if (!BIOVEC_SEG_BOUNDARY(q, *bvprv, bvec))
> + goto new_segment;
> +
> + (*sg)->length += nbytes;
> + } else {
> +new_segment:
> + if (!*sg)
> + *sg = sglist;
> + else {
> + /*
> +  * If the driver previously mapped a shorter
> +  * list, we could see a termination bit
> +  * prematurely unless it fully inits the sg
> +  * table on each mapping. We KNOW that there
> +  * must be more entries here or the driver
> +  * would be buggy, so force clear the
> +  * termination bit to avoid doing a full
> +  * sg_init_table() in drivers for each command.
> +  */
> + (*sg)->page_link &= ~0x02;
> + *sg = sg_next(*sg);
> + }
> +
> + sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
> + (*nsegs)++;
> + }
> + *bvprv = bvec;
> +}

I *hope* this is a bit prettier.  e.g. Do we really need to pass in
@sglist and keep using "goto new_segment"?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Chegu Vinod

On 6/18/2012 1:29 PM, Eduardo Habkost wrote:

On Sun, Jun 17, 2012 at 01:12:31PM -0700, Chegu Vinod wrote:

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) Max VCPU's that can be specified for a guest while using
the qemu's -numa option is 64. Due to a typecasting issue
when the number of VCPUs is>  32 the VCPUs don't show up
under the specified [fake] numa nodes.

b) KVM currently has support for 160VCPUs per guest. The
qemu's -numa option has only support for upto 64VCPUs
per guest.

This patch addresses these two issues. [ Note: This
patch has been verified by Eduardo Habkost ].

I was going to add a Tested-by line, but this patch breaks the automatic
round-robin assignment when no CPU is added to any node on the
command-line.


Pl. see below.


Additional questions below:

[...]

diff --git a/cpus.c b/cpus.c
index b182b3d..f9cee60 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1145,7 +1145,7 @@ void set_numa_modes(void)

  for (env = first_cpu; env != NULL; env = env->next_cpu) {
  for (i = 0; i<  nb_numa_nodes; i++) {
-if (node_cpumask[i]&  (1<<  env->cpu_index)) {
+if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
  env->numa_node = i;
  }
  }
diff --git a/hw/pc.c b/hw/pc.c
index 8368701..f0c3665 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
  numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
  for (i = 0; i<  max_cpus; i++) {
  for (j = 0; j<  nb_numa_nodes; j++) {
-if (node_cpumask[j]&  (1<<  i)) {
+if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
  numa_fw_cfg[i + 1] = cpu_to_le64(j);
  break;
  }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..6e4d342 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -9,6 +9,7 @@
  #include "qapi-types.h"
  #include "notify.h"
  #include "main-loop.h"
+#include

  /* vl.c */

@@ -133,9 +134,11 @@ extern uint8_t qemu_extra_params_fw[2];
  extern QEMUClock *rtc_clock;

  #define MAX_NODES 64
+#define KVM_MAX_VCPUS 254

Do we really need this constant? Why not just use max_cpus when
allocating the CPU sets, instead?


Hmm I had thought about this earlier too max_cpus was not getting 
set at the point where the allocations were being done. I have now moved 
that code to a bit later and switched to using

max_cpus.





  extern int nb_numa_nodes;
  extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern cpu_set_t *node_cpumask[MAX_NODES];
+extern size_t cpumask_size;

  #define MAX_OPTION_ROMS 16
  typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
  #include
  #include
  #include
+#include

  /* Needed early for CONFIG_BSD etc. */
  #include "config-host.h"
@@ -240,7 +241,8 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
QTAILQ_HEAD_INITIALIZER(fw_boot_order

  int nb_numa_nodes;
  uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+cpu_set_t *node_cpumask[MAX_NODES];
+size_t cpumask_size;

  uint8_t qemu_uuid[16];

@@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
  char *endptr;
  unsigned long long value, endvalue;
  int nodenr;
+int i;
+
+value = endvalue = 0;

  optarg = get_opt_name(option, 128, optarg, ',') + 1;
  if (!strcmp(option, "node")) {
@@ -970,27 +975,17 @@ static void numa_add(const char *optarg)
  }
  node_mem[nodenr] = sval;
  }
-if (get_param_value(option, 128, "cpus", optarg) == 0) {
-node_cpumask[nodenr] = 0;
-} else {
+if (get_param_value(option, 128, "cpus", optarg) != 0) {
  value = strtoull(option,&endptr, 10);
-if (value>= 64) {
-value = 63;
-fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+if (*endptr == '-') {
+endvalue = strtoull(endptr+1,&endptr, 10);
  } else {
-if (*endptr == '-') {
-endvalue = strtoull(endptr+1,&endptr, 10);
-if (endvalue>= 63) {
-endvalue = 62;
-fprintf(stderr,
-"only 63 CPUs in NUMA mode supported.\n");
-}
-value = (2ULL<<  endvalue) - (1ULL<<  value);
-} else {
-value = 1ULL<<  value;
-}
+endvalue = value;
+}
+
+for (i = value; i<= endvalue; ++i) {
+CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);

What if endvalue is larger than the cpu mask size we allocated?


I can add a check (endvalue >= max_cpus) and print an error.
Should we force set the endvalue to max_cpus-1 in that case ?


Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Andreas Färber
Am 17.06.2012 22:12, schrieb Chegu Vinod:
> diff --git a/vl.c b/vl.c
> index 204d85b..1906412 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

Did you check whether this and the macros you're using are available on
POSIX and mingw32? vl.c is a pretty central file.

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 11/15] ARM: KVM: World-switch implementation

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:41 AM, Avi Kivity  wrote:
> On 06/15/2012 10:08 PM, Christoffer Dall wrote:
>> Provides complete world-switch implementation to switch to other guests
>> running in non-secure modes. Includes Hyp exception handlers that
>> capture necessary exception information and stores the information on
>> the VCPU and KVM structures.
>>
>> Switching to Hyp mode is done through a simple HVC instructions. The
>> exception vector code will check that the HVC comes from VMID==0 and if
>> so will store the necessary state on the Hyp stack, which will look like
>> this (growing downwards, see the hyp_hvc handler):
>>   ...
>>   Hyp_Sp + 4: spsr (Host-SVC cpsr)
>>   Hyp_Sp    : lr_usr
>>
>> When returning from Hyp mode to SVC mode, another HVC instruction is
>> executed from Hyp mode, which is taken in the Hyp_Svc handler. The Hyp
>> stack pointer should be where it was left from the above initial call,
>> since the values on the stack will be used to restore state (see
>> hyp_svc).
>>
>> Otherwise, the world-switch is pretty straight-forward. All state that
>> can be modified by the guest is first backed up on the Hyp stack and the
>> VCPU values is loaded onto the hardware. State, which is not loaded, but
>> theoretically modifiable by the guest is protected through the
>> virtualiation features to generate a trap and cause software emulation.
>> Upon guest returns, all state is restored from hardware onto the VCPU
>> struct and the original state is restored from the Hyp-stack onto the
>> hardware.
>>
>> One controversy may be the back-door call to __irq_svc (the host
>> kernel's own physical IRQ handler) which is called when a physical IRQ
>> exception is taken in Hyp mode while running in the guest.
>>
>> SMP support using the VMPIDR calculated on the basis of the host MPIDR
>> and overriding the low bits with KVM vcpu_id contributed by Marc Zyngier.
>>
>> Reuse of VMIDs has been implemented by Antonios Motakis and adapated from
>> a separate patch into the appropriate patches introducing the
>> functionality. Note that the VMIDs are stored per VM as required by the ARM
>> architecture reference manual.
>>
>> +
>> +/**
>> + * update_vttbr - Update the VTTBR with a valid VMID before the guest runs
>> + * @kvm      The guest that we are about to run
>> + *
>> + * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure 
>> the
>> + * VM has a valid VMID, otherwise assigns a new one and flushes 
>> corresponding
>> + * caches and TLBs.
>> + */
>> +static void update_vttbr(struct kvm *kvm)
>> +{
>> +     phys_addr_t pgd_phys;
>> +
>> +     spin_lock(&kvm_vmid_lock);
>
> Premature, but this is sad.  I suggest you split vmid generation from
> next available vmid.  This allows you to make the generation counter
> atomic so it may be read outside the lock.
>
> You can do
>
>    if (likely(kvm->arch.vmd_gen) == atomic_read(&kvm_vmid_gen)))
>           return;
>
>    spin_lock(...
>

I knew you were going to say something here :), please take a look at
this and see if you agree:

+struct kvm_arch {
+   /* The VMID generation used for the virt. memory system */
+   u64vmid_gen;
+   u32vmid;
+
+   /* 1-level 2nd stage table and lock */
+   struct mutex pgd_mutex;
+   pgd_t *pgd;
+
+   /* VTTBR value associated with above pgd and vmid */
+   u64vttbr;
+};

[snip]

+/* The VMID used in the VTTBR */
+static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
+static u8 kvm_next_vmid;
+DEFINE_SPINLOCK(kvm_vmid_lock);

[snip]

+/**
+ * update_vttbr - Update the VTTBR with a valid VMID before the guest runs
+ * @kvmThe guest that we are about to run
+ *
+ * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure the
+ * VM has a valid VMID, otherwise assigns a new one and flushes corresponding
+ * caches and TLBs.
+ */
+static void update_vttbr(struct kvm *kvm)
+{
+   phys_addr_t pgd_phys;
+
+   /*
+*  Check that the VMID is still valid.
+*  (The hardware supports only 256 values with the value zero
+*   reserved for the host, so we check if an assigned value belongs to
+*   a previous generation, which which requires us to assign a new
+*   value. If we're the first to use a VMID for the new generation,
+*   we must flush necessary caches and TLBs on all CPUs.)
+*/
+   if (likely(kvm->arch.vmid_gen == atomic64_read(&kvm_vmid_gen)))
+   return;
+
+   spin_lock(&kvm_vmid_lock);
+
+   /* First user of a new VMID generation? */
+   if (unlikely(kvm_next_vmid == 0)) {
+   atomic64_inc(&kvm_vmid_gen);
+   kvm_next_vmid = 1;
+
+   /* This does nothing on UP */
+   smp_call_function(reset_vm_context, NULL, 1);
+
+   /*
+* On SMP we know no other CPUs can use this CPU's or
+* each other's VMID since the kvm_vmid_lock blocks
+* them from reentry to t

Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Eric Blake
On 06/18/2012 04:05 PM, Andreas Färber wrote:
> Am 17.06.2012 22:12, schrieb Chegu Vinod:
>> diff --git a/vl.c b/vl.c
>> index 204d85b..1906412 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -28,6 +28,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
> 
> Did you check whether this and the macros you're using are available on
> POSIX and mingw32? vl.c is a pretty central file.

POSIX, yes.  mingw32, no.  Use of preprocessor conditionals is probably
in order.

-- 
Eric Blake   ebl...@redhat.com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Chegu Vinod

On 6/18/2012 3:11 PM, Eric Blake wrote:

On 06/18/2012 04:05 PM, Andreas Färber wrote:

Am 17.06.2012 22:12, schrieb Chegu Vinod:

diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
  #include
  #include
  #include
+#include

Did you check whether this and the macros you're using are available on
POSIX and mingw32? vl.c is a pretty central file.

POSIX, yes.  mingw32, no.  Use of preprocessor conditionals is probably
in order.


Thanks. I will look into this.
Vinod
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 13/15] ARM: KVM: Handle guest faults in KVM

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:45 AM, Avi Kivity  wrote:
> On 06/15/2012 10:09 PM, Christoffer Dall wrote:
>> From: Christoffer Dall 
>>
>> Handles the guest faults in KVM by mapping in corresponding user pages
>> in the 2nd stage page tables.
>>
>> Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
>> pgprot_guest variables used to map 2nd stage memory for KVM guests.
>>
>> Leverages MMU notifiers on KVM/ARM by supporting the kvm_unmap_hva() 
>> operation,
>> where we remove the HVA from the 2nd stage translation. All other KVM MMU
>> notifierhooks are NOPs.
>
> I think you must at least support change_pte (possibly by unmapping).
> Andrea?
>
hmmm, at least for KSM support we need to support change_pte (are
there other callers for this type of memory?)

It's not trivial I guess, since we would need to support COW and
thereby stage-2 permission faults... Marc, right?

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 14/15] ARM: KVM: Handle I/O aborts

2012-06-18 Thread Christoffer Dall
On Mon, Jun 18, 2012 at 9:48 AM, Avi Kivity  wrote:
> On 06/15/2012 10:09 PM, Christoffer Dall wrote:
>> When the guest accesses I/O memory this will create data abort
>> exceptions and they are handled by decoding the HSR information
>> (physical address, read/write, length, register) and forwarding reads
>> and writes to QEMU which performs the device emulation.
>>
>> Certain classes of load/store operations do not support the syndrome
>> information provided in the HSR and we therefore must be able to fetch
>> the offending instruction from guest memory and decode it manually.
>>
>> This requires changing the general flow somewhat since new calls to run
>> the VCPU must check if there's a pending MMIO load and perform the write
>> after userspace has made the data available.
>>
>>
>>       memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
>> index e474a0a..bd3a6cc 100644
>> --- a/arch/arm/kvm/trace.h
>> +++ b/arch/arm/kvm/trace.h
>> @@ -39,6 +39,27 @@ TRACE_EVENT(kvm_exit,
>>       TP_printk("PC: 0x%08lx", __entry->vcpu_pc)
>>  );
>>
>> +TRACE_EVENT(kvm_mmio_emulate,
>> +     TP_PROTO(unsigned long vcpu_pc, unsigned long instr,
>> +              unsigned long cpsr),
>> +     TP_ARGS(vcpu_pc, instr, cpsr),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        unsigned long,  vcpu_pc         )
>> +             __field(        unsigned long,  instr           )
>> +             __field(        unsigned long,  cpsr            )
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __entry->vcpu_pc                = vcpu_pc;
>> +             __entry->vcpu_pc                = instr;
>> +             __entry->vcpu_pc                = cpsr;
>
> -ECUTANDPASTE
>
hehe, nice, thanks.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] block: Introduce __blk_segment_map_sg() helper

2012-06-18 Thread Asias He

On 06/19/2012 05:31 AM, Tejun Heo wrote:

Hello, Asias.

On Mon, Jun 18, 2012 at 02:53:08PM +0800, Asias He wrote:

Split the mapping code in blk_rq_map_sg() to a helper
__blk_segment_map_sg(), so that other mapping function, e.g.
blk_bio_map_sg(), can share the code.

Cc: Jens Axboe 
Cc: Tejun Heo 
Cc: Shaohua Li 
Cc: linux-ker...@vger.kernel.org
Suggested-by: Tejun Heo 
Suggested-by: Jens Axboe 
Signed-off-by: Asias He 
---
  block/blk-merge.c |   80 ++---
  1 file changed, 45 insertions(+), 35 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 160035f..576b68e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -110,6 +110,49 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
return 0;
  }

+static void
+__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
+struct scatterlist *sglist, struct bio_vec **bvprv,
+struct scatterlist **sg, int *nsegs, int *cluster)
+{
+
+   int nbytes = bvec->bv_len;
+
+   if (*bvprv && *cluster) {
+   if ((*sg)->length + nbytes > queue_max_segment_size(q))
+   goto new_segment;
+
+   if (!BIOVEC_PHYS_MERGEABLE(*bvprv, bvec))
+   goto new_segment;
+   if (!BIOVEC_SEG_BOUNDARY(q, *bvprv, bvec))
+   goto new_segment;
+
+   (*sg)->length += nbytes;
+   } else {
+new_segment:
+   if (!*sg)
+   *sg = sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   (*sg)->page_link &= ~0x02;
+   *sg = sg_next(*sg);
+   }
+
+   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+   (*nsegs)++;
+   }
+   *bvprv = bvec;
+}


I *hope* this is a bit prettier.  e.g. Do we really need to pass in
@sglist and keep using "goto new_segment"?


I think this deserves another patch on top of this splitting one. I'd 
like to clean this up later.


--
Asias


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 6/9] KVM: MMU: fast path of handling guest page fault

2012-06-18 Thread Xiao Guangrong
On 06/19/2012 03:32 AM, Marcelo Tosatti wrote:

 
> Ok, if you/Avi want to avoid an atomic read-and-update, please
> introduce it later an as optimization patch.


Good to me, let's discus it in the separate patch. Thank you, Marcelo!

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 6/9] KVM: MMU: fast path of handling guest page fault

2012-06-18 Thread Takuya Yoshikawa
On Mon, 18 Jun 2012 16:21:20 -0300
Marcelo Tosatti  wrote:

> > [not about this patch]
> > 
> > EPT accessed/dirty bits will be used for more things in the future.
> > Are there any rules for using these bits?
> > 
> > Same as other bits?
> 
> Do you mean hardware rules or KVM rules?

KVM rules.

Your concern was about "lock-less" v. "current EPT-A/D code".
My question was what we should care not to break the former
in the future.

Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-18 Thread Xiao Guangrong
On 06/18/2012 06:11 PM, Avi Kivity wrote:


>>  static int check_memory_region_flags(struct kvm_userspace_memory_region 
>> *mem)
>>  {
>> -if (mem->flags & ~KVM_MEM_LOG_DIRTY_PAGES)
>> +if (mem->flags & ~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))
>>  return -EINVAL;
> 
> Only x86 supports readonly so far.
> 


Right, will fix.

>>
>> -static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t 
>> gfn,
>> - gfn_t *nr_pages)
>> +static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t 
>> gfn,
>> + gfn_t *nr_pages, bool write)
>>  {
>> -if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
>> +if (!slot || slot->flags & KVM_MEMSLOT_INVALID ||
>> +  ((slot->flags & KVM_MEM_READONLY) && write))
>>  return bad_hva();
>>
>>  if (nr_pages)
>> @@ -1045,6 +1046,12 @@ static unsigned long gfn_to_hva_many(struct 
>> kvm_memory_slot *slot, gfn_t gfn,
>>  return gfn_to_hva_memslot(slot, gfn);
>>  }
>>
>> +static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t 
>> gfn,
>> + gfn_t *nr_pages)
>> +{
>> +return __gfn_to_hva_many(slot, gfn, nr_pages, true);
>> +}
> 
> We have dozens of translation functions: read/write guest virtual, guest
> physical (nested and non nested), host virtual, host physical,
> atomic/nonatomic, sync/async, with/without slot lookup, and probably a
> few more I forgot.
> 


Yes, they make me confused me too.

> I think we should refactor this into a series of on-step translations:
> 
>/*
> * Translate gva/len write access to a number of tlb entries
> * (due to cross-page splits) or a fault
> */
>gva_to_tlb(gva, len, ACCESS_WRITE, &translation);
>/*
> * Translate tlb entries to callbacks that do I/O (either directly
> * or through KVM_EXIT_MMIO, provided there is no exception pending
> */
>tlb_to_io(&translation, &iolist, IO_ATOMIC);
>/*
> * Initiate I/O (if no exception)
> */
>run_iolist(&iolist, data);
> 
>struct gpa_scatterlist {
>unsigned nr_entries;
>struct {
>gpa_t gpa;
>unsigned len;
>} entry[2];
>struct x86_exception exception;
>};
> 
>struct kvm_iolist {
>unsigned nr_entries;
>struct kvm_ioentry {
>struct kvm_memslot *slot;  /* NULL for mmio */
>struct something *kernel_iodevice;
>gfn_t page_in_slot;
>unsigned offset_in_page;
>unsigned len;
>void (*iofunc)(struct kvm_ioentry *entry, void *data);
>} entry[2];
>struct x86_exception execption;
>};
> 
> This is of course outside the scope of this patchset, just something to
> think about (and write opinions on).


I will think it more and try to do it based on you idea after this patchset,
thank you, Avi!

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 4/6] KVM: pass slot to hva_to_pfn

2012-06-18 Thread Xiao Guangrong
On 06/18/2012 06:15 PM, Avi Kivity wrote:

> On 06/12/2012 05:48 AM, Xiao Guangrong wrote:
>> This parameter will be used in the later patch
>>
>>
>> -return hva_to_pfn(kvm, addr, atomic, async, write_fault, writable);
>> +return hva_to_pfn(kvm, slot, addr, atomic, async, write_fault,
>> +  writable);
>>  }
>>
>>  pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
>> @@ -1205,7 +1210,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
>>   struct kvm_memory_slot *slot, gfn_t gfn)
>>  {
>>  unsigned long addr = gfn_to_hva_memslot(slot, gfn);
>> -return hva_to_pfn(kvm, addr, false, NULL, true, NULL);
>> +return hva_to_pfn(kvm, slot, addr, false, NULL, true, NULL);
>>  }
>>
>>  pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
>> @@ -1213,7 +1218,7 @@ pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
>>  {
>>  unsigned long addr = gfn_to_hva_memslot(slot, gfn);
>>
>> -return hva_to_pfn(kvm, addr, true, NULL, true, NULL);
>> +return hva_to_pfn(kvm, slot, addr, true, NULL, true, NULL);
>>  }
> 
> It was unreadable before, but now it's even more unreadable.


Hmm, i will simplify hva_to_pfn() in the next version.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-18 Thread Asias He

On 06/18/2012 06:13 PM, Michael S. Tsirkin wrote:

On Mon, Jun 18, 2012 at 04:03:23PM +0800, Asias He wrote:

On 06/18/2012 03:46 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 14:53:10 +0800, Asias He  wrote:

This patch introduces bio-based IO path for virtio-blk.


Why make it optional?


request-based IO path is useful for users who do not want to bypass
the IO scheduler in guest kernel, e.g. users using spinning disk.
For users using fast disk device, e.g. SSD device, they can use
bio-based IO path.


OK I guess but then it should be per-device. There could be
a mix of slow and fast disks :)


Yes, per-device might be useful. There are issues which need solving.
- How do we tell the drive which IO path to use
  - Device add some flag
- Old qemu/lkvm can not turn this feature on
  - Through /sys filesystem attribute
- How do we handle the switch from one path to anther.

So, let's add the per-device feature later.

--
Asias


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/6] KVM: introduce gfn_to_hva_read/kvm_read_hva/kvm_read_hva_atomic

2012-06-18 Thread Xiao Guangrong
On 06/18/2012 06:16 PM, Avi Kivity wrote:

> On 06/12/2012 05:48 AM, Xiao Guangrong wrote:
>> This set of functions is only used to read data from host space, read is
>> a special case in the later patch
>>
>>
>> +/*
>> + * The hva returned by this function is only allowed to be read.
>> + * It should pair with kvm_read_hva() or kvm_read_hva_atomic().
>> + */
>> +static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn)
>> +{
>> +return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
>> +}
>> +
>> +static int kvm_read_hva(void *data, void *hva, int len)
>> +{
>> +return __copy_from_user(data, (void __user *)hva, len);
>> +
>> +}
>> +
>> +static int kvm_read_hva_atomic(void *data, void *hva, int len)
>> +{
>> +return __copy_from_user_inatomic(data, (void __user *)hva, len);
>> +
>> +}
> 
> Why cast to __user?  Make it __user in the first place.
> 


Okay.

> Also these are just simple wrappers, why to them at all?


For gfn_to_hva_read(), they will call gfn_to_hva_many(...write = false)
in the later patch.

For kvm_read_hva() and kvm_read_hva_atomic(), they are pure wrappers,
i think its name is a good hint to let gfn_to_hva_read to pair with
kvm_read_hva()/kvm_read_hva_atomic().

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >