date:20111204

Re: What are the rules for hyperthreaded CPUs?

2011-12-04 Thread Michael Tokarev

On 05.12.2011 05:26, Todd And Margo Chester wrote:
>> Hi All,
>>
>> My host has an i7-960 processor (4 cores, 8 hypterthreaded cores.)
>> /proc/cpuinfo on my host shows 8 processors.
>>
>> On my Fedora Core 16 guest, /proc/cpuinfo shows 4 processors.
>>
>> Question: what is the maximum number of processors I can assign to my
>> guest?  4 or 8?  What are the rules for Hyperthreading?
>>
>> Many thanks,
>> -T
> 
> Figured it out.  The max cores is the number of hyperthreaded cores
> (eight in my case).

The maximum number of guest vCPUs is unlimited - you can assign, say,
100 cores - each vCPU core is just a thread on host, it does not
correspond to any physical characteristics of your host.

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance

2011-12-04 Thread Liu ping fan

On Sun, Dec 4, 2011 at 8:10 PM, Gleb Natapov  wrote:
> On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
>> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka  wrote:
>> > On 2011-12-02 07:26, Liu Ping Fan wrote:
>> >> From: Liu Ping Fan 
>> >>
>> >> Currently, vcpu can be destructed only when kvm instance destroyed.
>> >> Change this to vcpu's destruction taken when its refcnt is zero,
>> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>> >
>> > I'm lacking the big picture yet (would be good to have in the change log
>> > - at least I'm too lazy to read the code):
>> >
>> > What increments the refcnt, what decrements it again? IOW, how does user
>> > space controls the life-cycle of a vcpu after your changes?
>> >
>> In local APIC mode, delivering IPI to target APIC, target's refcnt is
>> incremented, and decremented when finished. At other times, using RCU to
> Why is this needed?
>
Suppose the following scene:

#define kvm_for_each_vcpu(idx, vcpup, kvm) \
for (idx = 0; \
 idx < atomic_read(&kvm->online_vcpus) && \
 (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 idx++)

-->
Here kvm_vcpu's destruction is called
  vcpup->vcpu_id ...  //oops!


Regards,
ping fan



>> protect the vcpu's reference from its destruction.
>>
>> If kvm_vcpu is not needed by guest, user space can close the
>> kvm_vcpu's file
>> descriptors, and then,if the kvm_vcpu has crossed the period of local
>> APCI mode's reference,it will be destroyed.
>>
>> Regards,
>> ping fan
>>
>> > Thanks,
>> > Jan
>> >
>> > --
>> > Siemens AG, Corporate Technology, CT T DE IT 1
>> > Corporate Competence Center Embedded Linux
>
> --
>                        Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance

2011-12-04 Thread Liu ping fan

On Sun, Dec 4, 2011 at 6:23 PM, Avi Kivity  wrote:
> On 12/02/2011 08:26 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan 
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>>
>>
>> @@ -315,9 +315,17 @@ static void pit_do_work(struct work_struct *work)
>>                * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>>                * VCPU0, and only if its LVT0 is in EXTINT mode.
>>                */
>> -             if (kvm->arch.vapics_in_nmi_mode > 0)
>> -                     kvm_for_each_vcpu(i, vcpu, kvm)
>> +             if (kvm->arch.vapics_in_nmi_mode > 0) {
>> +                     rcu_read_lock();
>> +                     kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
>> +                             vcpu = kvm_get_vcpu(kvm, i);
>> +                             if (vcpu == NULL)
>> +                                     continue;
>> +                             cnt++;
>>                               kvm_apic_nmi_wd_deliver(vcpu);
>> +                     }
>> +                     rcu_read_unlock();
>> +             }
>>       }
>>  }
>
> This pattern keeps repeating, please fold it into kvm_for_each_vcpu().
>
What about folding
kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
              vcpu = kvm_get_vcpu(kvm, i);
              if (vcpu == NULL)
                        continue;
              cnt++;

like this,
#define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
 cnt < atomic_read(&kvm->online_vcpus) && \
 idx < KVM_MAX_VCPUS; \
 idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
 if (vcpup == NULL) \
  continue; \
 else


A little ugly, but have not thought a better way out :-)

Thanks,
ping fan
> --
> error compiling committee.c: too many arguments to function
>

Re: [KVM][Kemari]: Build error fix

2011-12-04 Thread OHMURA Kei

On 2011/12/02 21:51, Pradeep Kumar wrote:
> It fixes build failure.
> 
> I hit this error, after succsfull migration and sync.
> 
> (qemu) qemu-system-x86_64: fill buffer failed, Interrupted system call
> 
> qemu-system-x86_64: recv header failed
> 
> qemu-system-x86_64: recv ack failed
> 
> qemu_transaction_begin failed

Did you use master branch?
It is not latest version. next branch is latest and fixed error.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari next

Thanks,
Kei

> 
> Any one working on this now? 
> 
>>From 827c04da6574be80d8352acd7c40b0b4524af5f4 Mon Sep 17 00:00:00 2001
> Date: Fri, 2 Dec 2011 18:11:40 +0530
> Subject: [PATCH]  [Qemu][Kemari]: Build Failure
>Signed-off-by: pradeep 
>modified:   ft_trans_file.c
> 
> ---
>  ft_trans_file.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/ft_trans_file.c b/ft_trans_file.c
> index 4e33034..dc36757 100644
> --- a/ft_trans_file.c
> +++ b/ft_trans_file.c
> @@ -174,7 +174,7 @@ static int ft_trans_send_header(QEMUFileFtTrans *s,
>  static int ft_trans_put_buffer(void *opaque, const uint8_t *buf, int64_t 
> pos, int size)
>  {
>  QEMUFileFtTrans *s = opaque;
> -ssize_t ret;
> +ssize_t ret = 0;
>  
>  trace_ft_trans_put_buffer(size, pos);
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Can I configure cores instead of CPU's

2011-12-04 Thread Todd And Margo Chester


Hi All,

Scientific Linux 6.1 x64
qemu-kvm-0.12.1.2-2.160.el6_1.2.x86_64

My XP-Pro guest will only let me use two CPUs.

Is there a way I can tell Virt-Manager to use
one CPU with four cores instead of four separate
CPUs?

Many thanks,
-T



Hi Frank,

Figured out what I was doing wrong.  In virt-manager,
Processor, as soon as I clinked on the "Copy host CPU
configuration" button (Nelhalem, by the way), when
I went to "topology", I was able to configure:
   socket = 1
   cores  = 4
   thread = 1
And, this time XP-Pro saw the cores as cores and not
individual CPUs.

I even got fancy and tried
   socket = 1
   cores  = 2
   thread = 2
And that worked too.  Yipee!

My mistake was not clicking on the "Copy host CPU
configuration" button.

Thank you for all the help in getting me there.

Happy Camper,
-T
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

What are the rules for hyperthreaded CPUs?

2011-12-04 Thread Todd And Margo Chester


Hi All,

My host has an i7-960 processor (4 cores, 8 hypterthreaded cores.)
/proc/cpuinfo on my host shows 8 processors.

On my Fedora Core 16 guest, /proc/cpuinfo shows 4 processors.

Question: what is the maximum number of processors I can assign to my
guest?  4 or 8?  What are the rules for Hyperthreading?

Many thanks,
-T


Figured it out.  The max cores is the number of hyperthreaded cores
(eight in my case).

My mistake was that in virt-manager, processor, I was not clicking
on "Copy host CPU configuration" button.  Without that, everything
I tried to configure came out as single CPUs.

-T
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

What are the rules for hyperthreaded CPUs?

2011-12-04 Thread Todd And Margo Chester


Hi All,

My host has an i7-960 processor (4 cores, 8 hypterthreaded cores.)
/proc/cpuinfo on my host shows 8 processors.

On my Fedora Core 16 guest, /proc/cpuinfo shows 4 processors.

Question: what is the maximum number of processors I can assign to my
guest?  4 or 8?  What are the rules for Hyperthreading?

Many thanks,
-T
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Rusty Russell

On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity  wrote:
> On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > There's also the used ring, but that's a
> > > mistake if you have out of order completion.  We should have used copying.
> >
> > Seems unrelated... unless you want used to be written into
> > descriptor ring itself?
> 
> The avail/used rings are in addition to the regular ring, no?  If you
> copy descriptors, then it goes away.

There were two ideas which drove the current design:

1) The Van-Jacobson style "no two writers to same cacheline makes rings
   fast" idea.  Empirically, this doesn't show any winnage.
2) Allowing a generic inter-guest copy mechanism, so we could have
   genuinely untrusted driver domains.  Yet noone ever did this so it's
   hardly a killer feature :(

So if we're going to revisit and drop those requirements, I'd say:

1) Shared device/driver rings like Xen.  Xen uses device-specific ring
   contents, I'd be tempted to stick to our pre-headers, and a 'u64
   addr; u64 len_and_flags; u64 cookie;' generic style.  Then use
   the same ring for responses.  That's a slight space-win, since
   we're 24 bytes vs 26 bytes now.

2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
   Makes the indexing harder, but that -1 lets us stash the indices in
   the first entry and makes the ring a nice 2^n size.

> > > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > > just 1.5 ms.  In any case I think it's sufficient.
> >
> > Right. So I think that without indirect, we waste about 3 entries
> > per packet for virtio header and transport etc headers.
> 
> That does suck.  Are there issues in increasing the ring size?  Or
> making it discontiguous?

Because the qemu implementation is broken.  We can often put the virtio
header at the head of the packet.  In practice, the qemu implementation
insists the header be a single descriptor.

(At least, it used to, perhaps it has now been fixed.  We need a
VIRTIO_NET_F_I_NOW_CONFORM_TO_THE_DAMN_SPEC_SORRY_I_SUCK bit).

We currently use small rings: the guest can't negotiate so qemu has to
offer a lowest-common-denominator value.  The new virtio-pci layout
fixes this, and lets the guest set the ring size.

> Can you take a peek at how Xen manages its rings?  They have the same
> problems we do.

Yes, I made some mistakes, but I did steal from them in the first
place...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

On 2011-12-04 22:31, Blue Swirl wrote:
> On Sun, Dec 4, 2011 at 16:35, Avi Kivity  wrote:
>> On 12/04/2011 05:19 PM, Jan Kiszka wrote:

 In the sense that kernel-apic is just an accelerated apic.  From the
 guest point of view, there's no difference, and that should be reflected
 in the device model.
>>>
>>> That was my goal as well: The guest should not notice the difference,
>>> but the admin on the host side should still be able to tell both
>>> internally fairly different models apart.
>>
>> This should be some attribute, not the name.
>>
>>> Plus the code should be
>>> clearly split where there are differences and explicitly shared where
>>> there aren't.
>>
>> That's a good goal, yes.
> 
> I'd prefer an unified device built from a single source file if
> possible. This conflicts with the build-once model though.

Right, another reason to not do this.

> 
>>>

 If I'm reading an apic register, either from the guest or via a monitor
 debug interface, I shouldn't care whether it's accelerated or not.  The
 guest part already holds, of course.
>>>
>>> Specifically for the debug scenario, I'd prefer the clear
>>> differentiation by name as there can always remain subtle differences in
>>> the implementation of kernel vs. user space. Someone debugging the guest
>>> and/or qemu/kvm should remain aware of this.
>>
>> Aware, yes, but the name change is too drastic.
> 
> It should be also possible to migrate from non-KVM device to KVM
> version, different names would prevent that for ever.

It is (theoretically) possible with these patches as the vmstate names
are the same. KVM to TCG migration does not work right now, so I was
only able to test in-kernel <-> user space irqchip model migrations.

Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Blue Swirl

On Sun, Dec 4, 2011 at 16:35, Avi Kivity  wrote:
> On 12/04/2011 05:19 PM, Jan Kiszka wrote:
>> >
>> > In the sense that kernel-apic is just an accelerated apic.  From the
>> > guest point of view, there's no difference, and that should be reflected
>> > in the device model.
>>
>> That was my goal as well: The guest should not notice the difference,
>> but the admin on the host side should still be able to tell both
>> internally fairly different models apart.
>
> This should be some attribute, not the name.
>
>> Plus the code should be
>> clearly split where there are differences and explicitly shared where
>> there aren't.
>
> That's a good goal, yes.

I'd prefer an unified device built from a single source file if
possible. This conflicts with the build-once model though.

>>
>> >
>> > If I'm reading an apic register, either from the guest or via a monitor
>> > debug interface, I shouldn't care whether it's accelerated or not.  The
>> > guest part already holds, of course.
>>
>> Specifically for the debug scenario, I'd prefer the clear
>> differentiation by name as there can always remain subtle differences in
>> the implementation of kernel vs. user space. Someone debugging the guest
>> and/or qemu/kvm should remain aware of this.
>
> Aware, yes, but the name change is too drastic.

It should be also possible to migrate from non-KVM device to KVM
version, different names would prevent that for ever.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Veirfy memory slot only for readability

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 19:34 +0200, Avi Kivity wrote:
> On 12/04/2011 07:29 PM, Sasha Levin wrote:
> > On Sun, 2011-12-04 at 18:53 +0200, Avi Kivity wrote:
> > > On 12/02/2011 07:46 AM, Sasha Levin wrote:
> > > > > Do you want to create read only memory slots for kvm tool?
> > > >
> > > > What KVM tool currently does is copy the kernel into guest memory and
> > > > run it from there. An idea raised recently was instead of copying it we
> > > > should mmap it into the memory to reduce footprint.
> > > >
> > > > This is why I'm looking into adding a read only memory slot. The KVM
> > > > code doesn't have to know it's read only.
> > > 
> > > The kernel will patch itself very early.  You need to use MAP_PRIVATE
> > > (and thus have a read/write area).  It will be interesting to see what
> > > fraction of the memory is modified.
> > > 
> > > Note that mapping will remove benefits like huge page support, and that
> > > you can get page sharing by using ksm.  Still, it's interesting to see
> > > where this goes.
> >
> > Why would I lose hugepage if the kernel gets it's own memory slot?
> 
> (transparent) hugepages only work on anonymous memory.  Hopefully later
> it will be extended to work on mapped memory as well.

So I'll lose hugepages just for the kernel memory slot, I wonder how it
will matter performance-wise.

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] QEMU kvm/i386 : Adding KICK_VCPU capability support in i386 target.

2011-12-04 Thread Raghavendra K T

 Extend the KVM Hypervisor to enable KICK_VCPU feature that allows
a vcpu to kick the halted vcpu to continue with execution in PV ticket
spinlock.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Raghavendra K T 
---
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 5bfc21f..69bce21 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -97,6 +97,7 @@ struct kvm_para_features {
 { KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
 { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
 { KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
+{ KVM_CAP_KICK_VCPU, KVM_FEATURE_KICK_VCPU },
 { -1, -1 }
 };
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] QEMU kvm: Syncing linux headers to support KICK_VCPU capability

2011-12-04 Thread Raghavendra K T

Update the kernel header that adds a hypercall to support pv-ticketlocks.

Signed-off-by: Raghavendra K T 
---
diff --git a/linux-headers/asm-x86/kvm_para.h b/linux-headers/asm-x86/kvm_para.h
index f2ac46a..03d3a36 100644
--- a/linux-headers/asm-x86/kvm_para.h
+++ b/linux-headers/asm-x86/kvm_para.h
@@ -16,12 +16,14 @@
 #define KVM_FEATURE_CLOCKSOURCE0
 #define KVM_FEATURE_NOP_IO_DELAY   1
 #define KVM_FEATURE_MMU_OP 2
+
 /* This indicates that the new set of kvmclock msrs
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE23
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
+#define KVM_FEATURE_KICK_VCPU  6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 07bd557..47ab6ff 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_S390_GMAP 71
+#define KVM_CAP_KICK_VCPU 72
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/linux-headers/linux/kvm_para.h b/linux-headers/linux/kvm_para.h
index b315e27..e4a0e3e 100644
--- a/linux-headers/linux/kvm_para.h
+++ b/linux-headers/linux/kvm_para.h
@@ -19,6 +19,7 @@
 #define KVM_HC_MMU_OP  2
 #define KVM_HC_FEATURES3
 #define KVM_HC_PPC_MAP_MAGIC_PAGE  4
+#define KVM_HC_KICK_CPU5
 
 /*
  * hypercalls use architecture specific

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] QEMU kvm: Syncing linux headers to 3.2.0-rc1

2011-12-04 Thread Raghavendra K T

 Update the kvm kernel headers to the 3.2.0-rc1 post using
scripts/update-linux-headers.sh script.

Signed-off-by: Raghavendra K T 
---
diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index fb3fddc..08fe69e 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -149,6 +149,12 @@ struct kvm_regs {
 #define KVM_SREGS_E_UPDATE_DBSR(1 << 3)
 
 /*
+ * Book3S special bits to indicate contents in the struct by maintaining
+ * backwards compatibility with older structs. If adding a new field,
+ * please make sure to add a flag for that new field */
+#define KVM_SREGS_S_HIOR   (1 << 0)
+
+/*
  * In KVM_SET_SREGS, reserved/pad fields must be left untouched from a
  * previous KVM_GET_REGS.
  *
@@ -170,9 +176,11 @@ struct kvm_sregs {
} ppc64;
struct {
__u32 sr[16];
-   __u64 ibat[8];
-   __u64 dbat[8];
+   __u64 ibat[8]; 
+   __u64 dbat[8]; 
} ppc32;
+   __u64 flags; /* KVM_SREGS_S_ */
+   __u64 hior;
} s;
struct {
union {
@@ -292,41 +300,4 @@ struct kvm_allocate_rma {
__u64 rma_size;
 };
 
-struct kvm_book3e_206_tlb_entry {
-   __u32 mas8;
-   __u32 mas1;
-   __u64 mas2;
-   __u64 mas7_3;
-};
-
-struct kvm_book3e_206_tlb_params {
-   /*
-* For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
-*
-* - The number of ways of TLB0 must be a power of two between 2 and
-*   16.
-* - TLB1 must be fully associative.
-* - The size of TLB0 must be a multiple of the number of ways, and
-*   the number of sets must be a power of two.
-* - The size of TLB1 may not exceed 64 entries.
-* - TLB0 supports 4 KiB pages.
-* - The page sizes supported by TLB1 are as indicated by
-*   TLB1CFG (if MMUCFG[MAVN] = 0) or TLB1PS (if MMUCFG[MAVN] = 1)
-*   as returned by KVM_GET_SREGS.
-* - TLB2 and TLB3 are reserved, and their entries in tlb_sizes[]
-*   and tlb_ways[] must be zero.
-*
-* tlb_ways[n] = tlb_sizes[n] means the array is fully associative.
-*
-* KVM will adjust TLBnCFG based on the sizes configured here,
-* though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
-* set to zero.
-*/
-   __u32 tlb_sizes[4];
-   __u32 tlb_ways[4];
-   __u32 reserved[8];
-};
-
-#define KVM_ONE_REG_PPC_HIOR   KVM_ONE_REG_PPC | 0x100
-
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/linux-headers/asm-x86/hyperv.h b/linux-headers/asm-x86/hyperv.h
index 5df477a..b80420b 100644
--- a/linux-headers/asm-x86/hyperv.h
+++ b/linux-headers/asm-x86/hyperv.h
@@ -189,5 +189,6 @@
 #define HV_STATUS_INVALID_HYPERCALL_CODE   2
 #define HV_STATUS_INVALID_HYPERCALL_INPUT  3
 #define HV_STATUS_INVALID_ALIGNMENT4
+#define HV_STATUS_INSUFFICIENT_BUFFERS 19
 
 #endif
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index a8761d3..07bd557 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -371,6 +371,7 @@ struct kvm_s390_psw {
 #define KVM_S390_INT_VIRTIO0x2603u
 #define KVM_S390_INT_SERVICE   0x2401u
 #define KVM_S390_INT_EMERGENCY 0x1201u
+#define KVM_S390_INT_EXTERNAL_CALL 0x1202u
 
 struct kvm_s390_interrupt {
__u32 type;
@@ -556,8 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66   /* returns max vcpus per vm */
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
-#define KVM_CAP_SW_TLB 69
-#define KVM_CAP_ONE_REG 70
+#define KVM_CAP_S390_GMAP 71
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -637,49 +637,6 @@ struct kvm_clock_data {
__u32 pad[9];
 };
 
-#define KVM_MMU_FSL_BOOKE_NOHV 0
-#define KVM_MMU_FSL_BOOKE_HV   1
-
-struct kvm_config_tlb {
-   __u64 params;
-   __u64 array;
-   __u32 mmu_type;
-   __u32 array_len;
-};
-
-struct kvm_dirty_tlb {
-   __u64 bitmap;
-   __u32 num_dirty;
-};
-
-/* Available with KVM_CAP_ONE_REG */
-
-#define KVM_ONE_REG_GENERIC0xULL
-
-/*
- * Architecture specific registers are to be defined in arch headers and
- * ORed with the arch identifier.
- */
-#define KVM_ONE_REG_PPC0x1000ULL
-#define KVM_ONE_REG_X860x2000ULL
-#define KVM_ONE_REG_IA64   0x3000ULL
-#define KVM_ONE_REG_ARM0x4000ULL
-#define KVM_ONE_REG_S390   0x5000ULL
-
-struct kvm_one_reg {
-   __u64 id;
-   union {
-   __u8 reg8;
-   __u16 reg16;
-   __

[PATCH 0/3] QEMU kvm: Adding KICK_VCPU capability to i386 kvm

2011-12-04 Thread Raghavendra K T

From: Raghavendra K T 

Three patch series following this, extends KVM-hypervisor
and Linux guest running on KVM-hypervisor to support pv-ticket spinlocks.

PV ticket spinlock helps to solve Lock Holder Preemption problem discussed in
http://www.amd64.org/fileadmin/user_upload/pub/LHP-commented_slides.pdf.

When spinlock is contended,a guest vcpu relinqueshes cpu by halt().
Correspondingly, One hypercall is introduced in KVM hypervisor,that allows
a vcpu to kick the halted vcpu to continue with execution.

The series will : 
- Update qemu with latest linux header files (to 3.2.0-rc1).
- Enable KICK_VCPU capability in kvm/i386.

Raghavendra K T(3):
  Sync the linux headers to 3.2.0-rc1
  Sync the linux headers to patched linux kernel with 
KICK_VCPU capability.
  Add KICK_VCPU support in i386 target

---
The corresponding kernel patch is available in the thread
https://lkml.org/lkml/2011/11/30/62

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 19:39 +0200, Sasha Levin wrote:
> On Sun, 2011-12-04 at 19:37 +0200, Avi Kivity wrote:
> > On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > > > 
> > > > I'm confused. didn't you see a bigger benefit for guest->host by
> > > > switching indirect off?
> > >
> > > The 5% improvement is over the 'regular' indirect on, not over indirect
> > > off. Sorry for the confusion there.
> > >
> > > I suggested this change regardless of the outcome of indirect descriptor
> > > threshold discussion, since it would help anyways.
> > 
> > For net, this makes sense.  For block, it reduces the effective queue
> > depth, so it's not a trivial change.  It probably makes sense there too,
> > though.
> 
> It doesn't have to be limited at that number, anything above that can go
> through the regular kmalloc() path.

Something like the following patch:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..3166ca0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -82,6 +82,7 @@ struct vring_virtqueue
 
/* Host supports indirect buffers */
bool indirect;
+   struct kmem_cache *indirect_cache;
 
/* Host publishes avail event idx */
bool event;
@@ -110,6 +111,9 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static unsigned int ind_alloc_thresh = 0;
+module_param(ind_alloc_thresh, uint, S_IRUGO);
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
  struct scatterlist sg[],
@@ -121,7 +125,10 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
unsigned head;
int i;
 
-   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+   if ((out + in) <= ind_alloc_thresh)
+   desc = kmem_cache_alloc(vq->indirect_cache, gfp);
+   else
+   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
if (!desc)
return -ENOMEM;
 
@@ -479,6 +486,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
vq->broken = false;
vq->last_used_idx = 0;
vq->num_added = 0;
+   if (ind_alloc_thresh)
+   vq->indirect_cache = KMEM_CACHE(vring_desc[ind_alloc_thresh], 
0);
list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
vq->in_use = false;

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2011-12-04 Thread Raghavendra K T


On 12/02/2011 01:20 AM, Raghavendra K T wrote:

Have you tested it on AMD machines? There are some differences in the
hypercall infrastructure there.


Yes. 'll test on AMD machine and update on that.



I tested the code on 64 bit Dual-Core AMD Opteron machine, and it is 
working.


- Raghu

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 19:37 +0200, Avi Kivity wrote:
> On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > > 
> > > I'm confused. didn't you see a bigger benefit for guest->host by
> > > switching indirect off?
> >
> > The 5% improvement is over the 'regular' indirect on, not over indirect
> > off. Sorry for the confusion there.
> >
> > I suggested this change regardless of the outcome of indirect descriptor
> > threshold discussion, since it would help anyways.
> 
> For net, this makes sense.  For block, it reduces the effective queue
> depth, so it's not a trivial change.  It probably makes sense there too,
> though.

It doesn't have to be limited at that number, anything above that can go
through the regular kmalloc() path.

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] KVM: Correct documentation of KVM_GET_SUPPORTED_CPUID

2011-12-04 Thread Sasha Levin

Avi,

This is the other part of the get_supported_cpuid change. We discussed
it over IRC and you said it looks right.

On Thu, 2011-11-17 at 12:18 +0200, Sasha Levin wrote:
> If the amount of entries available passed to KVM_GET_SUPPORTED_CPUID is
> too big we don't fail, we just adjust it to the amount actually needed
> and fill the entries.
> 
> Cc: Avi Kivity 
> Cc: Marcelo Tosatti 
> Signed-off-by: Sasha Levin 
> ---
>  Documentation/virtual/kvm/api.txt |7 +++
>  1 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 7945b0b..273be09 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1074,10 +1074,9 @@ or for feature consistency across a cluster).
>  Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
>  with the 'nent' field indicating the number of entries in the variable-size
>  array 'entries'.  If the number of entries is too low to describe the cpu
> -capabilities, an error (E2BIG) is returned.  If the number is too high,
> -the 'nent' field is adjusted and an error (ENOMEM) is returned.  If the
> -number is just right, the 'nent' field is adjusted to the number of valid
> -entries in the 'entries' array, which is then filled.
> +capabilities, an error (E2BIG) is returned.  If the number is above or just,
> +right, the 'nent' field is adjusted to the number of valid entries in the
> +'entries' array, which is then filled.
>  
>  The entries returned are the host cpuid as returned by the cpuid instruction,
>  with unknown or unsupported features masked out.  Some features (for example,

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Avi Kivity

On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > 
> > I'm confused. didn't you see a bigger benefit for guest->host by
> > switching indirect off?
>
> The 5% improvement is over the 'regular' indirect on, not over indirect
> off. Sorry for the confusion there.
>
> I suggested this change regardless of the outcome of indirect descriptor
> threshold discussion, since it would help anyways.

For net, this makes sense.  For block, it reduces the effective queue
depth, so it's not a trivial change.  It probably makes sense there too,
though.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH rebased 2/2] KVM: Use memdup_user instead of kmalloc/copy_from_user

2011-12-04 Thread Sasha Levin

Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.

Cc: Avi Kivity 
Cc: Marcelo Tosatti 
Signed-off-by: Sasha Levin 
---
 arch/x86/kvm/x86.c  |   82 +-
 virt/kvm/kvm_main.c |   29 +++--
 2 files changed, 47 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3daec74..eeeaf2e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1309,12 +1309,11 @@ static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 
data)
if (page_num >= blob_size)
goto out;
r = -ENOMEM;
-   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!page)
+   page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
+   if (IS_ERR(page)) {
+   r = PTR_ERR(page);
goto out;
-   r = -EFAULT;
-   if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE))
-   goto out_free;
+   }
if (kvm_write_guest(kvm, page_addr, page, PAGE_SIZE))
goto out_free;
r = 0;
@@ -1988,15 +1987,12 @@ static int msr_io(struct kvm_vcpu *vcpu, struct 
kvm_msrs __user *user_msrs,
if (msrs.nmsrs >= MAX_IO_MSRS)
goto out;
 
-   r = -ENOMEM;
size = sizeof(struct kvm_msr_entry) * msrs.nmsrs;
-   entries = kmalloc(size, GFP_KERNEL);
-   if (!entries)
+   entries = memdup_user(user_msrs->entries, size);
+   if (IS_ERR(entries)) {
+   r = PTR_ERR(entries);
goto out;
-
-   r = -EFAULT;
-   if (copy_from_user(entries, user_msrs->entries, size))
-   goto out_free;
+   }
 
r = n = __msr_io(vcpu, &msrs, entries, do_msr);
if (r < 0)
@@ -2530,13 +2526,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = -EINVAL;
if (!vcpu->arch.apic)
goto out;
-   u.lapic = kmalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.lapic)
-   goto out;
-   r = -EFAULT;
-   if (copy_from_user(u.lapic, argp, sizeof(struct 
kvm_lapic_state)))
+   u.lapic = memdup_user(argp, sizeof(*u.lapic));
+   if (IS_ERR(u.lapic)) {
+   r = PTR_ERR(u.lapic);
goto out;
+   }
+
r = kvm_vcpu_ioctl_set_lapic(vcpu, u.lapic);
if (r)
goto out;
@@ -2715,14 +2710,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XSAVE: {
-   u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.xsave)
-   break;
-
-   r = -EFAULT;
-   if (copy_from_user(u.xsave, argp, sizeof(struct kvm_xsave)))
-   break;
+   u.xsave = memdup_user(argp, sizeof(*u.xsave));
+   if (IS_ERR(u.xsave)) {
+   r = PTR_ERR(u.xsave);
+   goto out;
+   }
 
r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
break;
@@ -2743,15 +2735,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XCRS: {
-   u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.xcrs)
-   break;
-
-   r = -EFAULT;
-   if (copy_from_user(u.xcrs, argp,
-  sizeof(struct kvm_xcrs)))
-   break;
+   u.xcrs = memdup_user(argp, sizeof(*u.xcrs));
+   if (IS_ERR(u.xcrs)) {
+   r = PTR_ERR(u.xcrs);
+   goto out;
+   }
 
r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);
break;
@@ -3187,14 +3175,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
}
case KVM_GET_IRQCHIP: {
/* 0: PIC master, 1: PIC slave, 2: IOAPIC */
-   struct kvm_irqchip *chip = kmalloc(sizeof(*chip), GFP_KERNEL);
+   struct kvm_irqchip *chip;
 
-   r = -ENOMEM;
-   if (!chip)
+   chip = memdup_user(argp, sizeof(*chip));
+   if (IS_ERR(chip)) {
+   r = PTR_ERR(chip);
goto out;
-   r = -EFAULT;
-   if (copy_from_user(chip, argp, sizeof *chip))
-   goto get_irqchip_out;
+   }
+
r = -ENXIO;
if (!irqchip_in_kernel(kvm))
goto get_irqchip_out;
@@ -3213,14 +3201,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
}
case KVM_SET_IRQCHIP: {
/* 0: PIC master, 1: PIC slave, 2:

[PATCH rebased 1/2] KVM: Use kmemdup() instead of kmalloc/memcpy

2011-12-04 Thread Sasha Levin

Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Cc: Avi Kivity 
Cc: Marcelo Tosatti 
Signed-off-by: Sasha Levin 
---
 arch/x86/kvm/x86.c  |4 ++--
 virt/kvm/kvm_main.c |7 +++
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23c93fe..3daec74 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3028,10 +3028,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
memset(dirty_bitmap_head, 0, n);
 
r = -ENOMEM;
-   slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+   slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), 
GFP_KERNEL);
if (!slots)
goto out;
-   memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
+
memslot = id_to_memslot(slots, log->slot);
memslot->nr_dirty_pages = 0;
memslot->dirty_bitmap = dirty_bitmap_head;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e289486..a6e612f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2591,13 +2591,12 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
int i, r;
struct kvm_io_bus *new_bus, *bus;
 
-   new_bus = kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL);
+   bus = kvm->buses[bus_idx];
+
+   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
 
-   bus = kvm->buses[bus_idx];
-   memcpy(new_bus, bus, sizeof(struct kvm_io_bus));
-
r = -ENOENT;
for (i = 0; i < new_bus->dev_count; i++)
if (new_bus->range[i].dev == dev) {
-- 
1.7.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 18:22 +0200, Michael S. Tsirkin wrote:
> On Sun, Dec 04, 2011 at 02:13:51PM +0200, Sasha Levin wrote:
> > On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> > > On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > > > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" 
> > > > >  wrote:
> > > > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > > > We'll presumably need some logic to increment is back,
> > > > > > > > to account for random workload changes.
> > > > > > > > Something like slow start?
> > > > > > > 
> > > > > > > We can increment it each time the queue was less than 10% full, it
> > > > > > > should act like slow start, no?
> > > > > > 
> > > > > > No, we really shouldn't get an empty ring as long as things behave
> > > > > > well. What I meant is something like:
> > > > > 
> > > > > I was thinking of the network output case, but you're right.  We need 
> > > > > to
> > > > > distinguish between usually full (eg. virtio-net input) and usually
> > > > > empty (eg. virtio-net output).
> > > > > 
> > > > > The signal for "we to pack more into the ring" is different.  We could
> > > > > use some hacky heuristic like "out == 0" but I'd rather make it 
> > > > > explicit
> > > > > when we set up the virtqueue.
> > > > > 
> > > > > Our other alternative, moving the logic to the driver, is worse.
> > > > > 
> > > > > As to fading the effect over time, that's harder.  We have to deplete
> > > > > the ring quite a few times before it turns into always-indirect.  We
> > > > > could back off every time the ring is totally idle, but that may hurt
> > > > > bursty traffic.  Let's try simple first?
> > > >
> > > > I tried to take a different approach, and tried putting the indirect
> > > > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > > > that this way virtio-net actually worked faster with indirect on even in
> > > > a single stream.
> > > 
> > > How much better?
> > 
> > host->guest was same with both indirect on and off, and guest->host went
> > up by 5% with indirect on.
> > 
> > This was just a simple 1 TCP stream test.
> 
> I'm confused. didn't you see a bigger benefit for guest->host by
> switching indirect off?

The 5% improvement is over the 'regular' indirect on, not over indirect
off. Sorry for the confusion there.

I suggested this change regardless of the outcome of indirect descriptor
threshold discussion, since it would help anyways.

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Veirfy memory slot only for readability

2011-12-04 Thread Avi Kivity

On 12/04/2011 07:29 PM, Sasha Levin wrote:
> On Sun, 2011-12-04 at 18:53 +0200, Avi Kivity wrote:
> > On 12/02/2011 07:46 AM, Sasha Levin wrote:
> > > > Do you want to create read only memory slots for kvm tool?
> > >
> > > What KVM tool currently does is copy the kernel into guest memory and
> > > run it from there. An idea raised recently was instead of copying it we
> > > should mmap it into the memory to reduce footprint.
> > >
> > > This is why I'm looking into adding a read only memory slot. The KVM
> > > code doesn't have to know it's read only.
> > 
> > The kernel will patch itself very early.  You need to use MAP_PRIVATE
> > (and thus have a read/write area).  It will be interesting to see what
> > fraction of the memory is modified.
> > 
> > Note that mapping will remove benefits like huge page support, and that
> > you can get page sharing by using ksm.  Still, it's interesting to see
> > where this goes.
>
> Why would I lose hugepage if the kernel gets it's own memory slot?

(transparent) hugepages only work on anonymous memory.  Hopefully later
it will be extended to work on mapped memory as well.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Veirfy memory slot only for readability

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 18:53 +0200, Avi Kivity wrote:
> On 12/02/2011 07:46 AM, Sasha Levin wrote:
> > > Do you want to create read only memory slots for kvm tool?
> >
> > What KVM tool currently does is copy the kernel into guest memory and
> > run it from there. An idea raised recently was instead of copying it we
> > should mmap it into the memory to reduce footprint.
> >
> > This is why I'm looking into adding a read only memory slot. The KVM
> > code doesn't have to know it's read only.
> 
> The kernel will patch itself very early.  You need to use MAP_PRIVATE
> (and thus have a read/write area).  It will be interesting to see what
> fraction of the memory is modified.
> 
> Note that mapping will remove benefits like huge page support, and that
> you can get page sharing by using ksm.  Still, it's interesting to see
> where this goes.

Why would I lose hugepage if the kernel gets it's own memory slot?

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/2] KVM: Use kmemdup() instead of kmalloc/memcpy

2011-12-04 Thread Avi Kivity

On 11/27/2011 02:43 PM, Sasha Levin wrote:
> Switch to kmemdup() in two places to shorten the code and avoid possible bugs.
>

Doesn't apply, please rebase.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Veirfy memory slot only for readability

2011-12-04 Thread Avi Kivity

On 12/02/2011 07:46 AM, Sasha Levin wrote:
> > Do you want to create read only memory slots for kvm tool?
>
> What KVM tool currently does is copy the kernel into guest memory and
> run it from there. An idea raised recently was instead of copying it we
> should mmap it into the memory to reduce footprint.
>
> This is why I'm looking into adding a read only memory slot. The KVM
> code doesn't have to know it's read only.

The kernel will patch itself very early.  You need to use MAP_PRIVATE
(and thus have a read/write area).  It will be interesting to see what
fraction of the memory is modified.

Note that mapping will remove benefits like huge page support, and that
you can get page sharing by using ksm.  Still, it's interesting to see
where this goes.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86: Remove set-but-unused cr4 from check_cr_write

2011-12-04 Thread Avi Kivity

On 12/02/2011 07:26 PM, Jan Kiszka wrote:
> This was probably copy&pasted from the cr0 case, but it's unneeded here.
>
>

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages

2011-12-04 Thread Avi Kivity

On 12/02/2011 07:35 PM, Jan Kiszka wrote:
> freed_pages is never evaluated, so remove it as well as the return code
> kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user.
>

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Avi Kivity

On 12/04/2011 05:19 PM, Jan Kiszka wrote:
> > 
> > In the sense that kernel-apic is just an accelerated apic.  From the
> > guest point of view, there's no difference, and that should be reflected
> > in the device model.
>
> That was my goal as well: The guest should not notice the difference,
> but the admin on the host side should still be able to tell both
> internally fairly different models apart. 

This should be some attribute, not the name.

> Plus the code should be
> clearly split where there are differences and explicitly shared where
> there aren't.

That's a good goal, yes.

>
> > 
> > If I'm reading an apic register, either from the guest or via a monitor
> > debug interface, I shouldn't care whether it's accelerated or not.  The
> > guest part already holds, of course.
>
> Specifically for the debug scenario, I'd prefer the clear
> differentiation by name as there can always remain subtle differences in
> the implementation of kernel vs. user space. Someone debugging the guest
> and/or qemu/kvm should remain aware of this.

Aware, yes, but the name change is too drastic.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Avi Kivity

On 12/04/2011 06:00 PM, Michael S. Tsirkin wrote:
> >  If you
> > copy descriptors, then it goes away.
>
> The avail ring could go away. used could if we make descriptors
> writeable. IIUC it was made RO in the hope that will make it
> easier for xen to adopt. Still relevant?

You mean RO from the consumer side?  Why can't Xen do that?

> > That does suck.  Are there issues in increasing the ring size?  Or
> > making it discontiguous?
>
> discontiguous ring is what indirect is, basically.

No, discontiguous is more cache and prefetch friendly.  With vmap(), the
code doesn't even change.


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sun, Dec 04, 2011 at 02:13:51PM +0200, Sasha Levin wrote:
> On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> > On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" 
> > > >  wrote:
> > > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > > We'll presumably need some logic to increment is back,
> > > > > > > to account for random workload changes.
> > > > > > > Something like slow start?
> > > > > > 
> > > > > > We can increment it each time the queue was less than 10% full, it
> > > > > > should act like slow start, no?
> > > > > 
> > > > > No, we really shouldn't get an empty ring as long as things behave
> > > > > well. What I meant is something like:
> > > > 
> > > > I was thinking of the network output case, but you're right.  We need to
> > > > distinguish between usually full (eg. virtio-net input) and usually
> > > > empty (eg. virtio-net output).
> > > > 
> > > > The signal for "we to pack more into the ring" is different.  We could
> > > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > > when we set up the virtqueue.
> > > > 
> > > > Our other alternative, moving the logic to the driver, is worse.
> > > > 
> > > > As to fading the effect over time, that's harder.  We have to deplete
> > > > the ring quite a few times before it turns into always-indirect.  We
> > > > could back off every time the ring is totally idle, but that may hurt
> > > > bursty traffic.  Let's try simple first?
> > >
> > > I tried to take a different approach, and tried putting the indirect
> > > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > > that this way virtio-net actually worked faster with indirect on even in
> > > a single stream.
> > 
> > How much better?
> 
> host->guest was same with both indirect on and off, and guest->host went
> up by 5% with indirect on.
> 
> This was just a simple 1 TCP stream test.

I'm confused. didn't you see a bigger benefit for guest->host by
switching indirect off?

> > 
> > I think that if indirects benefit networking, then we're doing something
> > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > should expand it.
> > 
> 
> -- 
> 
> Sasha.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sun, Dec 04, 2011 at 05:16:59PM +0200, Avi Kivity wrote:
> On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > There's also the used ring, but that's a
> > > mistake if you have out of order completion.  We should have used copying.
> >
> > Seems unrelated... unless you want used to be written into
> > descriptor ring itself?
> 
> The avail/used rings are in addition to the regular ring, no?

Yes. A couple of extra pages, if we reduce alignment we could pack this
in a single extra page.

>  If you
> copy descriptors, then it goes away.

The avail ring could go away. used could if we make descriptors
writeable. IIUC it was made RO in the hope that will make it
easier for xen to adopt. Still relevant?

> > But, I don't really know why does virtio ring insist on
> > making the 3 buffers (avail/used/descriptor)
> > physically contigious. Rusty?
> 
> Let's drop them instead.
> 
> >
> > > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > > just 1.5 ms.  In any case I think it's sufficient.
> >
> > Right. So I think that without indirect, we waste about 3 entries
> > per packet for virtio header and transport etc headers.
> 
> That does suck.  Are there issues in increasing the ring size?  Or
> making it discontiguous?

discontiguous ring is what indirect is, basically.

> Can you take a peek at how Xen manages its rings?  They have the same
> problems we do.
> 
> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

On 2011-12-04 16:12, Avi Kivity wrote:
> On 12/04/2011 04:06 PM, Jan Kiszka wrote:
>> On 2011-12-04 15:04, Avi Kivity wrote:
>>> On 12/04/2011 03:51 PM, Jan Kiszka wrote:
>
> But the name becomes part of the save/restore ABI, so you can't.

 Nope, the vmstate names are identical. That would ruin migration
 otherwise. It's just the output of info qtree & co. that changes.
>>>
>>> Oh, okay.  I still think it's wrong, but now it's just a matter of
>>> taste, and I can live with it.
>>
>> Wrong in what sense?
> 
> In the sense that kernel-apic is just an accelerated apic.  From the
> guest point of view, there's no difference, and that should be reflected
> in the device model.

That was my goal as well: The guest should not notice the difference,
but the admin on the host side should still be able to tell both
internally fairly different models apart. Plus the code should be
clearly split where there are differences and explicitly shared where
there aren't.

> 
> If I'm reading an apic register, either from the guest or via a monitor
> debug interface, I shouldn't care whether it's accelerated or not.  The
> guest part already holds, of course.

Specifically for the debug scenario, I'd prefer the clear
differentiation by name as there can always remain subtle differences in
the implementation of kernel vs. user space. Someone debugging the guest
and/or qemu/kvm should remain aware of this.

Jan

signature.asc
Description: OpenPGP digital signature

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Avi Kivity

On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > There's also the used ring, but that's a
> > mistake if you have out of order completion.  We should have used copying.
>
> Seems unrelated... unless you want used to be written into
> descriptor ring itself?

The avail/used rings are in addition to the regular ring, no?  If you
copy descriptors, then it goes away.

> But, I don't really know why does virtio ring insist on
> making the 3 buffers (avail/used/descriptor)
> physically contigious. Rusty?

Let's drop them instead.

>
> > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > just 1.5 ms.  In any case I think it's sufficient.
>
> Right. So I think that without indirect, we waste about 3 entries
> per packet for virtio header and transport etc headers.

That does suck.  Are there issues in increasing the ring size?  Or
making it discontiguous?

Can you take a peek at how Xen manages its rings?  They have the same
problems we do.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sun, Dec 04, 2011 at 01:06:37PM +0200, Michael S. Tsirkin wrote:
> On Sat, Dec 03, 2011 at 01:50:28PM +0200, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> > 
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> > 
> > Maybe we can do that instead of playing with threshold for now.
> > 
> > The question here, how much wasted space we can afford? since indirect
> > descriptors would have to be the same size we'd have a bunch of them
> > wasted in the cache. Ofcourse we can make that configurable, but how
> > much is ok by default?
> 
> I think it's a good idea to make that per-device.
> For network at least, each skb already has overhead of
> around 1/2 K, so using up to 1/2K more seems acceptable.
> But even if we went up to MAX_SKB_FRAGS+2, it would be
> only 1K per ring entry,

I got this wrong - descriptor is 16 bytes, so MAX_SKB_FRAGS+2
descriptors would be less than 300 bytes overhead per packet.
That's not a lot.

> so for a ring of 256 entries, we end up with
> 256K max waste. That's not that terrible.
> 
> But I'd say let's do some benchmarking to figure out
> the point where the gains are becoming very small.


> > -- 
> > 
> > Sasha.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Avi Kivity

On 12/04/2011 04:06 PM, Jan Kiszka wrote:
> On 2011-12-04 15:04, Avi Kivity wrote:
> > On 12/04/2011 03:51 PM, Jan Kiszka wrote:
> >>>
> >>> But the name becomes part of the save/restore ABI, so you can't.
> >>
> >> Nope, the vmstate names are identical. That would ruin migration
> >> otherwise. It's just the output of info qtree & co. that changes.
> > 
> > Oh, okay.  I still think it's wrong, but now it's just a matter of
> > taste, and I can live with it.
>
> Wrong in what sense?

In the sense that kernel-apic is just an accelerated apic.  From the
guest point of view, there's no difference, and that should be reflected
in the device model.

If I'm reading an apic register, either from the guest or via a monitor
debug interface, I shouldn't care whether it's accelerated or not.  The
guest part already holds, of course.

> I think the way of merging kvm support into the user space models in
> qemu-kvm is not particularly beautiful. But that's my taste, and
> therefore I modeled the upstream proposal differently. :)

Oh, qemu-kvm was not meant to be an example of engineering elegance,
just minimal changes.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sun, Dec 04, 2011 at 02:06:34PM +0200, Avi Kivity wrote:
> On 12/04/2011 02:01 PM, Michael S. Tsirkin wrote:
> > > 
> > > How much better?
> > > 
> > > I think that if indirects benefit networking, then we're doing something
> > > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > > should expand it.
> >
> > The ring is physically contigious.
> > With 256 entries and 64 bytes each, that's already 16K.
> 
> A descriptor is just 16 bytes.

Right. Not sure where did I get 64.

> There's also the used ring, but that's a
> mistake if you have out of order completion.  We should have used copying.

Seems unrelated... unless you want used to be written into
descriptor ring itself?
But, I don't really know why does virtio ring insist on
making the 3 buffers (avail/used/descriptor)
physically contigious. Rusty?

> 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> just 1.5 ms.  In any case I think it's sufficient.

Right. So I think that without indirect, we waste about 3 entries
per packet for virtio header and transport etc headers.

> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 04/16] apic: Factor out core for KVM reuse

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

The KVM in-kernel APIC model will reuse parts of the user space model,
namely the vmstate, reset handling, IRQ coalescing tracker, some init
steps and the base and tpr set/get routines. For the latter, we also
prepare set callbacks as KVM will override those.

Signed-off-by: Jan Kiszka 
---
 Makefile.target|2 +-
 hw/apic.c  |  260 +++-
 hw/apic_common.c   |  215 +++
 hw/apic_internal.h |  108 ++
 trace-events   |2 +-
 5 files changed, 339 insertions(+), 248 deletions(-)
 create mode 100644 hw/apic_common.c
 create mode 100644 hw/apic_internal.h

diff --git a/Makefile.target b/Makefile.target
index 3a9e95d..7bb6b13 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -226,7 +226,7 @@ obj-$(CONFIG_IVSHMEM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o pc.o
-obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
diff --git a/hw/apic.c b/hw/apic.c
index 2644a82..27b18d6 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -16,53 +16,13 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see 
  */
-#include "hw.h"
+#include "apic_internal.h"
 #include "apic.h"
 #include "ioapic.h"
-#include "qemu-timer.h"
 #include "host-utils.h"
-#include "sysbus.h"
 #include "trace.h"
 #include "pc.h"
 
-/* APIC Local Vector Table */
-#define APIC_LVT_TIMER   0
-#define APIC_LVT_THERMAL 1
-#define APIC_LVT_PERFORM 2
-#define APIC_LVT_LINT0   3
-#define APIC_LVT_LINT1   4
-#define APIC_LVT_ERROR   5
-#define APIC_LVT_NB  6
-
-/* APIC delivery modes */
-#define APIC_DM_FIXED  0
-#define APIC_DM_LOWPRI 1
-#define APIC_DM_SMI2
-#define APIC_DM_NMI4
-#define APIC_DM_INIT   5
-#define APIC_DM_SIPI   6
-#define APIC_DM_EXTINT 7
-
-/* APIC destination mode */
-#define APIC_DESTMODE_FLAT 0xf
-#define APIC_DESTMODE_CLUSTER  1
-
-#define APIC_TRIGGER_EDGE  0
-#define APIC_TRIGGER_LEVEL 1
-
-#defineAPIC_LVT_TIMER_PERIODIC (1<<17)
-#defineAPIC_LVT_MASKED (1<<16)
-#defineAPIC_LVT_LEVEL_TRIGGER  (1<<15)
-#defineAPIC_LVT_REMOTE_IRR (1<<14)
-#defineAPIC_INPUT_POLARITY (1<<13)
-#defineAPIC_SEND_PENDING   (1<<12)
-
-#define ESR_ILLEGAL_ADDRESS (1 << 7)
-
-#define APIC_SV_DIRECTED_IO (1<<12)
-#define APIC_SV_ENABLE  (1<<8)
-
-#define MAX_APICS 255
 #define MAX_APIC_WORDS 8
 
 /* Intel APIC constants: from include/asm/msidef.h */
@@ -75,40 +35,7 @@
 #define MSI_ADDR_DEST_ID_SHIFT 12
 #defineMSI_ADDR_DEST_ID_MASK   0x000
 
-#define MSI_ADDR_SIZE   0x10
-
-typedef struct APICState APICState;
-
-struct APICState {
-SysBusDevice busdev;
-MemoryRegion io_memory;
-void *cpu_env;
-uint32_t apicbase;
-uint8_t id;
-uint8_t arb_id;
-uint8_t tpr;
-uint32_t spurious_vec;
-uint8_t log_dest;
-uint8_t dest_mode;
-uint32_t isr[8];  /* in service register */
-uint32_t tmr[8];  /* trigger mode register */
-uint32_t irr[8]; /* interrupt request register */
-uint32_t lvt[APIC_LVT_NB];
-uint32_t esr; /* error register */
-uint32_t icr[2];
-
-uint32_t divide_conf;
-int count_shift;
-uint32_t initial_count;
-int64_t initial_count_load_time, next_time;
-uint32_t idx;
-QEMUTimer *timer;
-int sipi_vector;
-int wait_for_sipi;
-};
-
 static APICState *local_apics[MAX_APICS + 1];
-static int apic_irq_delivered;
 
 static void apic_set_irq(APICState *s, int vector_num, int trigger_mode);
 static void apic_update_irq(APICState *s);
@@ -293,14 +220,8 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, 
uint8_t delivery_mode,
 apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num, trigger_mode);
 }
 
-void cpu_set_apic_base(DeviceState *d, uint64_t val)
+static void apic_set_base(APICState *s, uint64_t val)
 {
-APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
-
-trace_cpu_set_apic_base(val);
-
-if (!s)
-return;
 s->apicbase = (val & 0xf000) |
 (s->apicbase & (MSR_IA32_APICBASE_BSP | MSR_IA32_APICBASE_ENABLE));
 /* if disabled, cannot be enabled again */
@@ -311,32 +232,12 @@ void cpu_set_apic_base(DeviceState *d, uint64_t val)
 }
 }
 
-uint64_t cpu_get_apic_base(DeviceState *d)
-{
-APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
-
-trace_cpu_get_apic_base(s ? (uint64_t)s->apicbase: 0);
-
-return s ? s->apicbase : 0;
-}
-
-void cpu_set_apic_tpr(DeviceState *d, uint8_t val)
+static void apic_set_tpr(APICState *s, uint8_t val)
 {
-

[PATCH v2 08/16] ioapic: Reject non-dword accesses to IOWIN register

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Aligns the model with the spec.

Signed-off-by: Jan Kiszka 
---
 hw/ioapic.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/hw/ioapic.c b/hw/ioapic.c
index 56b1612..eb75766 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -208,6 +208,9 @@ ioapic_mem_read(void *opaque, target_phys_addr_t addr, 
unsigned int size)
 val = s->ioregsel;
 break;
 case IOAPIC_IOWIN:
+if (size != 4) {
+break;
+}
 switch (s->ioregsel) {
 case IOAPIC_REG_ID:
 val = s->id << IOAPIC_ID_SHIFT;
@@ -247,6 +250,9 @@ ioapic_mem_write(void *opaque, target_phys_addr_t addr, 
uint64_t val,
 s->ioregsel = val;
 break;
 case IOAPIC_IOWIN:
+if (size != 4) {
+break;
+}
 DPRINTF("write: %08x = %08x\n", s->ioregsel, val);
 switch (s->ioregsel) {
 case IOAPIC_REG_ID:
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 09/16] ioapic: Factor out core for KVM reuse

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

KVM will share the IOAPICState, the vmstate, the reset logic and certain
init parts with the user space model.

Signed-off-by: Jan Kiszka 
---
 Makefile.target  |2 +-
 hw/ioapic.c  |  108 -
 hw/ioapic_common.c   |   89 +
 hw/ioapic_internal.h |   93 +++
 4 files changed, 192 insertions(+), 100 deletions(-)
 create mode 100644 hw/ioapic_common.c
 create mode 100644 hw/ioapic_internal.h

diff --git a/Makefile.target b/Makefile.target
index 7bb6b13..4cd3c0e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -226,7 +226,7 @@ obj-$(CONFIG_IVSHMEM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o pc.o
-obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic_common.o ioapic.o 
piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
diff --git a/hw/ioapic.c b/hw/ioapic.c
index eb75766..8876d5d 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -24,9 +24,7 @@
 #include "pc.h"
 #include "apic.h"
 #include "ioapic.h"
-#include "qemu-timer.h"
-#include "host-utils.h"
-#include "sysbus.h"
+#include "ioapic_internal.h"
 
 //#define DEBUG_IOAPIC
 
@@ -37,62 +35,6 @@
 #define DPRINTF(fmt, ...)
 #endif
 
-#define MAX_IOAPICS 1
-
-#define IOAPIC_VERSION  0x11
-
-#define IOAPIC_LVT_DEST_SHIFT   56
-#define IOAPIC_LVT_MASKED_SHIFT 16
-#define IOAPIC_LVT_TRIGGER_MODE_SHIFT   15
-#define IOAPIC_LVT_REMOTE_IRR_SHIFT 14
-#define IOAPIC_LVT_POLARITY_SHIFT   13
-#define IOAPIC_LVT_DELIV_STATUS_SHIFT   12
-#define IOAPIC_LVT_DEST_MODE_SHIFT  11
-#define IOAPIC_LVT_DELIV_MODE_SHIFT 8
-
-#define IOAPIC_LVT_MASKED   (1 << IOAPIC_LVT_MASKED_SHIFT)
-#define IOAPIC_LVT_REMOTE_IRR   (1 << IOAPIC_LVT_REMOTE_IRR_SHIFT)
-
-#define IOAPIC_TRIGGER_EDGE 0
-#define IOAPIC_TRIGGER_LEVEL1
-
-/*io{apic,sapic} delivery mode*/
-#define IOAPIC_DM_FIXED 0x0
-#define IOAPIC_DM_LOWEST_PRIORITY   0x1
-#define IOAPIC_DM_PMI   0x2
-#define IOAPIC_DM_NMI   0x4
-#define IOAPIC_DM_INIT  0x5
-#define IOAPIC_DM_SIPI  0x6
-#define IOAPIC_DM_EXTINT0x7
-#define IOAPIC_DM_MASK  0x7
-
-#define IOAPIC_VECTOR_MASK  0xff
-
-#define IOAPIC_IOREGSEL 0x00
-#define IOAPIC_IOWIN0x10
-
-#define IOAPIC_REG_ID   0x00
-#define IOAPIC_REG_VER  0x01
-#define IOAPIC_REG_ARB  0x02
-#define IOAPIC_REG_REDTBL_BASE  0x10
-#define IOAPIC_ID   0x00
-
-#define IOAPIC_ID_SHIFT 24
-#define IOAPIC_ID_MASK  0xf
-
-#define IOAPIC_VER_ENTRIES_SHIFT16
-
-typedef struct IOAPICState IOAPICState;
-
-struct IOAPICState {
-SysBusDevice busdev;
-MemoryRegion io_memory;
-uint8_t id;
-uint8_t ioregsel;
-uint32_t irr;
-uint64_t ioredtbl[IOAPIC_NUM_PINS];
-};
-
 static IOAPICState *ioapics[MAX_IOAPICS];
 
 static void ioapic_service(IOAPICState *s)
@@ -278,44 +220,11 @@ ioapic_mem_write(void *opaque, target_phys_addr_t addr, 
uint64_t val,
 }
 }
 
-static int ioapic_post_load(void *opaque, int version_id)
-{
-IOAPICState *s = opaque;
-
-if (version_id == 1) {
-/* set sane value */
-s->irr = 0;
-}
-return 0;
-}
-
-static const VMStateDescription vmstate_ioapic = {
-.name = "ioapic",
-.version_id = 3,
-.post_load = ioapic_post_load,
-.minimum_version_id = 1,
-.minimum_version_id_old = 1,
-.fields = (VMStateField[]) {
-VMSTATE_UINT8(id, IOAPICState),
-VMSTATE_UINT8(ioregsel, IOAPICState),
-VMSTATE_UNUSED_V(2, 8), /* to account for qemu-kvm's v2 format */
-VMSTATE_UINT32_V(irr, IOAPICState, 2),
-VMSTATE_UINT64_ARRAY(ioredtbl, IOAPICState, IOAPIC_NUM_PINS),
-VMSTATE_END_OF_LIST()
-}
-};
-
 static void ioapic_reset(DeviceState *d)
 {
 IOAPICState *s = DO_UPCAST(IOAPICState, busdev.qdev, d);
-int i;
 
-s->id = 0;
-s->ioregsel = 0;
-s->irr = 0;
-for (i = 0; i < IOAPIC_NUM_PINS; i++) {
-s->ioredtbl[i] = 1 << IOAPIC_LVT_MASKED_SHIFT;
-}
+ioapic_reset_internal(s);
 }
 
 static const MemoryRegionOps ioapic_io_ops = {
@@ -327,18 +236,19 @@ static const MemoryRegionOps ioapic_io_ops = {
 static int ioapic_init1(SysBusDevice *dev)
 {
 IOAPICState *s = FROM_SYSBUS(IOAPICState, dev);
-static int ioapic_no;
+int ioapic_no;
 
-if (ioapic_no >= MAX_IOAPICS) {
+memory_region_init_io(&s->io_memory, &ioapic_io_ops, s, "ioapic", 0x1000);
+
+ioapic_no = ioapic_init_common(s);
+if (ioapic_no < 0

[PATCH v2 15/16] kvm: x86: Add user space part for in-kernel IOAPIC

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

This introduces the KVM-accelerated IOAPIC model 'kvm-ioapic' and
extends the IRQ routing setup by the 0->2 redirection when needed.

The kvm-ioapic model has a property that allows to define its GSI base
for injecting interrupts into the kernel model. This will allow to
disentangle PIC and IOAPIC pins for chipsets that support more
sophisticated IRQ routes than the PIIX3. So far the base is kept at 0,
i.e. PIC and IOAPIC share pins 0..15.

Signed-off-by: Jan Kiszka 
---
 Makefile.target  |2 +-
 hw/ioapic_internal.h |1 +
 hw/kvm/ioapic.c  |  120 ++
 hw/pc_piix.c |   15 ++-
 4 files changed, 136 insertions(+), 2 deletions(-)
 create mode 100644 hw/kvm/ioapic.c

diff --git a/Makefile.target b/Makefile.target
index 850b80f..2f3407b 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -231,7 +231,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o kvm/ioapic.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/ioapic_internal.h b/hw/ioapic_internal.h
index bda3608..7d5f735 100644
--- a/hw/ioapic_internal.h
+++ b/hw/ioapic_internal.h
@@ -83,6 +83,7 @@ struct IOAPICState {
 
 void (*pre_save)(IOAPICState *s);
 void (*post_load)(IOAPICState *s);
+uint32_t kvm_gsi_base;
 };
 
 extern const VMStateDescription vmstate_ioapic;
diff --git a/hw/kvm/ioapic.c b/hw/kvm/ioapic.c
new file mode 100644
index 000..1040e29
--- /dev/null
+++ b/hw/kvm/ioapic.c
@@ -0,0 +1,120 @@
+/*
+ * KVM in-kernel IOPIC support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "hw/pc.h"
+#include "hw/ioapic_internal.h"
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+static void kvm_ioapic_get(IOAPICState *s)
+{
+struct kvm_irqchip chip;
+struct kvm_ioapic_state *kioapic;
+int ret, i;
+
+chip.chip_id = KVM_IRQCHIP_IOAPIC;
+ret = kvm_vm_ioctl(kvm_state, KVM_GET_IRQCHIP, &chip);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+abort();
+}
+
+kioapic = &chip.chip.ioapic;
+
+s->id = kioapic->id;
+s->ioregsel = kioapic->ioregsel;
+s->irr = kioapic->irr;
+for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+s->ioredtbl[i] = kioapic->redirtbl[i].bits;
+}
+}
+
+static void kvm_ioapic_put(IOAPICState *s)
+{
+struct kvm_irqchip chip;
+struct kvm_ioapic_state *kioapic;
+int ret, i;
+
+chip.chip_id = KVM_IRQCHIP_IOAPIC;
+kioapic = &chip.chip.ioapic;
+
+kioapic->id = s->id;
+kioapic->ioregsel = s->ioregsel;
+kioapic->base_address = s->busdev.mmio[0].addr;
+kioapic->irr = s->irr;
+for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+kioapic->redirtbl[i].bits = s->ioredtbl[i];
+}
+
+ret = kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, &chip);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+abort();
+}
+}
+
+static void kvm_ioapic_reset(DeviceState *d)
+{
+IOAPICState *s = DO_UPCAST(IOAPICState, busdev.qdev, d);
+
+ioapic_reset_internal(s);
+
+kvm_ioapic_put(s);
+}
+
+static void kvm_ioapic_set_irq(void *opaque, int irq, int level)
+{
+IOAPICState *s = opaque;
+int delivered;
+
+delivered = kvm_irqchip_set_irq(kvm_state, s->kvm_gsi_base + irq, level);
+apic_set_irq_delivered(delivered);
+}
+
+static int kvm_ioapic_init(SysBusDevice *dev)
+{
+IOAPICState *s = FROM_SYSBUS(IOAPICState, dev);
+
+memory_region_init_reservation(&s->io_memory, "kvm-ioapic", 0x1000);
+
+if (ioapic_init_common(s) < 0) {
+memory_region_destroy(&s->io_memory);
+return -1;
+}
+
+s->pre_save = kvm_ioapic_get;
+s->post_load = kvm_ioapic_put;
+
+qdev_init_gpio_in(&dev->qdev, kvm_ioapic_set_irq, IOAPIC_NUM_PINS);
+
+return 0;
+}
+
+static SysBusDeviceInfo kvm_ioapic_info = {
+.init = kvm_ioapic_init,
+.qdev.name = "kvm-ioapic",
+.qdev.size = sizeof(IOAPICState),
+.qdev.vmsd = &vmstate_ioapic,
+.qdev.reset = kvm_ioapic_reset,
+.qdev.no_user = 1,
+.qdev.props = (Property[]) {
+DEFINE_PROP_UINT32("gsi_base", IOAPICState, kvm_gsi_base, 0),
+DEFINE_PROP_END_OF_LIST(),
+}
+};
+
+static void kvm_ioapic_register_devices(void)
+{
+sysbus_register_withprop(&kvm_ioapic_info);
+}
+
+device_init(kvm_ioapic_register_devices)
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 351b032..624aecd 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -68,6 +68,15 @@ static void kvm_piix3_setup_irq_routing(bool pci_enabled)
 for (i = 8; i < 16; ++

[PATCH v2 06/16] i8259: Factor out core for KVM reuse

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Analogously to the APIC, we will reuse some parts of the user space
i8259 model for KVM. In this case it is the PicState, vmstate
description, a reset core and some init bits.

Signed-off-by: Jan Kiszka 
---
 Makefile.objs   |2 +-
 hw/i8259.c  |   78 +-
 hw/i8259_common.c   |  103 +++
 hw/i8259_internal.h |   67 +
 4 files changed, 174 insertions(+), 76 deletions(-)
 create mode 100644 hw/i8259_common.c
 create mode 100644 hw/i8259_internal.h

diff --git a/Makefile.objs b/Makefile.objs
index 01587c8..5372eec 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -220,7 +220,7 @@ hw-obj-$(CONFIG_APPLESMC) += applesmc.o
 hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o
 hw-obj-$(CONFIG_SMARTCARD_NSS) += ccid-card-emulated.o
 hw-obj-$(CONFIG_USB_REDIR) += usb-redir.o
-hw-obj-$(CONFIG_I8259) += i8259.o
+hw-obj-$(CONFIG_I8259) += i8259_common.o i8259.o
 
 # PPC devices
 hw-obj-$(CONFIG_PREP_PCI) += prep_pci.o
diff --git a/hw/i8259.c b/hw/i8259.c
index ab519de..e8a6a9a 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -26,6 +26,7 @@
 #include "isa.h"
 #include "monitor.h"
 #include "qemu-timer.h"
+#include "i8259_internal.h"
 
 /* debug PIC */
 //#define DEBUG_PIC
@@ -40,33 +41,6 @@
 //#define DEBUG_IRQ_LATENCY
 //#define DEBUG_IRQ_COUNT
 
-struct PicState {
-ISADevice dev;
-uint8_t last_irr; /* edge detection */
-uint8_t irr; /* interrupt request register */
-uint8_t imr; /* interrupt mask register */
-uint8_t isr; /* interrupt service register */
-uint8_t priority_add; /* highest irq priority */
-uint8_t irq_base;
-uint8_t read_reg_select;
-uint8_t poll;
-uint8_t special_mask;
-uint8_t init_state;
-uint8_t auto_eoi;
-uint8_t rotate_on_auto_eoi;
-uint8_t special_fully_nested_mode;
-uint8_t init4; /* true if 4 byte init */
-uint8_t single_mode; /* true if slave pic is not initialized */
-uint8_t elcr; /* PIIX edge/trigger selection*/
-uint8_t elcr_mask;
-qemu_irq int_out[1];
-uint32_t master; /* reflects /SP input pin */
-uint32_t iobase;
-uint32_t elcr_addr;
-MemoryRegion base_io;
-MemoryRegion elcr_io;
-};
-
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
 static int irq_level[16];
 #endif
@@ -248,22 +222,7 @@ int pic_read_irq(PicState *s)
 
 static void pic_init_reset(PicState *s)
 {
-s->last_irr = 0;
-s->irr = 0;
-s->imr = 0;
-s->isr = 0;
-s->priority_add = 0;
-s->irq_base = 0;
-s->read_reg_select = 0;
-s->poll = 0;
-s->special_mask = 0;
-s->init_state = 0;
-s->auto_eoi = 0;
-s->rotate_on_auto_eoi = 0;
-s->special_fully_nested_mode = 0;
-s->init4 = 0;
-s->single_mode = 0;
-/* Note: ELCR is not reset */
+pic_reset_internal(s);
 pic_update_irq(s);
 }
 
@@ -418,32 +377,6 @@ static uint64_t elcr_ioport_read(void *opaque, 
target_phys_addr_t addr,
 return s->elcr;
 }
 
-static const VMStateDescription vmstate_pic = {
-.name = "i8259",
-.version_id = 1,
-.minimum_version_id = 1,
-.minimum_version_id_old = 1,
-.fields = (VMStateField[]) {
-VMSTATE_UINT8(last_irr, PicState),
-VMSTATE_UINT8(irr, PicState),
-VMSTATE_UINT8(imr, PicState),
-VMSTATE_UINT8(isr, PicState),
-VMSTATE_UINT8(priority_add, PicState),
-VMSTATE_UINT8(irq_base, PicState),
-VMSTATE_UINT8(read_reg_select, PicState),
-VMSTATE_UINT8(poll, PicState),
-VMSTATE_UINT8(special_mask, PicState),
-VMSTATE_UINT8(init_state, PicState),
-VMSTATE_UINT8(auto_eoi, PicState),
-VMSTATE_UINT8(rotate_on_auto_eoi, PicState),
-VMSTATE_UINT8(special_fully_nested_mode, PicState),
-VMSTATE_UINT8(init4, PicState),
-VMSTATE_UINT8(single_mode, PicState),
-VMSTATE_UINT8(elcr, PicState),
-VMSTATE_END_OF_LIST()
-}
-};
-
 static const MemoryRegionOps pic_base_ioport_ops = {
 .read = pic_ioport_read,
 .write = pic_ioport_write,
@@ -469,16 +402,11 @@ static int pic_initfn(ISADevice *dev)
 memory_region_init_io(&s->base_io, &pic_base_ioport_ops, s, "pic", 2);
 memory_region_init_io(&s->elcr_io, &pic_elcr_ioport_ops, s, "elcr", 1);
 
-isa_register_ioport(NULL, &s->base_io, s->iobase);
-if (s->elcr_addr != -1) {
-isa_register_ioport(NULL, &s->elcr_io, s->elcr_addr);
-}
+pic_init_common(s);
 
 qdev_init_gpio_out(&dev->qdev, s->int_out, ARRAY_SIZE(s->int_out));
 qdev_init_gpio_in(&dev->qdev, pic_set_irq, 8);
 
-qdev_set_legacy_instance_id(&dev->qdev, s->iobase, 1);
-
 return 0;
 }
 
diff --git a/hw/i8259_common.c b/hw/i8259_common.c
new file mode 100644
index 000..9d2fbc3
--- /dev/null
+++ b/hw/i8259_common.c
@@ -0,0 +1,103 @@
+/*
+ * QEMU 8259 - common bits of emulated and KVM kernel model
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyrig

[PATCH v2 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Introduce the alternative 'kvm-i8259' device model that exploits KVM
in-kernel acceleration.

The PIIX3 initialization code is furthermore extended by KVM specific
IRQ route setup. Moreover, GSI injection differs in KVM mode from the
user space model. As we can dispatch ISA-range IRQs to both IOAPIC and
PIC inside the kernel, we do not need to inject them separately. This is
reflected by a KVM-specific GSI handler.

Signed-off-by: Jan Kiszka 
---
 Makefile.target |2 +-
 hw/kvm/i8259.c  |  154 +++
 hw/pc.h |1 +
 hw/pc_piix.c|   50 --
 4 files changed, 202 insertions(+), 5 deletions(-)
 create mode 100644 hw/kvm/i8259.c

diff --git a/Makefile.target b/Makefile.target
index 66b42d5..850b80f 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -231,7 +231,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/kvm/i8259.c b/hw/kvm/i8259.c
new file mode 100644
index 000..f3994cb
--- /dev/null
+++ b/hw/kvm/i8259.c
@@ -0,0 +1,154 @@
+/*
+ * KVM in-kernel PIC (i8259) support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+#include "hw/i8259_internal.h"
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+static void kvm_pic_get(PicState *s)
+{
+struct kvm_irqchip chip;
+struct kvm_pic_state *kpic;
+int ret;
+
+chip.chip_id = s->master ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE;
+ret = kvm_vm_ioctl(kvm_state, KVM_GET_IRQCHIP, &chip);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+abort();
+}
+
+kpic = &chip.chip.pic;
+
+s->last_irr = kpic->last_irr;
+s->irr = kpic->irr;
+s->imr = kpic->imr;
+s->isr = kpic->isr;
+s->priority_add = kpic->priority_add;
+s->irq_base = kpic->irq_base;
+s->read_reg_select = kpic->read_reg_select;
+s->poll = kpic->poll;
+s->special_mask = kpic->special_mask;
+s->init_state = kpic->init_state;
+s->auto_eoi = kpic->auto_eoi;
+s->rotate_on_auto_eoi = kpic->rotate_on_auto_eoi;
+s->special_fully_nested_mode = kpic->special_fully_nested_mode;
+s->init4 = kpic->init4;
+s->elcr = kpic->elcr;
+s->elcr_mask = kpic->elcr_mask;
+}
+
+static void kvm_pic_put(PicState *s)
+{
+struct kvm_irqchip chip;
+struct kvm_pic_state *kpic;
+int ret;
+
+chip.chip_id = s->master ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE;
+
+kpic = &chip.chip.pic;
+
+kpic->last_irr = s->last_irr;
+kpic->irr = s->irr;
+kpic->imr = s->imr;
+kpic->isr = s->isr;
+kpic->priority_add = s->priority_add;
+kpic->irq_base = s->irq_base;
+kpic->read_reg_select = s->read_reg_select;
+kpic->poll = s->poll;
+kpic->special_mask = s->special_mask;
+kpic->init_state = s->init_state;
+kpic->auto_eoi = s->auto_eoi;
+kpic->rotate_on_auto_eoi = s->rotate_on_auto_eoi;
+kpic->special_fully_nested_mode = s->special_fully_nested_mode;
+kpic->init4 = s->init4;
+kpic->elcr = s->elcr;
+kpic->elcr_mask = s->elcr_mask;
+
+ret = kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, &chip);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+abort();
+}
+}
+
+static void kvm_pic_reset(DeviceState *dev)
+{
+PicState *s = container_of(dev, PicState, dev.qdev);
+
+pic_reset_internal(s);
+s->elcr = 0;
+
+kvm_pic_put(s);
+}
+
+static void kvm_pic_set_irq(void *opaque, int irq, int level)
+{
+int delivered;
+
+delivered = kvm_irqchip_set_irq(kvm_state, irq, level);
+apic_set_irq_delivered(delivered);
+}
+
+static int kvm_pic_init(ISADevice *dev)
+{
+PicState *s = DO_UPCAST(PicState, dev, dev);
+
+memory_region_init_reservation(&s->base_io, "kvm-pic", 2);
+memory_region_init_reservation(&s->elcr_io, "kvm-elcr", 1);
+
+pic_init_common(s);
+
+s->pre_save = kvm_pic_get;
+s->post_load = kvm_pic_put;
+
+return 0;
+}
+
+qemu_irq *kvm_i8259_init(void)
+{
+ISADevice *dev;
+
+dev = isa_create("kvm-i8259");
+qdev_prop_set_uint32(&dev->qdev, "iobase", 0x20);
+qdev_prop_set_uint32(&dev->qdev, "elcr_addr", 0x4d0);
+qdev_prop_set_bit(&dev->qdev, "master", true);
+qdev_init_nofail(&dev->qdev);
+
+dev = isa_create("kvm-i8259");
+qdev_prop_set_uint32(&dev->qdev, "iobase", 0xa0);
+qdev_prop_set_uint32(&dev->qdev, "elcr_addr", 0x4d1);
+qdev_init_nofail(&dev->qdev);
+
+return qemu_allocate_irqs(kvm_pic_set_irq, NULL, ISA_NUM_IRQS);
+}

[PATCH v2 05/16] apic: Open-code timer save/restore

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

To enable migration between accelerated and non-accelerated APIC models,
we will need to handle the timer saving and restoring specially and can
no longer rely on the automatics of VMSTATE_TIMER. Specifically,
accelerated model will not start any QEMUTimer.

This patch therefore factors out the generic bits into apic_next_timer
and introduces a post-load callback that can be implemented differently
by both models.

Signed-off-by: Jan Kiszka 
---
 hw/apic.c  |   30 --
 hw/apic_common.c   |   51 +--
 hw/apic_internal.h |3 +++
 3 files changed, 64 insertions(+), 20 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 27b18d6..9b83c0c 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -516,25 +516,9 @@ static uint32_t apic_get_current_count(APICState *s)
 
 static void apic_timer_update(APICState *s, int64_t current_time)
 {
-int64_t next_time, d;
-
-if (!(s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED)) {
-d = (current_time - s->initial_count_load_time) >>
-s->count_shift;
-if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
-if (!s->initial_count)
-goto no_timer;
-d = ((d / ((uint64_t)s->initial_count + 1)) + 1) * 
((uint64_t)s->initial_count + 1);
-} else {
-if (d >= s->initial_count)
-goto no_timer;
-d = (uint64_t)s->initial_count + 1;
-}
-next_time = s->initial_count_load_time + (d << s->count_shift);
-qemu_mod_timer(s->timer, next_time);
-s->next_time = next_time;
+if (apic_next_timer(s, current_time)) {
+qemu_mod_timer(s->timer, s->next_time);
 } else {
-no_timer:
 qemu_del_timer(s->timer);
 }
 }
@@ -756,6 +740,15 @@ static const MemoryRegionOps apic_io_ops = {
 .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static void apic_post_load(APICState *s)
+{
+if (s->timer_expiry != -1) {
+qemu_mod_timer(s->timer, s->timer_expiry);
+} else {
+qemu_del_timer(s->timer);
+}
+}
+
 static int apic_init(SysBusDevice *dev)
 {
 APICState *s = FROM_SYSBUS(APICState, dev);
@@ -772,6 +765,7 @@ static int apic_init(SysBusDevice *dev)
 s->timer = qemu_new_timer_ns(vm_clock, apic_timer, s);
 s->set_base = apic_set_base;
 s->set_tpr = apic_set_tpr;
+s->post_load = apic_post_load;
 local_apics[s->idx] = s;
 return 0;
 }
diff --git a/hw/apic_common.c b/hw/apic_common.c
index 7d30356..84a3a27 100644
--- a/hw/apic_common.c
+++ b/hw/apic_common.c
@@ -80,6 +80,39 @@ int apic_get_irq_delivered(void)
 return apic_irq_delivered;
 }
 
+bool apic_next_timer(APICState *s, int64_t current_time)
+{
+int64_t d;
+
+/* We need to store the timer state separately to support APIC
+ * implementations that maintain a non-QEMU timer, e.g. inside the
+ * host kernel. This open-coded state allows us to migrate between
+ * both models. */
+s->timer_expiry = -1;
+
+if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED) {
+return false;
+}
+
+d = (current_time - s->initial_count_load_time) >> s->count_shift;
+
+if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
+if (!s->initial_count) {
+return false;
+}
+d = ((d / ((uint64_t)s->initial_count + 1)) + 1) *
+((uint64_t)s->initial_count + 1);
+} else {
+if (d >= s->initial_count) {
+return false;
+}
+d = (uint64_t)s->initial_count + 1;
+}
+s->next_time = s->initial_count_load_time + (d << s->count_shift);
+s->timer_expiry = s->next_time;
+return true;
+}
+
 void apic_init_reset(DeviceState *d)
 {
 APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
@@ -107,7 +140,10 @@ void apic_init_reset(DeviceState *d)
 s->next_time = 0;
 s->wait_for_sipi = 1;
 
-qemu_del_timer(s->timer);
+if (s->timer) {
+qemu_del_timer(s->timer);
+}
+s->timer_expiry = -1;
 }
 
 void apic_reset(DeviceState *d)
@@ -172,12 +208,23 @@ static int apic_load_old(QEMUFile *f, void *opaque, int 
version_id)
 return 0;
 }
 
+static int apic_dispatch_post_load(void *opaque, int version_id)
+{
+APICState *s = opaque;
+
+if (s->post_load) {
+s->post_load(s);
+}
+return 0;
+}
+
 const VMStateDescription vmstate_apic = {
 .name = "apic",
 .version_id = 3,
 .minimum_version_id = 3,
 .minimum_version_id_old = 1,
 .load_state_old = apic_load_old,
+.post_load = apic_dispatch_post_load,
 .fields  = (VMStateField[]) {
 VMSTATE_UINT32(apicbase, APICState),
 VMSTATE_UINT8(id, APICState),
@@ -197,7 +244,7 @@ const VMStateDescription vmstate_apic = {
 VMSTATE_UINT32(initial_count, APICState),
 VMSTATE_INT64(initial_count_load_time, APICState),
 VMSTATE_INT64(next_time, APICState),
-VMSTATE_TIMER(timer, APICState),
+VMSTATE_

[PATCH v2 16/16] kvm: Arm in-kernel irqchip support

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Make the basic in-kernel irqchip support selectable via
-machine ...,kernel_irqchip=on. Leave it off by default until it can
fully replace user space models.

Signed-off-by: Jan Kiszka 
---
 qemu-config.c   |4 
 qemu-options.hx |5 -
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qemu-config.c b/qemu-config.c
index 90b6b3e..fc25115 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -483,6 +483,10 @@ static QemuOptsList qemu_machine_opts = {
 .name = "accel",
 .type = QEMU_OPT_STRING,
 .help = "accelerator list",
+}, {
+.name = "kernel_irqchip",
+.type = QEMU_OPT_BOOL,
+.help = "use KVM in-kernel irqchip",
 },
 { /* End of list */ }
 },
diff --git a/qemu-options.hx b/qemu-options.hx
index 5d2a776..e10186b 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -31,7 +31,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
 "-machine [type=]name[,prop[=value][,...]]\n"
 "selects emulated machine (-machine ? for list)\n"
 "property accel=accel1[:accel2[:...]] selects 
accelerator\n"
-"supported accelerators are kvm, xen, tcg (default: 
tcg)\n",
+"supported accelerators are kvm, xen, tcg (default: tcg)\n"
+"kernel_irqchip=on|off controls accelerated irqchip 
support\n",
 QEMU_ARCH_ALL)
 STEXI
 @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
@@ -44,6 +45,8 @@ This is used to enable an accelerator. Depending on the 
target architecture,
 kvm, xen, or tcg can be available. By default, tcg is used. If there is more
 than one accelerator specified, the next one is used if the previous one fails
 to initialize.
+@item kernel_irqchip=on|off
+Enables in-kernel irqchip support for the chosen accelerator when available.
 @end table
 ETEXI
 
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 13/16] kvm: x86: Add user space part for in-kernel APIC

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

This introduces the alternative APIC model 'kvm-apic' which makes use of
KVM's in-kernel device model. MSI is not yet supported, so we disable
this when the in-kernel model is in use.

Signed-off-by: Jan Kiszka 
---
 Makefile.target   |2 +-
 hw/kvm/apic.c |  147 +
 hw/pc.c   |   15 --
 kvm.h |3 +
 target-i386/kvm.c |8 +++
 5 files changed, 169 insertions(+), 6 deletions(-)
 create mode 100644 hw/kvm/apic.c

diff --git a/Makefile.target b/Makefile.target
index 4cd3c0e..66b42d5 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -231,7 +231,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/kvm/apic.c b/hw/kvm/apic.c
new file mode 100644
index 000..be6f401
--- /dev/null
+++ b/hw/kvm/apic.c
@@ -0,0 +1,147 @@
+/*
+ * KVM in-kernel APIC support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+static inline void kvm_apic_set_reg(struct kvm_lapic_state *kapic,
+   int reg_id, uint32_t val)
+{
+*((uint32_t *)(kapic->regs + (reg_id << 4))) = val;
+}
+
+static inline uint32_t kvm_apic_get_reg(struct kvm_lapic_state *kapic,
+   int reg_id)
+{
+return *((uint32_t *)(kapic->regs + (reg_id << 4)));
+}
+
+int kvm_put_apic(CPUState *env)
+{
+APICState *s = DO_UPCAST(APICState, busdev.qdev, env->apic_state);
+struct kvm_lapic_state kapic;
+int i;
+
+if (s && kvm_enabled() && kvm_irqchip_in_kernel()) {
+memset(&kapic, 0, sizeof(kapic));
+kvm_apic_set_reg(&kapic, 0x2, s->id << 24);
+kvm_apic_set_reg(&kapic, 0x8, s->tpr);
+kvm_apic_set_reg(&kapic, 0xd, s->log_dest << 24);
+kvm_apic_set_reg(&kapic, 0xe, s->dest_mode << 28 | 0x0fff);
+kvm_apic_set_reg(&kapic, 0xf, s->spurious_vec);
+for (i = 0; i < 8; i++) {
+kvm_apic_set_reg(&kapic, 0x10 + i, s->isr[i]);
+kvm_apic_set_reg(&kapic, 0x18 + i, s->tmr[i]);
+kvm_apic_set_reg(&kapic, 0x20 + i, s->irr[i]);
+}
+kvm_apic_set_reg(&kapic, 0x28, s->esr);
+kvm_apic_set_reg(&kapic, 0x30, s->icr[0]);
+kvm_apic_set_reg(&kapic, 0x31, s->icr[1]);
+for (i = 0; i < APIC_LVT_NB; i++) {
+kvm_apic_set_reg(&kapic, 0x32 + i, s->lvt[i]);
+}
+kvm_apic_set_reg(&kapic, 0x38, s->initial_count);
+kvm_apic_set_reg(&kapic, 0x3e, s->divide_conf);
+
+return kvm_vcpu_ioctl(env, KVM_SET_LAPIC, &kapic);
+}
+
+return 0;
+}
+
+int kvm_get_apic(CPUState *env)
+{
+APICState *s = DO_UPCAST(APICState, busdev.qdev, env->apic_state);
+struct kvm_lapic_state kapic;
+int ret, i, v;
+
+if (s && kvm_enabled() && kvm_irqchip_in_kernel()) {
+ret = kvm_vcpu_ioctl(env, KVM_GET_LAPIC, &kapic);
+if (ret < 0) {
+return ret;
+}
+
+s->id = kvm_apic_get_reg(&kapic, 0x2) >> 24;
+s->tpr = kvm_apic_get_reg(&kapic, 0x8);
+s->arb_id = kvm_apic_get_reg(&kapic, 0x9);
+s->log_dest = kvm_apic_get_reg(&kapic, 0xd) >> 24;
+s->dest_mode = kvm_apic_get_reg(&kapic, 0xe) >> 28;
+s->spurious_vec = kvm_apic_get_reg(&kapic, 0xf);
+for (i = 0; i < 8; i++) {
+s->isr[i] = kvm_apic_get_reg(&kapic, 0x10 + i);
+s->tmr[i] = kvm_apic_get_reg(&kapic, 0x18 + i);
+s->irr[i] = kvm_apic_get_reg(&kapic, 0x20 + i);
+}
+s->esr = kvm_apic_get_reg(&kapic, 0x28);
+s->icr[0] = kvm_apic_get_reg(&kapic, 0x30);
+s->icr[1] = kvm_apic_get_reg(&kapic, 0x31);
+for (i = 0; i < APIC_LVT_NB; i++) {
+s->lvt[i] = kvm_apic_get_reg(&kapic, 0x32 + i);
+}
+s->initial_count = kvm_apic_get_reg(&kapic, 0x38);
+s->divide_conf = kvm_apic_get_reg(&kapic, 0x3e);
+
+v = (s->divide_conf & 3) | ((s->divide_conf >> 1) & 4);
+s->count_shift = (v + 1) & 7;
+
+s->initial_count_load_time = qemu_get_clock_ns(vm_clock);
+apic_next_timer(s, s->initial_count_load_time);
+}
+return 0;
+}
+
+static void kvm_apic_set_base(APICState *s, uint64_t val)
+{
+s->apicbase = val;
+}
+
+static void kvm_apic_set_tpr(APICState *s, uint8_t val)
+{
+s->tpr = (val & 0x0f) << 4;
+}
+
+static int kvm_apic_init(SysBusDevice *dev)
+{
+APICState *s = FROM_SYSBUS(APICState, dev);
+
+memory_region_init_reservation(&s->io_memory, "kvm-apic-msi",
+

[PATCH v2 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Add the basic infrastructure to active in-kernel irqchip support, inject
interrupts into these models, and maintain IRQ routes.

Routing is optional and depends on the host arch supporting
KVM_CAP_IRQ_ROUTING. When it's not available on x86, we looe the HPET as
we can't route GSI0 to IOAPIC pin 2.

In-kernel irqchip support will once be controlled by the machine
property 'kernel_irqchip', but this is not yet wired up.

Signed-off-by: Jan Kiszka 
---
 kvm-all.c |  149 +
 kvm.h |8 +++
 target-i386/kvm.c |   11 
 3 files changed, 168 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index e7faf5c..a85e14f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -76,6 +76,13 @@ struct KVMState
 int pit_in_kernel;
 int xsave, xcrs;
 int many_ioeventfds;
+int irqchip_inject_ioctl;
+#ifdef KVM_CAP_IRQ_ROUTING
+struct kvm_irq_routing *irq_routes;
+int nr_allocated_irq_routes;
+uint32_t *used_gsi_bitmap;
+unsigned int max_gsi;
+#endif
 };
 
 KVMState *kvm_state;
@@ -692,6 +699,138 @@ static void kvm_handle_interrupt(CPUState *env, int mask)
 }
 }
 
+int kvm_irqchip_set_irq(KVMState *s, int irq, int level)
+{
+struct kvm_irq_level event;
+int ret;
+
+assert(s->irqchip_in_kernel);
+
+event.level = level;
+event.irq = irq;
+ret = kvm_vm_ioctl(s, s->irqchip_inject_ioctl, &event);
+if (ret < 0) {
+perror("kvm_set_irqchip_line");
+abort();
+}
+
+return (s->irqchip_inject_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
+}
+
+#ifdef KVM_CAP_IRQ_ROUTING
+static void set_gsi(KVMState *s, unsigned int gsi)
+{
+assert(gsi < s->max_gsi);
+
+s->used_gsi_bitmap[gsi / 32] |= 1U << (gsi % 32);
+}
+
+static void kvm_init_irq_routing(KVMState *s)
+{
+int gsi_count;
+
+gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING);
+if (gsi_count > 0) {
+unsigned int gsi_bits, i;
+
+/* Round up so we can search ints using ffs */
+gsi_bits = (gsi_count + 31) / 32;
+s->used_gsi_bitmap = g_malloc0(gsi_bits / 8);
+s->max_gsi = gsi_bits;
+
+/* Mark any over-allocated bits as already in use */
+for (i = gsi_count; i < gsi_bits; i++) {
+set_gsi(s, i);
+}
+}
+
+s->irq_routes = g_malloc0(sizeof(*s->irq_routes));
+s->nr_allocated_irq_routes = 0;
+
+kvm_arch_init_irq_routing(s);
+}
+
+static void kvm_add_routing_entry(KVMState *s,
+  struct kvm_irq_routing_entry *entry)
+{
+struct kvm_irq_routing_entry *new;
+int n, size;
+
+if (s->irq_routes->nr == s->nr_allocated_irq_routes) {
+n = s->nr_allocated_irq_routes * 2;
+if (n < 64) {
+n = 64;
+}
+size = sizeof(struct kvm_irq_routing);
+size += n * sizeof(*new);
+s->irq_routes = g_realloc(s->irq_routes, size);
+s->nr_allocated_irq_routes = n;
+}
+n = s->irq_routes->nr++;
+new = &s->irq_routes->entries[n];
+memset(new, 0, sizeof(*new));
+new->gsi = entry->gsi;
+new->type = entry->type;
+new->flags = entry->flags;
+new->u = entry->u;
+
+set_gsi(s, entry->gsi);
+}
+
+void kvm_irqchip_add_route(KVMState *s, int irq, int irqchip, int pin)
+{
+struct kvm_irq_routing_entry e;
+
+e.gsi = irq;
+e.type = KVM_IRQ_ROUTING_IRQCHIP;
+e.flags = 0;
+e.u.irqchip.irqchip = irqchip;
+e.u.irqchip.pin = pin;
+kvm_add_routing_entry(s, &e);
+}
+
+int kvm_irqchip_commit_routes(KVMState *s)
+{
+s->irq_routes->flags = 0;
+return kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes);
+}
+
+#else /* !KVM_CAP_IRQ_ROUTING */
+
+static void kvm_init_irq_routing(KVMState *s)
+{
+}
+#endif /* !KVM_CAP_IRQ_ROUTING */
+
+static int kvm_irqchip_create(KVMState *s)
+{
+QemuOptsList *list = qemu_find_opts("machine");
+int ret;
+
+if (QTAILQ_EMPTY(&list->head) ||
+!qemu_opt_get_bool(QTAILQ_FIRST(&list->head),
+   "kernel_irqchip", false) ||
+!kvm_check_extension(s, KVM_CAP_IRQCHIP)) {
+return 0;
+}
+
+ret = kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP);
+if (ret < 0) {
+fprintf(stderr, "Create kernel irqchip failed\n");
+return ret;
+}
+
+s->irqchip_inject_ioctl = KVM_IRQ_LINE;
+if (kvm_check_extension(s, KVM_CAP_IRQ_INJECT_STATUS)) {
+s->irqchip_inject_ioctl = KVM_IRQ_LINE_STATUS;
+}
+s->irqchip_in_kernel = 1;
+
+kvm_init_irq_routing(s);
+
+return 0;
+}
+
 int kvm_init(void)
 {
 static const char upgrade_note[] =
@@ -786,6 +925,11 @@ int kvm_init(void)
 goto err;
 }
 
+ret = kvm_irqchip_create(s);
+if (ret < 0) {
+goto err;
+}
+
 kvm_state = s;
 cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);
 
@@ -,6 +1255,11 @@ int kvm_has_many_ioeventfds(void)
 return kvm_state->many_ioeventfds;
 }
 
+int kvm_has_

[PATCH v2 12/16] kvm: x86: Establish IRQ0 override control

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

KVM is forced to disable the IRQ0 override when we run with in-kernel
irqchip but without IRQ routing support of the kernel. Set the fwcfg
value correspondingly. This aligns us with qemu-kvm.

Signed-off-by: Jan Kiszka 
---
 hw/pc.c|3 ++-
 kvm-all.c  |5 +
 kvm-stub.c |5 +
 kvm.h  |2 ++
 sysemu.h   |1 -
 vl.c   |1 -
 6 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 5225d5b..715cc63 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -39,6 +39,7 @@
 #include "msi.h"
 #include "sysbus.h"
 #include "sysemu.h"
+#include "kvm.h"
 #include "blockdev.h"
 #include "ui/qemu-spice.h"
 #include "memory.h"
@@ -609,7 +610,7 @@ static void *bochs_bios_init(void)
 fw_cfg_add_i64(fw_cfg, FW_CFG_RAM_SIZE, (uint64_t)ram_size);
 fw_cfg_add_bytes(fw_cfg, FW_CFG_ACPI_TABLES, (uint8_t *)acpi_tables,
  acpi_tables_len);
-fw_cfg_add_bytes(fw_cfg, FW_CFG_IRQ0_OVERRIDE, &irq0override, 1);
+fw_cfg_add_i32(fw_cfg, FW_CFG_IRQ0_OVERRIDE, kvm_allows_irq0_override());
 
 smbios_table = smbios_get_table(&smbios_len);
 if (smbios_table)
diff --git a/kvm-all.c b/kvm-all.c
index a85e14f..665455c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1260,6 +1260,11 @@ int kvm_has_gsi_routing(void)
 return kvm_check_extension(kvm_state, KVM_CAP_IRQ_ROUTING);
 }
 
+int kvm_allows_irq0_override(void)
+{
+return !kvm_enabled() || !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 06064b9..6c2b06b 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -78,6 +78,11 @@ int kvm_has_many_ioeventfds(void)
 return 0;
 }
 
+int kvm_allows_irq0_override(void)
+{
+return 1;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 0d6c453..a3c87af 100644
--- a/kvm.h
+++ b/kvm.h
@@ -53,6 +53,8 @@ int kvm_has_xcrs(void);
 int kvm_has_many_ioeventfds(void);
 int kvm_has_gsi_routing(void);
 
+int kvm_allows_irq0_override(void);
+
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
 
diff --git a/sysemu.h b/sysemu.h
index 22cd720..3bd896e 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -102,7 +102,6 @@ extern int vga_interface_type;
 extern int graphic_width;
 extern int graphic_height;
 extern int graphic_depth;
-extern uint8_t irq0override;
 extern DisplayType display_type;
 extern const char *keyboard_layout;
 extern int win2k_install_hack;
diff --git a/vl.c b/vl.c
index fcce25f..22d02b9 100644
--- a/vl.c
+++ b/vl.c
@@ -218,7 +218,6 @@ int no_reboot = 0;
 int no_shutdown = 0;
 int cursor_hide = 1;
 int graphic_rotate = 0;
-uint8_t irq0override = 1;
 const char *watchdog;
 QEMUOptionRom option_rom[MAX_OPTION_ROMS];
 int nb_option_roms;
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 10/16] memory: Introduce memory_region_init_reservation

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Introduce a memory region type that can reserve I/O space. Such regions
are useful for modeling I/O that is only handled outside of QEMU, i.e.
in the context of an accelerator like KVM.

Any access to such a region from QEMU is a bug, but could theoretically
be triggered by guest code (DMA to reserved region). So only warning
about such events once, then ignore them.

Signed-off-by: Jan Kiszka 
---
 memory.c |   36 
 memory.h |   16 
 2 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index dc5e35d..6d55cf6 100644
--- a/memory.c
+++ b/memory.c
@@ -1003,6 +1003,42 @@ void memory_region_init_rom_device(MemoryRegion *mr,
 mr->backend_registered = true;
 }
 
+static uint64_t invalid_read(void *opaque, target_phys_addr_t addr,
+ unsigned size)
+{
+MemoryRegion *mr = opaque;
+
+if (!mr->warning_printed) {
+fprintf(stderr, "Invalid read from memory region %s\n", mr->name);
+mr->warning_printed = true;
+}
+return -1U;
+}
+
+static void invalid_write(void *opaque, target_phys_addr_t addr, uint64_t data,
+  unsigned size)
+{
+MemoryRegion *mr = opaque;
+
+if (!mr->warning_printed) {
+fprintf(stderr, "Invalid write to memory region %s\n", mr->name);
+mr->warning_printed = true;
+}
+}
+
+static const MemoryRegionOps reservation_ops = {
+.read = invalid_read,
+.write = invalid_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+};
+
+void memory_region_init_reservation(MemoryRegion *mr,
+const char *name,
+uint64_t size)
+{
+memory_region_init_io(mr, &reservation_ops, mr, name, size);
+}
+
 void memory_region_destroy(MemoryRegion *mr)
 {
 assert(QTAILQ_EMPTY(&mr->subregions));
diff --git a/memory.h b/memory.h
index d5b47da..b479350 100644
--- a/memory.h
+++ b/memory.h
@@ -115,6 +115,7 @@ struct MemoryRegion {
 bool terminates;
 bool readable;
 bool readonly; /* For RAM regions */
+bool warning_printed; /* For reservations */
 MemoryRegion *alias;
 target_phys_addr_t alias_offset;
 unsigned priority;
@@ -242,6 +243,21 @@ void memory_region_init_rom_device(MemoryRegion *mr,
uint64_t size);
 
 /**
+ * memory_region_init_reservation: Initialize a memory region that reserves
+ * I/O space.
+ *
+ * A reservation region primariy serves debugging purposes.  It claims I/O
+ * space that is not supposed to be handled by QEMU itself.  Any access via
+ * the memory API will cause an abort().
+ *
+ * @mr: the #MemoryRegion to be initialized
+ * @name: used for debugging; not visible to the user or ABI
+ * @size: size of the region.
+ */
+void memory_region_init_reservation(MemoryRegion *mr,
+const char *name,
+uint64_t size);
+/**
  * memory_region_destroy: Destroy a memory region and relaim all resources.
  *
  * @mr: the region to be destroyed.  May not currently be a subregion
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 07/16] ioapic: Convert to memory API

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

This maintains the old imprecise access size handling.

Signed-off-by: Jan Kiszka 
---
 hw/ioapic.c |   28 +++-
 1 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/hw/ioapic.c b/hw/ioapic.c
index 61991d7..56b1612 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -86,6 +86,7 @@ typedef struct IOAPICState IOAPICState;
 
 struct IOAPICState {
 SysBusDevice busdev;
+MemoryRegion io_memory;
 uint8_t id;
 uint8_t ioregsel;
 uint32_t irr;
@@ -195,7 +196,8 @@ void ioapic_eoi_broadcast(int vector)
 }
 }
 
-static uint32_t ioapic_mem_readl(void *opaque, target_phys_addr_t addr)
+static uint64_t
+ioapic_mem_read(void *opaque, target_phys_addr_t addr, unsigned int size)
 {
 IOAPICState *s = opaque;
 int index;
@@ -234,7 +236,8 @@ static uint32_t ioapic_mem_readl(void *opaque, 
target_phys_addr_t addr)
 }
 
 static void
-ioapic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
+ioapic_mem_write(void *opaque, target_phys_addr_t addr, uint64_t val,
+ unsigned int size)
 {
 IOAPICState *s = opaque;
 int index;
@@ -309,32 +312,23 @@ static void ioapic_reset(DeviceState *d)
 }
 }
 
-static CPUReadMemoryFunc * const ioapic_mem_read[3] = {
-ioapic_mem_readl,
-ioapic_mem_readl,
-ioapic_mem_readl,
-};
-
-static CPUWriteMemoryFunc * const ioapic_mem_write[3] = {
-ioapic_mem_writel,
-ioapic_mem_writel,
-ioapic_mem_writel,
+static const MemoryRegionOps ioapic_io_ops = {
+.read = ioapic_mem_read,
+.write = ioapic_mem_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
 };
 
 static int ioapic_init1(SysBusDevice *dev)
 {
 IOAPICState *s = FROM_SYSBUS(IOAPICState, dev);
-int io_memory;
 static int ioapic_no;
 
 if (ioapic_no >= MAX_IOAPICS) {
 return -1;
 }
 
-io_memory = cpu_register_io_memory(ioapic_mem_read,
-   ioapic_mem_write, s,
-   DEVICE_NATIVE_ENDIAN);
-sysbus_init_mmio(dev, 0x1000, io_memory);
+memory_region_init_io(&s->io_memory, &ioapic_io_ops, s, "ioapic", 0x1000);
+sysbus_init_mmio_region(dev, &s->io_memory);
 
 qdev_init_gpio_in(&dev->qdev, ioapic_set_irq, IOAPIC_NUM_PINS);
 
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 03/16] apic: Stop timer on reset

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

All LVTs are masked on reset, so the timer becomes ineffective. Letting
it tick nevertheless is harmless, but will at least create a spurious
trace event.

Signed-off-by: Jan Kiszka 
---
 hw/apic.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 8289eef..2644a82 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -528,6 +528,8 @@ void apic_init_reset(DeviceState *d)
 s->initial_count_load_time = 0;
 s->next_time = 0;
 s->wait_for_sipi = 1;
+
+qemu_del_timer(s->timer);
 }
 
 static void apic_startup(APICState *s, int vector_num)
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

Rename msix_supported to msi_supported and control MSI and MSI-X
activation this way. That was likely to original intention for this
flag, but MSI support came after MSI-X.

Signed-off-by: Jan Kiszka 
---
 hw/msi.c  |8 
 hw/msi.h  |2 ++
 hw/msix.c |9 -
 hw/msix.h |2 --
 hw/pc.c   |4 ++--
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/hw/msi.c b/hw/msi.c
index f214fcf..5d6ceb6 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -36,6 +36,9 @@
 
 #define PCI_MSI_VECTORS_MAX 32
 
+/* Flag for interrupt controller to declare MSI/MSI-X support */
+bool msi_supported;
+
 /* If we get rid of cap allocator, we won't need this. */
 static inline uint8_t msi_cap_sizeof(uint16_t flags)
 {
@@ -116,6 +119,11 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
 uint16_t flags;
 uint8_t cap_size;
 int config_offset;
+
+if (!msi_supported) {
+return -ENOTSUP;
+}
+
 MSI_DEV_PRINTF(dev,
"init offset: 0x%"PRIx8" vector: %"PRId8
" 64bit %d mask %d\n",
diff --git a/hw/msi.h b/hw/msi.h
index 5766018..3040bb0 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -24,6 +24,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+extern bool msi_supported;
+
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
  unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
diff --git a/hw/msix.c b/hw/msix.c
index b15bafc..8850fbd 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -12,6 +12,7 @@
  */
 
 #include "hw.h"
+#include "msi.h"
 #include "msix.h"
 #include "pci.h"
 #include "range.h"
@@ -32,9 +33,6 @@
 #define MSIX_MAX_ENTRIES 32
 
 
-/* Flag for interrupt controller to declare MSI-X support */
-int msix_supported;
-
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.
@@ -212,10 +210,11 @@ int msix_init(struct PCIDevice *dev, unsigned short 
nentries,
   unsigned bar_nr, unsigned bar_size)
 {
 int ret;
+
 /* Nothing to do if MSI is not supported by interrupt controller */
-if (!msix_supported)
+if (!msi_supported) {
 return -ENOTSUP;
-
+}
 if (nentries > MSIX_MAX_ENTRIES)
 return -EINVAL;
 
diff --git a/hw/msix.h b/hw/msix.h
index 7e04336..5aba22b 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -29,6 +29,4 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
-extern int msix_supported;
-
 #endif
diff --git a/hw/pc.c b/hw/pc.c
index 9328ee5..5225d5b 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -36,7 +36,7 @@
 #include "elf.h"
 #include "multiboot.h"
 #include "mc146818rtc.h"
-#include "msix.h"
+#include "msi.h"
 #include "sysbus.h"
 #include "sysemu.h"
 #include "blockdev.h"
@@ -896,7 +896,7 @@ static DeviceState *apic_init(void *env, uint8_t apic_id)
 apic_mapped = 1;
 }
 
-msix_supported = 1;
+msi_supported = true;
 
 return dev;
 }
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 02/16] kvm: Move kvmclock into hw/kvm folder

2011-12-04 Thread Jan Kiszka

From: Jan Kiszka 

More KVM-specific devices will come, so let's start with moving the
kvmclock into a dedicated folder.

Signed-off-by: Jan Kiszka 
---
 Makefile.target|4 ++--
 configure  |1 +
 hw/{kvmclock.c => kvm/clock.c} |4 ++--
 hw/{kvmclock.h => kvm/clock.h} |0
 hw/pc_piix.c   |2 +-
 5 files changed, 6 insertions(+), 5 deletions(-)
 rename hw/{kvmclock.c => kvm/clock.c} (98%)
 rename hw/{kvmclock.h => kvm/clock.h} (100%)

diff --git a/Makefile.target b/Makefile.target
index 1e90df7..3a9e95d 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -231,7 +231,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvmclock.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
@@ -421,7 +421,7 @@ qmp-commands-old.h: $(SRC_PATH)/qmp-commands.hx
 
 clean:
rm -f *.o *.a *~ $(PROGS) nwfpe/*.o fpu/*.o
-   rm -f *.d */*.d tcg/*.o ide/*.o 9pfs/*.o
+   rm -f *.d */*.d tcg/*.o ide/*.o 9pfs/*.o kvm/*.o
rm -f hmp-commands.h qmp-commands-old.h gdbstub-xml.c
 ifdef CONFIG_TRACE_SYSTEMTAP
rm -f *.stp
diff --git a/configure b/configure
index 4f87e0a..d768e44 100755
--- a/configure
+++ b/configure
@@ -3220,6 +3220,7 @@ mkdir -p $target_dir/fpu
 mkdir -p $target_dir/tcg
 mkdir -p $target_dir/ide
 mkdir -p $target_dir/9pfs
+mkdir -p $target_dir/kvm
 if test "$target" = "arm-linux-user" -o "$target" = "armeb-linux-user" -o 
"$target" = "arm-bsd-user" -o "$target" = "armeb-bsd-user" ; then
   mkdir -p $target_dir/nwfpe
 fi
diff --git a/hw/kvmclock.c b/hw/kvm/clock.c
similarity index 98%
rename from hw/kvmclock.c
rename to hw/kvm/clock.c
index 5388bc4..5983271 100644
--- a/hw/kvmclock.c
+++ b/hw/kvm/clock.c
@@ -13,9 +13,9 @@
 
 #include "qemu-common.h"
 #include "sysemu.h"
-#include "sysbus.h"
 #include "kvm.h"
-#include "kvmclock.h"
+#include "hw/sysbus.h"
+#include "hw/kvm/clock.h"
 
 #include 
 #include 
diff --git a/hw/kvmclock.h b/hw/kvm/clock.h
similarity index 100%
rename from hw/kvmclock.h
rename to hw/kvm/clock.h
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index c89042f..22997b0 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -34,7 +34,7 @@
 #include "boards.h"
 #include "ide.h"
 #include "kvm.h"
-#include "kvmclock.h"
+#include "kvm/clock.h"
 #include "sysemu.h"
 #include "sysbus.h"
 #include "arch_init.h"
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 00/16] uq/master: Introduce basic irqchip support

2011-12-04 Thread Jan Kiszka

This is v2, addressing the feedback comments provided so far, namely:
 - dropped #include <> conversions
 - do not abort() on reserved memory region accesses but only warn once
 - use memory_region_init_io in memory_region_init_reservation

Patch 1 of this series has meanwhile been posted for direct upstream
inclusion, see http://thread.gmane.org/gmane.comp.emulators.qemu/127308.
I'm keeping it to easy testing, but it should likely not go via
uq/master. The same may apply to other patches too, e.g. 10.

--- original series description ---

Some weeks back I posted my MSI rework for qemu-kvm that shall once help
integrating those bits into upstream. After that I wondered how a
rewritten in-kernel irqchip model could look like and make use of this.
But then I realized that there is actually no technical need to role out
a first version of kvm irqchips that already support MSI. As the MSI
thing will likely take a few more iterations, I now decided to rush
forward with basic kvm irqchip for QEMU upstream. Here we go.

My idea was always to create proper alternatives to the existing user
space device models while keeping the vmstates 100% compatible. I think
I succeeded in this, tests worked fine so far. The kvm and the user
space models now have a common core where they share logic and specific
code modules where they differ. Also, I moved all kvm devices into
hw/kvm.

The in-kernel irqchip support can be controlled via a machine property
(-machine ...,kernel_irqchip=on), in contrast to qemu-kvm's dedicated
command line switch. This series keeps the support off by default
because we still lack the MSI bits as I explained. Also, in-kernel PIT
is not yet implemented and TPR patching/VAPIC (for Windows guests).

The merge story would basically look similar to what we did before with
the clean-room reimplementation of kvm for QEMU: Merge into upstream,
merge back into qemu-kvm, disabling the new bits for now, then gradually
switching over to the new services, specifically once they are
feature-equivalent. Of course, I will support these steps as usual.

So, feedback and review welcome!

Jan Kiszka (16):
  msi: Generalize msix_supported to msi_supported
  kvm: Move kvmclock into hw/kvm folder
  apic: Stop timer on reset
  apic: Factor out core for KVM reuse
  apic: Open-code timer save/restore
  i8259: Factor out core for KVM reuse
  ioapic: Convert to memory API
  ioapic: Reject non-dword accesses to IOWIN register
  ioapic: Factor out core for KVM reuse
  memory: Introduce memory_region_init_reservation
  kvm: Introduce core services for in-kernel irqchip support
  kvm: x86: Establish IRQ0 override control
  kvm: x86: Add user space part for in-kernel APIC
  kvm: x86: Add user space part for in-kernel i8259
  kvm: x86: Add user space part for in-kernel IOAPIC
  kvm: Arm in-kernel irqchip support

 Makefile.objs  |2 +-
 Makefile.target|6 +-
 configure  |1 +
 hw/apic.c  |  288 
 hw/apic_common.c   |  262 
 hw/apic_internal.h |  111 +++
 hw/i8259.c |   78 +---
 hw/i8259_common.c  |  103 ++
 hw/i8259_internal.h|   67 +
 hw/ioapic.c|  136 +++
 hw/ioapic_common.c |   89 
 hw/ioapic_internal.h   |   94 +
 hw/kvm/apic.c  |  147 
 hw/{kvmclock.c => kvm/clock.c} |4 +-
 hw/{kvmclock.h => kvm/clock.h} |0
 hw/kvm/i8259.c |  154 +
 hw/kvm/ioapic.c|  120 +
 hw/msi.c   |8 +
 hw/msi.h   |2 +
 hw/msix.c  |9 +-
 hw/msix.h  |2 -
 hw/pc.c|   20 ++-
 hw/pc.h|1 +
 hw/pc_piix.c   |   67 +-
 kvm-all.c  |  154 +
 kvm-stub.c |5 +
 kvm.h  |   13 ++
 memory.c   |   36 +
 memory.h   |   16 +++
 qemu-config.c  |4 +
 qemu-options.hx|5 +-
 sysemu.h   |1 -
 target-i386/kvm.c  |   19 +++
 trace-events   |2 +-
 vl.c   |1 -
 35 files changed, 1547 insertions(+), 480 deletions(-)
 create mode 100644 hw/apic_common.c
 create mode 100644 hw/apic_internal.h
 create mode 100644 hw/i8259_common.c
 create mode 100644 hw/i8259_internal.h
 create mode 100644 hw/ioapic_common.c
 create mode 100644 hw/ioapic_internal.h
 create mode 100644 hw/kvm/apic.c
 rename hw/{kvmclock.c => kvm/clock.c} (98%)
 rename hw/{kvmclock.h => kvm/clock.h} (100%)
 create mode 100644 hw/kvm/i8259.c
 create mode

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

On 2011-12-04 15:04, Avi Kivity wrote:
> On 12/04/2011 03:51 PM, Jan Kiszka wrote:
>>>
>>> But the name becomes part of the save/restore ABI, so you can't.
>>
>> Nope, the vmstate names are identical. That would ruin migration
>> otherwise. It's just the output of info qtree & co. that changes.
> 
> Oh, okay.  I still think it's wrong, but now it's just a matter of
> taste, and I can live with it.

Wrong in what sense?

I think the way of merging kvm support into the user space models in
qemu-kvm is not particularly beautiful. But that's my taste, and
therefore I modeled the upstream proposal differently. :)

Jan

signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Avi Kivity

On 12/04/2011 03:51 PM, Jan Kiszka wrote:
> > 
> > But the name becomes part of the save/restore ABI, so you can't.
>
> Nope, the vmstate names are identical. That would ruin migration
> otherwise. It's just the output of info qtree & co. that changes.

Oh, okay.  I still think it's wrong, but now it's just a matter of
taste, and I can live with it.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:49, Avi Kivity wrote:
> On 12/04/2011 03:42 PM, Jan Kiszka wrote:
>> On 2011-12-04 14:31, Avi Kivity wrote:
>>> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
 From: Jan Kiszka 

 Introduce the alternative 'kvm-i8259' device model that exploits KVM
 in-kernel acceleration.

 The PIIX3 initialization code is furthermore extended by KVM specific
 IRQ route setup. Moreover, GSI injection differs in KVM mode from the
 user space model. As we can dispatch ISA-range IRQs to both IOAPIC and
 PIC inside the kernel, we do not need to inject them separately. This is
 reflected by a KVM-specific GSI handler.

 +
 +qemu_irq *kvm_i8259_init(void)
 +{
 +ISADevice *dev;
 +
 +dev = isa_create("kvm-i8259");

>>>
>>> Same issue.  Is this a different device, or an different implementation
>>> of the same device?
>>
>> They are theoretically the same from guest perspective (therefore you
>> can migrate between machines that differ in this).
> 
> But the name becomes part of the save/restore ABI, so you can't.

Nope, the vmstate names are identical. That would ruin migration
otherwise. It's just the output of info qtree & co. that changes.

Jan

signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Avi Kivity

On 12/04/2011 03:42 PM, Jan Kiszka wrote:
> On 2011-12-04 14:31, Avi Kivity wrote:
> > On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> Introduce the alternative 'kvm-i8259' device model that exploits KVM
> >> in-kernel acceleration.
> >>
> >> The PIIX3 initialization code is furthermore extended by KVM specific
> >> IRQ route setup. Moreover, GSI injection differs in KVM mode from the
> >> user space model. As we can dispatch ISA-range IRQs to both IOAPIC and
> >> PIC inside the kernel, we do not need to inject them separately. This is
> >> reflected by a KVM-specific GSI handler.
> >>
> >> +
> >> +qemu_irq *kvm_i8259_init(void)
> >> +{
> >> +ISADevice *dev;
> >> +
> >> +dev = isa_create("kvm-i8259");
> >>
> > 
> > Same issue.  Is this a different device, or an different implementation
> > of the same device?
>
> They are theoretically the same from guest perspective (therefore you
> can migrate between machines that differ in this).

But the name becomes part of the save/restore ABI, so you can't.

> > 
> > We're forcing migration from 1.0 to 1.1 to disable in-kernel irqchip on
> > the target.  For qemu itself, that's no issue.  But for qemu-kvm, it
> > will result in loss of performance, or hacks to alias the two back together.
>
> We should this happen with qemu-kvm? The vmstates are compatible, thus
> you can migration from old qemu-kvm in-kernel devices to the new kvm-*
> ones (once they are feature-equivalent). Not sure how much hacks this
> may require to qemu-kvm, but I don't think it should make the situation
> worse for that tree.

They aren't compatible due to the name clash.  The hack won't be large
(add an alias for the name), but just one hack is enough to keep the
tree alive for a long while.  Better not to add it in the first place.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:31, Avi Kivity wrote:
> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Introduce the alternative 'kvm-i8259' device model that exploits KVM
>> in-kernel acceleration.
>>
>> The PIIX3 initialization code is furthermore extended by KVM specific
>> IRQ route setup. Moreover, GSI injection differs in KVM mode from the
>> user space model. As we can dispatch ISA-range IRQs to both IOAPIC and
>> PIC inside the kernel, we do not need to inject them separately. This is
>> reflected by a KVM-specific GSI handler.
>>
>> +
>> +qemu_irq *kvm_i8259_init(void)
>> +{
>> +ISADevice *dev;
>> +
>> +dev = isa_create("kvm-i8259");
>>
> 
> Same issue.  Is this a different device, or an different implementation
> of the same device?

They are theoretically the same from guest perspective (therefore you
can migrate between machines that differ in this).

> 
> We're forcing migration from 1.0 to 1.1 to disable in-kernel irqchip on
> the target.  For qemu itself, that's no issue.  But for qemu-kvm, it
> will result in loss of performance, or hacks to alias the two back together.

We should this happen with qemu-kvm? The vmstates are compatible, thus
you can migration from old qemu-kvm in-kernel devices to the new kvm-*
ones (once they are feature-equivalent). Not sure how much hacks this
may require to qemu-kvm, but I don't think it should make the situation
worse for that tree.

Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Avi Kivity

On 12/04/2011 03:30 PM, Jan Kiszka wrote:
> > Well, I have to comment on something.  If you don't want spelling
> > corrections, leave some trailing whitespace.
>
> I could create a messpatch.pl...

Ah, and with a --reverse flag we could go through the motions of patch
review without requiring a repost.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 14/16] kvm: x86: Add user space part for in-kernel i8259

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> Introduce the alternative 'kvm-i8259' device model that exploits KVM
> in-kernel acceleration.
>
> The PIIX3 initialization code is furthermore extended by KVM specific
> IRQ route setup. Moreover, GSI injection differs in KVM mode from the
> user space model. As we can dispatch ISA-range IRQs to both IOAPIC and
> PIC inside the kernel, we do not need to inject them separately. This is
> reflected by a KVM-specific GSI handler.
>
> +
> +qemu_irq *kvm_i8259_init(void)
> +{
> +ISADevice *dev;
> +
> +dev = isa_create("kvm-i8259");
>

Same issue.  Is this a different device, or an different implementation
of the same device?

We're forcing migration from 1.0 to 1.1 to disable in-kernel irqchip on
the target.  For qemu itself, that's no issue.  But for qemu-kvm, it
will result in loss of performance, or hacks to alias the two back together.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:28, Avi Kivity wrote:
> On 12/04/2011 03:27 PM, Jan Kiszka wrote:
>> On 2011-12-04 14:23, Avi Kivity wrote:
>>> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
 From: Jan Kiszka 

 Add the basic infrastructure to active in-kernel irqchip support, inject
 interrupts into these models, and maintain IRQ routes.

 Routing is optional and depends on the host arch supporting
 KVM_CAP_IRQ_ROUTING. When it's not available on x86, we loose the HPET
>>>
>>> lose
>>
>> /me is still looking for a semantic proofreader plugin...
>>
> 
> Well, I have to comment on something.  If you don't want spelling
> corrections, leave some trailing whitespace.

I could create a messpatch.pl...

Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Avi Kivity

On 12/04/2011 03:27 PM, Jan Kiszka wrote:
> On 2011-12-04 14:23, Avi Kivity wrote:
> > On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> Add the basic infrastructure to active in-kernel irqchip support, inject
> >> interrupts into these models, and maintain IRQ routes.
> >>
> >> Routing is optional and depends on the host arch supporting
> >> KVM_CAP_IRQ_ROUTING. When it's not available on x86, we loose the HPET
> > 
> > lose
>
> /me is still looking for a semantic proofreader plugin...
>

Well, I have to comment on something.  If you don't want spelling
corrections, leave some trailing whitespace.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:23, Avi Kivity wrote:
> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Add the basic infrastructure to active in-kernel irqchip support, inject
>> interrupts into these models, and maintain IRQ routes.
>>
>> Routing is optional and depends on the host arch supporting
>> KVM_CAP_IRQ_ROUTING. When it's not available on x86, we loose the HPET
> 
> lose

/me is still looking for a semantic proofreader plugin...

> 
>> as we can't route GSI0 to IOAPIC pin 2.
>>
>> In-kernel irqchip support will once be controlled by the machine
>> property 'kernel_irqchip', but this is not yet wired up.
>>
>>
> 




signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Avi Kivity

On 12/04/2011 03:16 PM, Jan Kiszka wrote:
> On 2011-12-04 14:12, Avi Kivity wrote:
> > On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> Rename msix_supported to msi_supported and control MSI and MSI-X
> >> activation this way. That was likely to original intention for this
> >> flag, but MSI support came after MSI-X.
> > 
> > 'and' is a dangerous word in a changelog entry.
>
> This patch hardly qualifies for two IMHO.

If we don't have to change it, no.

>
> > 
> >>
> >> +
> >> +if (!msi_supported) {
> >> +return -ENOTSUP;
> >> +}
> >> +
> >>
> > 
> > This changes behaviour.  qemu 1.0 -M pc-1.0 and qemu-1.1 -M pc-1.0 will
> > be different after this, no?
> > 
>
> Only isapc had msix_supported = 0, and I doubt we got there (msi_init)
> for that machine. Or am I missing something?
>

Ah, I thought it was a user-settable property, but it isn't.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 13/16] kvm: x86: Add user space part for in-kernel APIC

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> This introduces the alternative APIC model 'kvm-apic' which makes use of
> KVM's in-kernel device model. MSI is not yet supported, so we disable
> this when the in-kernel model is in use.
>
>  
> -dev = qdev_create(NULL, "apic");
> +if (kvm_enabled() && kvm_irqchip_in_kernel()) {
> +dev = qdev_create(NULL, "kvm-apic");
> +} else {
> +dev = qdev_create(NULL, "apic");
> +}

Is there anything that makes those two devices incompatible?


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 10/16] memory: Introduce memory_region_init_reservation

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:20, Avi Kivity wrote:
> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Introduce a memory region type that can reserve I/O space. Such regions
>> are useful for modeling I/O that is only handled outside of QEMU, i.e.
>> in the context of an accelerator like KVM. Any access to such a region
>> from QEMU is a bug and will be reported as such.
> 
> This is guest triggerable (DMA into the region), so abort() is too drastic.

Mmh, true. Will turn it into a print-once warning.

> 
>> +void memory_region_init_reservation(MemoryRegion *mr,
>> +const char *name,
>> +uint64_t size)
>> +{
>> +memory_region_init(mr, name, size);
>> +mr->ops = &reservation_ops;
>> +mr->opaque = mr;
>> +mr->terminates = true;
>> +mr->backend_registered = false;
>> +}
> 
> Just calling memory_region_init_io() is simpler, no?

Yep.

Thanks,
Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 11/16] kvm: Introduce core services for in-kernel irqchip support

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> Add the basic infrastructure to active in-kernel irqchip support, inject
> interrupts into these models, and maintain IRQ routes.
>
> Routing is optional and depends on the host arch supporting
> KVM_CAP_IRQ_ROUTING. When it's not available on x86, we loose the HPET

lose

> as we can't route GSI0 to IOAPIC pin 2.
>
> In-kernel irqchip support will once be controlled by the machine
> property 'kernel_irqchip', but this is not yet wired up.
>
>

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 10/16] memory: Introduce memory_region_init_reservation

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> Introduce a memory region type that can reserve I/O space. Such regions
> are useful for modeling I/O that is only handled outside of QEMU, i.e.
> in the context of an accelerator like KVM. Any access to such a region
> from QEMU is a bug and will be reported as such.

This is guest triggerable (DMA into the region), so abort() is too drastic.

> +void memory_region_init_reservation(MemoryRegion *mr,
> +const char *name,
> +uint64_t size)
> +{
> +memory_region_init(mr, name, size);
> +mr->ops = &reservation_ops;
> +mr->opaque = mr;
> +mr->terminates = true;
> +mr->backend_registered = false;
> +}

Just calling memory_region_init_io() is simpler, no?


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Jan Kiszka

On 2011-12-04 14:12, Avi Kivity wrote:
> On 12/03/2011 01:17 PM, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Rename msix_supported to msi_supported and control MSI and MSI-X
>> activation this way. That was likely to original intention for this
>> flag, but MSI support came after MSI-X.
> 
> 'and' is a dangerous word in a changelog entry.

This patch hardly qualifies for two IMHO.

> 
>>
>> +
>> +if (!msi_supported) {
>> +return -ENOTSUP;
>> +}
>> +
>>
> 
> This changes behaviour.  qemu 1.0 -M pc-1.0 and qemu-1.1 -M pc-1.0 will
> be different after this, no?
> 

Only isapc had msix_supported = 0, and I doubt we got there (msi_init)
for that machine. Or am I missing something?

Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:17 PM, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> Rename msix_supported to msi_supported and control MSI and MSI-X
> activation this way. That was likely to original intention for this
> flag, but MSI support came after MSI-X.

'and' is a dangerous word in a changelog entry.

>
> +
> +if (!msi_supported) {
> +return -ENOTSUP;
> +}
> +
>

This changes behaviour.  qemu 1.0 -M pc-1.0 and qemu-1.1 -M pc-1.0 will
be different after this, no?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ANNOUNCE] qemu-kvm-1.0

2011-12-04 Thread Avi Kivity

qemu-kvm-1.0 is now available. This release is based on the upstream
QEMU 1.0 plus kvm-specific enhancements.  Please see the QEMU 1.0
announcement [1] for more details.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.1.


http://www.linux-kvm.org

[1] http://article.gmane.org/gmane.comp.emulators.qemu/127090
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> >
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> 
> How much better?

host->guest was same with both indirect on and off, and guest->host went
up by 5% with indirect on.

This was just a simple 1 TCP stream test.

> 
> I think that if indirects benefit networking, then we're doing something
> wrong.  What's going on?  Does the ring get filled too early?  If so we
> should expand it.
> 

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] kvm tools: Allow easily sandboxing applications within a guest

2011-12-04 Thread Sasha Levin

On Sun, 2011-12-04 at 12:25 +0200, Avi Kivity wrote:
> On 12/02/2011 09:16 AM, Sasha Levin wrote:
> > This is useful when testing code that might cause problems on the host, or
> > to automate kernel testing since it's now easy to link a kvm tools test
> > script with 'git bisect run'.
> 
> This tie-up into git bisect is a really cool idea.
> 
> With device assignment, you can even bisect driver bugs this way.

Yup, it makes bisecting most issues which are reproducible pretty easy.

Yesterday I've managed to bisect the issue in '[BUG] net: kernel BUG at
include/net/netns/generic.h:40!' without having to touch the process
once. Obviously I was pretty happy :)

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance

2011-12-04 Thread Gleb Natapov

On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka  wrote:
> > On 2011-12-02 07:26, Liu Ping Fan wrote:
> >> From: Liu Ping Fan 
> >>
> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >
> > I'm lacking the big picture yet (would be good to have in the change log
> > - at least I'm too lazy to read the code):
> >
> > What increments the refcnt, what decrements it again? IOW, how does user
> > space controls the life-cycle of a vcpu after your changes?
> >
> In local APIC mode, delivering IPI to target APIC, target's refcnt is
> incremented, and decremented when finished. At other times, using RCU to
Why is this needed?

> protect the vcpu's reference from its destruction.
> 
> If kvm_vcpu is not needed by guest, user space can close the
> kvm_vcpu's file
> descriptors, and then,if the kvm_vcpu has crossed the period of local
> APCI mode's reference,it will be destroyed.
> 
> Regards,
> ping fan
> 
> > Thanks,
> > Jan
> >
> > --
> > Siemens AG, Corporate Technology, CT T DE IT 1
> > Corporate Competence Center Embedded Linux

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Avi Kivity

On 12/04/2011 02:01 PM, Michael S. Tsirkin wrote:
> > 
> > How much better?
> > 
> > I think that if indirects benefit networking, then we're doing something
> > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > should expand it.
>
> The ring is physically contigious.
> With 256 entries and 64 bytes each, that's already 16K.

A descriptor is just 16 bytes.  There's also the used ring, but that's a
mistake if you have out of order completion.  We should have used copying.

16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
just 1.5 ms.  In any case I think it's sufficient.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sun, Dec 04, 2011 at 01:52:16PM +0200, Avi Kivity wrote:
> On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> >
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> 
> How much better?
> 
> I think that if indirects benefit networking, then we're doing something
> wrong.  What's going on?  Does the ring get filled too early?  If so we
> should expand it.

The ring is physically contigious.
With 256 entries and 64 bytes each, that's already 16K.

> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance

2011-12-04 Thread Liu ping fan

On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka  wrote:
> On 2011-12-02 07:26, Liu Ping Fan wrote:
>> From: Liu Ping Fan 
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>
> I'm lacking the big picture yet (would be good to have in the change log
> - at least I'm too lazy to read the code):
>
> What increments the refcnt, what decrements it again? IOW, how does user
> space controls the life-cycle of a vcpu after your changes?
>
In local APIC mode, delivering IPI to target APIC, target's refcnt is
incremented, and decremented when finished. At other times, using RCU to
protect the vcpu's reference from its destruction.

If kvm_vcpu is not needed by guest, user space can close the
kvm_vcpu's file
descriptors, and then,if the kvm_vcpu has crossed the period of local
APCI mode's reference,it will be destroyed.

Regards,
ping fan

> Thanks,
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Avi Kivity

On 12/03/2011 01:50 PM, Sasha Levin wrote:
> On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin"  
> > wrote:
> > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > We'll presumably need some logic to increment is back,
> > > > > to account for random workload changes.
> > > > > Something like slow start?
> > > > 
> > > > We can increment it each time the queue was less than 10% full, it
> > > > should act like slow start, no?
> > > 
> > > No, we really shouldn't get an empty ring as long as things behave
> > > well. What I meant is something like:
> > 
> > I was thinking of the network output case, but you're right.  We need to
> > distinguish between usually full (eg. virtio-net input) and usually
> > empty (eg. virtio-net output).
> > 
> > The signal for "we to pack more into the ring" is different.  We could
> > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > when we set up the virtqueue.
> > 
> > Our other alternative, moving the logic to the driver, is worse.
> > 
> > As to fading the effect over time, that's harder.  We have to deplete
> > the ring quite a few times before it turns into always-indirect.  We
> > could back off every time the ring is totally idle, but that may hurt
> > bursty traffic.  Let's try simple first?
>
> I tried to take a different approach, and tried putting the indirect
> descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> that this way virtio-net actually worked faster with indirect on even in
> a single stream.

How much better?

I think that if indirects benefit networking, then we're doing something
wrong.  What's going on?  Does the ring get filled too early?  If so we
should expand it.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-12-04 Thread Michael S. Tsirkin

On Sat, Dec 03, 2011 at 01:50:28PM +0200, Sasha Levin wrote:
> On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin"  
> > wrote:
> > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > We'll presumably need some logic to increment is back,
> > > > > to account for random workload changes.
> > > > > Something like slow start?
> > > > 
> > > > We can increment it each time the queue was less than 10% full, it
> > > > should act like slow start, no?
> > > 
> > > No, we really shouldn't get an empty ring as long as things behave
> > > well. What I meant is something like:
> > 
> > I was thinking of the network output case, but you're right.  We need to
> > distinguish between usually full (eg. virtio-net input) and usually
> > empty (eg. virtio-net output).
> > 
> > The signal for "we to pack more into the ring" is different.  We could
> > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > when we set up the virtqueue.
> > 
> > Our other alternative, moving the logic to the driver, is worse.
> > 
> > As to fading the effect over time, that's harder.  We have to deplete
> > the ring quite a few times before it turns into always-indirect.  We
> > could back off every time the ring is totally idle, but that may hurt
> > bursty traffic.  Let's try simple first?
> 
> I tried to take a different approach, and tried putting the indirect
> descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> that this way virtio-net actually worked faster with indirect on even in
> a single stream.
> 
> Maybe we can do that instead of playing with threshold for now.
> 
> The question here, how much wasted space we can afford? since indirect
> descriptors would have to be the same size we'd have a bunch of them
> wasted in the cache. Ofcourse we can make that configurable, but how
> much is ok by default?

I think it's a good idea to make that per-device.
For network at least, each skb already has overhead of
around 1/2 K, so using up to 1/2K more seems acceptable.
But even if we went up to MAX_SKB_FRAGS+2, it would be
only 1K per ring entry, so for a ring of 256 entries, we end up with
256K max waste. That's not that terrible.

But I'd say let's do some benchmarking to figure out
the point where the gains are becoming very small.



> -- 
> 
> Sasha.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-04 Thread Rusty Russell

On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt 
 wrote:
> On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
> > A trivial, albeit sub-optimal, solution would be to simply revert
> > commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
> > that's going to have a negative impact on performance of SMP-based
> > virtualization use cases.
> 
> Have you measured the impact of using normal barriers (non-SMP ones)
> like we use on normal HW drivers unconditionally ?
> 
> IE. If the difference is small enough I'd say just go for it and avoid
> the bloat.

Yep.  Plan is:
1) Measure the difference.
2) Difference unmeassurable?  Use normal barriers (ie. revert d57ed95).
3) Difference small?  Revert d57ed95 for 3.2, revisit for 3.3.
4) Difference large?  Runtime switch based on "if you're PCI" for 3.2,
   revisit for 3.3.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [RFC][PATCH 02/16] kvm: Move kvmclock into hw/kvm folder

2011-12-04 Thread Jan Kiszka

On 2011-12-04 11:43, Avi Kivity wrote:
> On 12/04/2011 12:33 AM, Jan Kiszka wrote:
>> Do we have a convention that every include in <> is considered system
>> header? Should probably be documented then (and code should be converted
>> gradually).
> 
> It's documented in "The C Programming Language", by K&R.

It's just a convention, nothing more. If you consider certain parts of
QEMU's API as system (e.g. the parts that may once make our modular
API), it makes some sense to use <> for. Right now this happens for some
parts of the hw API. But inconsistently.

Jan

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [RFC][PATCH 02/16] kvm: Move kvmclock into hw/kvm folder

2011-12-04 Thread Avi Kivity

On 12/04/2011 12:33 AM, Jan Kiszka wrote:
> Do we have a convention that every include in <> is considered system
> header? Should probably be documented then (and code should be converted
> gradually).

It's documented in "The C Programming Language", by K&R.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Jan Kiszka

On 2011-12-04 11:42, Michael S. Tsirkin wrote:
> On Sat, Dec 03, 2011 at 12:17:26PM +0100, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Rename msix_supported to msi_supported and control MSI and MSI-X
>> activation this way. That was likely to original intention for this
>> flag, but MSI support came after MSI-X.
>>
>> Signed-off-by: Jan Kiszka 
> 
> Acked-by: Michael S. Tsirkin 
> 
> This patch should go into qemu.git, right?

Right. It was just that this series depends on it. Feel free to pick it
up earlier.

Jan



signature.asc
Description: OpenPGP digital signature

Re: [RFC][PATCH 01/16] msi: Generalize msix_supported to msi_supported

2011-12-04 Thread Michael S. Tsirkin

On Sat, Dec 03, 2011 at 12:17:26PM +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> Rename msix_supported to msi_supported and control MSI and MSI-X
> activation this way. That was likely to original intention for this
> flag, but MSI support came after MSI-X.
> 
> Signed-off-by: Jan Kiszka 

Acked-by: Michael S. Tsirkin 

This patch should go into qemu.git, right?

> ---
>  hw/msi.c  |8 
>  hw/msi.h  |2 ++
>  hw/msix.c |9 -
>  hw/msix.h |2 --
>  hw/pc.c   |4 ++--
>  5 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/msi.c b/hw/msi.c
> index f214fcf..5d6ceb6 100644
> --- a/hw/msi.c
> +++ b/hw/msi.c
> @@ -36,6 +36,9 @@
>  
>  #define PCI_MSI_VECTORS_MAX 32
>  
> +/* Flag for interrupt controller to declare MSI/MSI-X support */
> +bool msi_supported;
> +
>  /* If we get rid of cap allocator, we won't need this. */
>  static inline uint8_t msi_cap_sizeof(uint16_t flags)
>  {
> @@ -116,6 +119,11 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
>  uint16_t flags;
>  uint8_t cap_size;
>  int config_offset;
> +
> +if (!msi_supported) {
> +return -ENOTSUP;
> +}
> +
>  MSI_DEV_PRINTF(dev,
> "init offset: 0x%"PRIx8" vector: %"PRId8
> " 64bit %d mask %d\n",
> diff --git a/hw/msi.h b/hw/msi.h
> index 5766018..3040bb0 100644
> --- a/hw/msi.h
> +++ b/hw/msi.h
> @@ -24,6 +24,8 @@
>  #include "qemu-common.h"
>  #include "pci.h"
>  
> +extern bool msi_supported;
> +
>  bool msi_enabled(const PCIDevice *dev);
>  int msi_init(struct PCIDevice *dev, uint8_t offset,
>   unsigned int nr_vectors, bool msi64bit, bool 
> msi_per_vector_mask);
> diff --git a/hw/msix.c b/hw/msix.c
> index b15bafc..8850fbd 100644
> --- a/hw/msix.c
> +++ b/hw/msix.c
> @@ -12,6 +12,7 @@
>   */
>  
>  #include "hw.h"
> +#include "msi.h"
>  #include "msix.h"
>  #include "pci.h"
>  #include "range.h"
> @@ -32,9 +33,6 @@
>  #define MSIX_MAX_ENTRIES 32
>  
>  
> -/* Flag for interrupt controller to declare MSI-X support */
> -int msix_supported;
> -
>  /* Add MSI-X capability to the config space for the device. */
>  /* Given a bar and its size, add MSI-X table on top of it
>   * and fill MSI-X capability in the config space.
> @@ -212,10 +210,11 @@ int msix_init(struct PCIDevice *dev, unsigned short 
> nentries,
>unsigned bar_nr, unsigned bar_size)
>  {
>  int ret;
> +
>  /* Nothing to do if MSI is not supported by interrupt controller */
> -if (!msix_supported)
> +if (!msi_supported) {
>  return -ENOTSUP;
> -
> +}
>  if (nentries > MSIX_MAX_ENTRIES)
>  return -EINVAL;
>  
> diff --git a/hw/msix.h b/hw/msix.h
> index 7e04336..5aba22b 100644
> --- a/hw/msix.h
> +++ b/hw/msix.h
> @@ -29,6 +29,4 @@ void msix_notify(PCIDevice *dev, unsigned vector);
>  
>  void msix_reset(PCIDevice *dev);
>  
> -extern int msix_supported;
> -
>  #endif
> diff --git a/hw/pc.c b/hw/pc.c
> index 9328ee5..5225d5b 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -36,7 +36,7 @@
>  #include "elf.h"
>  #include "multiboot.h"
>  #include "mc146818rtc.h"
> -#include "msix.h"
> +#include "msi.h"
>  #include "sysbus.h"
>  #include "sysemu.h"
>  #include "blockdev.h"
> @@ -896,7 +896,7 @@ static DeviceState *apic_init(void *env, uint8_t apic_id)
>  apic_mapped = 1;
>  }
>  
> -msix_supported = 1;
> +msi_supported = true;
>  
>  return dev;
>  }
> -- 
> 1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] kvm tools: Allow easily sandboxing applications within a guest

2011-12-04 Thread Avi Kivity

On 12/02/2011 09:16 AM, Sasha Levin wrote:
> This is useful when testing code that might cause problems on the host, or
> to automate kernel testing since it's now easy to link a kvm tools test
> script with 'git bisect run'.

This tie-up into git bisect is a really cool idea.

With device assignment, you can even bisect driver bugs this way.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance

2011-12-04 Thread Avi Kivity

On 12/02/2011 08:26 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan 
>
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>
>  
> @@ -315,9 +315,17 @@ static void pit_do_work(struct work_struct *work)
>* LVT0 to NMI delivery. Other PIC interrupts are just sent to
>* VCPU0, and only if its LVT0 is in EXTINT mode.
>*/
> - if (kvm->arch.vapics_in_nmi_mode > 0)
> - kvm_for_each_vcpu(i, vcpu, kvm)
> + if (kvm->arch.vapics_in_nmi_mode > 0) {
> + rcu_read_lock();
> + kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
> + vcpu = kvm_get_vcpu(kvm, i);
> + if (vcpu == NULL)
> + continue;
> + cnt++;
>   kvm_apic_nmi_wd_deliver(vcpu);
> + }
> + rcu_read_unlock();
> + }
>   }
>  }

This pattern keeps repeating, please fold it into kvm_for_each_vcpu().

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-12-04 Thread Avi Kivity

On 12/03/2011 06:37 AM, Takuya Yoshikawa wrote:
> Avi Kivity  wrote:
> > That's true.  But some applications do require low latency, and the
> > current code can impose a lot of time with the mmu spinlock held.
> > 
> > The total amount of work actually increases slightly, from O(N) to O(N
> > log N), but since the tree is so wide, the overhead is small.
> > 
>
> Controlling the latency can be achieved by making the user space limit
> the number of dirty pages to scan without hacking the core mmu code.
>
>   The fact that we cannot transfer so many pages on the network at
>   once suggests this is reasonable.

That is true.  Write protecting everything at once means that there is a
large window between the sampling the dirty log, and transferring the
page.  Any writes within that window cause a re-transfer, even when they
should not.

>
> With the rmap write protection method in KVM, the only thing we need is
> a new GET_DIRTY_LOG api which takes the [gfn_start, gfn_end] to scan,
> or max_write_protections optionally.

Right.

>
>   I remember that someone suggested splitting the slot at KVM forum.
>   Same effect with less effort.
>
> QEMU can also avoid unwanted page faults by using this api wisely.
>
>   E.g. you can use this for "Interactivity improvements" TODO on
>   KVM wiki, I think.
>
> Furthermore, QEMU may be able to use multiple threads for the memory
> copy task.
>
>   Each thread has its own range of memory to copy, and does
>   GET_DIRTY_LOG independently.  This will make things easy to
>   add further optimizations in QEMU.
>
> In summary, my impression is that the main cause of the current latency
> problem is not the write protection of KVM but the strategy which tries
> to cook the large slot in one hand.
>
> What do you think?

I agree.  Maybe O(1) write protection has a place, but it is secondary
to fine-grained dirty logging, and if we implement it, it should be
after your idea, and further measurements.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: IOAPIC doesn't handle byte writes

2011-12-04 Thread Gleb Natapov

On Tue, Nov 22, 2011 at 02:13:02PM +0100, Julian Stecklina wrote:
> Hello,
> 
> KVM emulates an IOAPIC that doesn't handle byte writes to its
> IOAPIC_REG_SELECT register, although for example the ICH10 spec[1]
> clearly states that this is an 8-bit register. See
> http://www.intel.com/content/dam/doc/datasheet/io-controller-hub-10-family-datasheet.pdf
>  Table 13-4 on page 433.
> 
The same spec also says that access to data register has to be dword,
but new code as of 1b8cf174cf2c0f0c66030fe81af818e9abf4f302 does not
enforce this. Old code never enforced it for read too :(

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

90 matches

Mail list logo