[PATCH 2/2] [TRIVIAL] KVM: fix a typo in comment

2012-06-14 Thread Guo Chao

Signed-off-by: Guo Chao 
---
 arch/x86/kvm/vmx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f48cef3..7593693 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5880,7 +5880,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
return handle_invalid_guest_state(vcpu);
 
/*
-* the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+* The KVM_REQ_EVENT optimization bit is only on for one entry, and if
 * we did not inject a still-pending event to L1 now because of
 * nested_run_pending, we need to re-enable this bit.
 */
-- 
1.7.9.5



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: code clean for vmx_init()

2012-06-14 Thread Guo Chao

Signed-off-by: Guo Chao 
---
 arch/x86/kvm/vmx.c |   16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..f48cef3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7230,23 +7230,21 @@ static int __init vmx_init(void)
if (!vmx_io_bitmap_a)
return -ENOMEM;
 
+   r = -ENOMEM;
+
vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_io_bitmap_b) {
-   r = -ENOMEM;
+   if (!vmx_io_bitmap_b) 
goto out;
-   }
 
vmx_msr_bitmap_legacy = (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_legacy) {
-   r = -ENOMEM;
+   if (!vmx_msr_bitmap_legacy) 
goto out1;
-   }
+   
 
vmx_msr_bitmap_longmode = (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_longmode) {
-   r = -ENOMEM;
+   if (!vmx_msr_bitmap_longmode) 
goto out2;
-   }
+   
 
/*
 * Allow direct access to the PC debug port (it is often used for I/O
-- 
1.7.9.5
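[Editor's note] The cleanup above uses a common kernel idiom: when several consecutive allocations can only fail with the same code, assign `r = -ENOMEM` once before the chain and let every failure branch jump straight to its unwind label. A minimal user-space sketch of the pattern (names are illustrative, not taken from vmx.c):

```c
#include <errno.h>
#include <stdlib.h>

/* Illustrative only: same shape as vmx_init()'s allocation chain. */
static void *a, *b, *c;

int init_chain(void)
{
    int r = -ENOMEM;        /* one assignment covers every branch below */

    a = malloc(4096);
    if (!a)
        goto out;

    b = malloc(4096);
    if (!b)
        goto out1;

    c = malloc(4096);
    if (!c)
        goto out2;

    return 0;               /* success */

out2:
    free(b);
out1:
    free(a);
out:
    return r;
}
```

(As posted, the patch also leaves trailing whitespace on the brace-less `if` lines, which scripts/checkpatch.pl would normally complain about.)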





Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO

2012-06-14 Thread HATAYAMA Daisuke
From: Avi Kivity 
Subject: Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO
Date: Thu, 14 Jun 2012 16:37:58 +0300

> On 05/16/2012 10:55 AM, zhangyanfei wrote:

>> +
>> +/*
>> + * The format of VMCSINFO is given below:
>> + *   +-+--+
>> + *   | Byte offset | Contents |
>> + *   +-+--+
>> + *   | 0   | VMCS revision identifier |
>> + *   +-+--+
>> + *   | 4   |   |
>> + *   +-+--+
>> + *   | 16  |   |
>> + *   +-+--+
>> + *   ..
>> + *
>> + * The first 32 bits of VMCSINFO contain the VMCS revision
>> + * identifier.
>> + * The remainder of VMCSINFO is used for field/encoded-offset
>> + * sets. Each set takes 12 bytes: the field occupies 4 bytes
>> + * and its corresponding encoded offset occupies 8 bytes.
>> + *
>> + * Encoded offsets are raw values read by vmcs_read{16, 64, 32, l},
>> + * and they are all zero-extended to 8 bytes so that each
>> + * set has the same size.
>> + * We do not decode offsets here. The decoding work is deferred
>> + * to userspace tools.
> 
> It's better to do the decoding here, or no one will know how to do it.
> Also have an nfields field.

We did so because the actual internal format is unknown; it could
possibly be encrypted, although that is unlikely. We thought processing
such unknown data should be done in userland debugging tools. But
decoding it here is no problem: the decoding consists of simple shift
operations and carries no risk of affecting system state. The worst
case is displaying broken values, which is the same as displaying the
encrypted values to users.
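[Editor's note] For what it's worth, the quoted format is easy to walk in user space. A hedged sketch of a parser follows; the struct and function names are invented for illustration, and the offset decoding is deliberately left out, matching the discussion above:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One <field, encoded offset> set as described above: 4 + 8 = 12 bytes. */
struct vmcsinfo_set {
    uint32_t field;          /* VMCS field encoding */
    uint64_t encoded_offset; /* raw vmcs_read value, zero-extended to 8 bytes */
};

/* Walk a VMCSINFO blob of 'len' bytes; returns the number of sets parsed. */
size_t parse_vmcsinfo(const uint8_t *buf, size_t len)
{
    uint32_t revision_id;
    size_t off = 4, n = 0;

    if (len < 4)
        return 0;
    memcpy(&revision_id, buf, 4);
    printf("VMCS revision identifier: 0x%x\n", revision_id);

    while (off + 12 <= len) {
        struct vmcsinfo_set s;
        memcpy(&s.field, buf + off, 4);
        memcpy(&s.encoded_offset, buf + off + 4, 8);
        printf("field 0x%08x -> encoded offset 0x%016llx\n",
               s.field, (unsigned long long)s.encoded_offset);
        off += 12;
        n++;
    }
    return n;
}
```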

FYI, the assumptions behind the reverse engineering method are:

- For each field, the corresponding offset is unique; IOW, any two
  offsets for two different fields differ from each other.
- Field sizes remain 8, 16, or 32 bytes even in the VMCS region.
- Each field is 8-byte aligned in the VMCS region.
- Some fields may be big endian in the VMCS region.
  - We found this fact during development. We give up if the data is
    modified more drastically.
- offset < VMCS region size

Thanks.
HATAYAMA, Daisuke



Re: Early boot panic on machine with lots of memory

2012-06-14 Thread Yinghai Lu
On Thu, Jun 14, 2012 at 5:59 PM, Sasha Levin  wrote:
> On Thu, 2012-06-14 at 16:57 -0700, Yinghai Lu wrote:
>> can you please boot with "memtest" to see if there is any memory problem?
>
> The host got a memtest treatment, nothing found.

Can you try booting the guest with memtest?

Thanks

Yinghai


RE: Biweekly KVM Test report, kernel 51bfd299... qemu a1fce560...

2012-06-14 Thread Ren, Yongjie
> -Original Message-
> From: Kevin Wolf [mailto:kw...@redhat.com]
> Sent: Wednesday, June 13, 2012 6:33 PM
> To: Ren, Yongjie
> Cc: Marcelo Tosatti; Stefan Hajnoczi; Avi Kivity; kvm@vger.kernel.org; Liu,
> RongrongX; Anthony Liguori
> Subject: Re: Biweekly KVM Test report, kernel 51bfd299... qemu
> a1fce560...
> 
> Am 13.06.2012 12:28, schrieb Ren, Yongjie:
> >> -Original Message-
> >> From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
> >> Sent: Wednesday, June 13, 2012 7:25 AM
> >> To: Kevin Wolf
> >> Cc: Stefan Hajnoczi; Ren, Yongjie; Avi Kivity; kvm@vger.kernel.org; Liu,
> >> RongrongX; Anthony Liguori
> >> Subject: Re: Biweekly KVM Test report, kernel 51bfd299... qemu
> >> a1fce560...
> >>
> >> On Tue, Jun 12, 2012 at 09:45:16AM +0200, Kevin Wolf wrote:
> >>> Am 12.06.2012 03:52, schrieb Marcelo Tosatti:
>  On Thu, Jun 07, 2012 at 01:13:50PM +0100, Stefan Hajnoczi wrote:
> > The 1st bad commit in your attached list is abc551bd
> > More detailed info:
> > 171d2f2249a360d7d623130d3aa991418c53716d   good
> > fd453a24166e36a3d376c9bc221e520e3ee425afgood
> > abc551bd456cf0407fa798395d83dc5aa35f6dbb bad
> > 823ccf41509baa197dd6a3bef63837a6cf101ad8 bad
> 
>  Thanks, this points to the qcow2 v3 changes. Let's try to find the
>  exact culprit. I have rebased the qcow2 patches on top of that good
>  merge (fd453a24). Please apply the attached mbox on top of this
>  merge:
> 
>  git checkout fd453a24
>  git am qcow2v3.mbox
> 
>  You can then start a git bisect between your new HEAD and
>  fd453a24.
> >>>
> >>> Another thing you could try is if reverting e82dabd and bef0fd59
> >>> fixes the problem for you.
> >>>
> >> After reverting bef0fd59, it can work fine.
> >
> > I'm unable to reproduce this with qemu-kvm.git/master
> > (18b012756fd041556764dfa2bb1f307b6923fb71) or the commit you
> > reported as bad (3fd9fedb9fae4517d93d76e93a2924559cacf2f6).  The
> > RHEL 6 guest boots without any issues.  My host kernel is Linux 3.2
> > (Debian 3.2.0-2-amd64).  The guest is Linux 2.6.32-71.el6.
> >
> > Since the guest sees a disk read error, it may be useful to add
> > printfs to hw/ide/core.c:ide_sector_read_cb().  Let's find out why
> > the disk read is failing.
> 
>  Any news on this problem?
> >>>
> >>> We're still waiting for the test results with Jan's timer fix.
> >>
> >> OK, Ren, can you please retest with qemu-kvm.git master? TIA.
> >>
> > Yes, I re-tested with the latest qemu-kvm.git master branch (commit:
> > 0a948cbb).
> > I found the qcow2 image 'disk error' issue can't be reproduced in the
> > latest tree.
> > And I can still reproduce it against commit e54f008e (May 9) as I
> > reported for this bug.
> > Has any recent patch been checked in to fix this issue?
> > Maybe Jan's "kvm: i8254: Fix conversion of in-kernel to userspace
> > state"?
> 
Yes, this is what we expected to fix the problem. If you'd like to verify,
you can try applying only this patch on top of the broken commit e54f008e.
> 
Confirmed. Jan's patch "kvm: i8254: Fix conversion of in-kernel to userspace 
state" fixed the following issue.
https://bugs.launchpad.net/qemu/+bug/1002121



Re: Early boot panic on machine with lots of memory

2012-06-14 Thread Sasha Levin
On Thu, 2012-06-14 at 16:57 -0700, Yinghai Lu wrote:
> On Thu, Jun 14, 2012 at 2:34 PM, Sasha Levin  wrote:
> > On Thu, 2012-06-14 at 13:56 -0700, Yinghai Lu wrote:
> >> On Thu, Jun 14, 2012 at 2:50 AM, Sasha Levin  
> >> wrote:
> >> > On Thu, 2012-06-14 at 12:20 +0900, Tejun Heo wrote:
> >> >> On Wed, Jun 13, 2012 at 11:38:55PM +0200, Sasha Levin wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I'm seeing the following when booting a KVM guest with 65gb of RAM, 
> >> >> > on latest linux-next.
> >> >> >
> >> >> > Note that it happens with numa=off.
> >> >> >
> >> >> > [0.00] BUG: unable to handle kernel paging request at 
> >> >> > 88102febd948
> >> >> > [0.00] IP: [] 
> >> >> > __next_free_mem_range+0x9b/0x155
> >> >>
> >> >> Can you map it back to the source line please?
> >> >
> >> > mm/memblock.c:583
> >> >
> >> >phys_addr_t r_start = ri ? r[-1].base + 
> >> > r[-1].size : 0;
> >> >  97:   85 d2   test   %edx,%edx
> >> >  99:   74 08   je a3 <__next_free_mem_range+0xa3>
> >> >  9b:   49 8b 48 f0 mov-0x10(%r8),%rcx
> >> >  9f:   49 03 48 e8 add-0x18(%r8),%rcx
> >> >
> >> > It's the deref on 9b (r8=88102febd958).
> >>
> >> that reserved.region is allocated by memblock.
> >>
> >> can you boot with "memblock=debug debug ignore_loglevel" and post
> >> whole boot log?
> >
> > Attached below. I've also noticed it doesn't always happen, but
> > increasing the vcpu count (to something around 254) makes it happen
> > almost every time.
> >
> ...
> [0.00] memblock: reserved array is doubled to 512 at
> [0x102febc080-0x102febf07f]
> [0.00]memblock_free: [0x102febf080-0x102fec0880]
> memblock_double_array+0x1b0/0x1e2
> [0.00] memblock_reserve: [0x102febc080-0x102febf080]
> memblock_double_array+0x1c5/0x1e2
> 
> the reserved-regions array got doubled twice, to 512 entries.
> 
> > [0.00]memblock_free: [0x102febc080-0x102febf080] 
> > memblock_free_reserved_regions+0x37/0x39
> > [0.00] BUG: unable to handle kernel paging request at 
> > 88102febd948
> > [0.00] IP: [] __next_free_mem_range+0x9b/0x155
> > [0.00] PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 80102febd160
> 
> the page table for them is:
> 
> [0.00] kernel direct mapping tables up to 0x102fff @ [mem
> 0xc7e3e000-0xcfff]
> [0.00] memblock_reserve: [0x00c7e3e000-0x00cf7fb000]
> native_pagetable_reserve+0xc/0xe
> 
> the only nearby allocation is swiotlb.
> 
> [0.00] __ex_table already sorted, skipping sort
> [0.00] memblock_reserve: [0x00c3e3e000-0x00c7e3e000]
> __alloc_memory_core_early+0x5c/0x73
> ...
> [0.00] memblock_reserve: [0x00cfff8000-0x00d000]
> __alloc_memory_core_early+0x5c/0x73
> [0.00] Checking aperture...
> 
> so the memblock allocation is ok...
> 
> can you please boot with "memtest" to see if there is any memory problem?

The host got a memtest treatment, nothing found.

(I'll cc the KVM folks as well.)



Re: [PATCH v2 00/41] postcopy live migration

2012-06-14 Thread Juan Quintela
Isaku Yamahata  wrote:
> After the long time, we have v2. This is qemu part.
> The Linux kernel part is sent separately.
>
> Changes v1 -> v2:
> - split up patches for review
> - buffered file refactored
> - many bug fixes
>   Especially PV drivers can work with postcopy
> - optimization/heuristic
>
> Patches
> 1 - 30: refactoring existing code and preparation
> 31 - 37: implement postcopy itself (essential part)
> 38 - 41: some optimization/heuristic for postcopy
>

After reviewing the changes, I think we can merge patches 1-30.  For
the rest of them we still need another round of review/coding (at
least we need to implement the error handling).

IMHO, it makes no sense to add CONFIG_POSTCOPY; we can just compile the
code in unconditionally.  Furthermore, we have not ifdef'ed the calls into
this code from the common code anyway.  But that is just my opinion.

Later, Juan.


Re: [PATCH v2 37/41] postcopy: implement outgoing part of postcopy live migration

2012-06-14 Thread Juan Quintela
Isaku Yamahata  wrote:
> This patch implements postcopy live migration for outgoing part
>
> Signed-off-by: Isaku Yamahata 
> ---
> Changes v1 -> v2:
> - fix parameter to qemu_fdopen()
> - handle QEMU_UMEM_REQ_EOC properly
>   when in PO_STATE_ALL_PAGES_SENT, the QEMU_UMEM_REQ_EOC request was
>   ignored; handle it properly.
> - flush on-demand page unconditionally
> - improve postcopy_outgoing_ram_save_live and postcopy_outgoing_begin()
> - use qemu_fopen_fd
> - use memory api instead of obsolete api
> - segv in postcopy_outgoing_check_all_ram_sent()
> - catch up qapi change
> ---
>  arch_init.c   |   19 ++-
>  migration-exec.c  |4 +
>  migration-fd.c|   17 ++
>  migration-postcopy-stub.c |   22 +++
>  migration-postcopy.c  |  450 
> +
>  migration-tcp.c   |   25 ++-
>  migration-unix.c  |   26 ++-
>  migration.c   |   32 +++-
>  migration.h   |   12 ++
>  savevm.c  |   22 ++-
>  sysemu.h  |2 +-
>  11 files changed, 614 insertions(+), 17 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 22d9691..3599e5c 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -154,6 +154,13 @@ static int is_dup_page(uint8_t *page)
>  return 1;
>  }
>  
> +static bool outgoing_postcopy = false;
> +
> +void ram_save_set_params(const MigrationParams *params, void *opaque)
> +{
> +outgoing_postcopy = params->postcopy;
> +}
> +
>  static RAMBlock *last_block_sent = NULL;
>  static uint64_t bytes_transferred;
>  
> @@ -343,6 +350,15 @@ int ram_save_live(QEMUFile *f, int stage, void *opaque)
>  uint64_t expected_time = 0;
>  int ret;
>  
> +if (stage == 1) {
> +bytes_transferred = 0;
> +last_block_sent = NULL;
> +ram_save_set_last_block(NULL, 0);
> +}
> +if (outgoing_postcopy) {
> +return postcopy_outgoing_ram_save_live(f, stage, opaque);
> +}
> +
>  if (stage < 0) {
>  memory_global_dirty_log_stop();
>  return 0;
> @@ -351,9 +367,6 @@ int ram_save_live(QEMUFile *f, int stage, void *opaque)
>  memory_global_sync_dirty_bitmap(get_system_memory());
>  
>  if (stage == 1) {
> -bytes_transferred = 0;
> -last_block_sent = NULL;
> -ram_save_set_last_block(NULL, 0);
>  sort_ram_list();
>  
>  /* Make sure all dirty bits are set */
> diff --git a/migration-exec.c b/migration-exec.c
> index 7f08b3b..a90da5c 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -64,6 +64,10 @@ int exec_start_outgoing_migration(MigrationState *s, const 
> char *command)
>  {
>  FILE *f;
>  
> +if (s->params.postcopy) {
> +return -ENOSYS;
> +}
> +
>  f = popen(command, "w");
>  if (f == NULL) {
>  DPRINTF("Unable to popen exec target\n");
> diff --git a/migration-fd.c b/migration-fd.c
> index 42b8162..83b5f18 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -90,6 +90,23 @@ int fd_start_outgoing_migration(MigrationState *s, const 
> char *fdname)
>  s->write = fd_write;
>  s->close = fd_close;
>  
> +if (s->params.postcopy) {
> +int flags = fcntl(s->fd, F_GETFL);
> +if ((flags & O_ACCMODE) != O_RDWR) {
> +goto err_after_open;
> +}
> +
> +s->fd_read = dup(s->fd);
> +if (s->fd_read == -1) {
> +goto err_after_open;
> +}
> +s->file_read = qemu_fopen_fd(s->fd_read);
> +if (s->file_read == NULL) {
> +close(s->fd_read);
> +goto err_after_open;
> +}
> +}
> +
>  migrate_fd_connect(s);
>  return 0;
>  
> diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
> index f9ebcbe..9c64827 100644
> --- a/migration-postcopy-stub.c
> +++ b/migration-postcopy-stub.c
> @@ -24,6 +24,28 @@
>  #include "sysemu.h"
>  #include "migration.h"
>  
> +int postcopy_outgoing_create_read_socket(MigrationState *s)
> +{
> +return -ENOSYS;
> +}
> +
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +QEMUFile *f, int stage, void *opaque)
> +{
> +return -ENOSYS;
> +}
> +
> +void *postcopy_outgoing_begin(MigrationState *ms)
> +{
> +return NULL;
> +}
> +
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +  void *postcopy)
> +{
> +return -ENOSYS;
> +}
> +
>  int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
>  {
>  return -ENOSYS;
> diff --git a/migration-postcopy.c b/migration-postcopy.c
> index 5913e05..eb37094 100644
> --- a/migration-postcopy.c
> +++ b/migration-postcopy.c
> @@ -177,6 +177,456 @@ static void postcopy_incoming_send_req(QEMUFile *f,
>  }
>  }
>  
> +static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
> +struct qemu_umem_req *req,
> +size_t *offset

Re: [PATCH v2 35/41] postcopy: introduce helper functions for postcopy

2012-06-14 Thread Juan Quintela
Isaku Yamahata  wrote:
> +//#define DEBUG_UMEM
> +#ifdef DEBUG_UMEM
> +#include 
> +#define DPRINTF(format, ...)\
> +do {\
> +printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),\
> +   __func__, __LINE__, ## __VA_ARGS__); \
> +} while (0)

Shouldn't this be in a Linux-specific header file?  And (at least
on my systems) gettid is already defined by glibc.


> +#else
> +#define DPRINTF(format, ...)do { } while (0)
> +#endif


> +
> +#define DEV_UMEM"/dev/umem"
> +
> +UMem *umem_new(void *hostp, size_t size)
> +{
> +struct umem_init uinit = {
> +.size = size,
> +};
> +UMem *umem;
> +
> +assert((size % getpagesize()) == 0);
> +umem = g_new(UMem, 1);
> +umem->fd = open(DEV_UMEM, O_RDWR);
> +if (umem->fd < 0) {
> +perror("can't open "DEV_UMEM);
> +abort();

Can we return an error instead of aborting?  The same goes for the
rest of the abort() calls in this file.


> +size_t umem_pages_size(uint64_t nr)
> +{
> +return sizeof(struct umem_pages) + nr * sizeof(uint64_t);

Can we make sure that the pgoffs field is aligned?  I know it is
aligned as it is now, but better to be sure.

> +}
> +
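[Editor's note] Juan's alignment concern can be made a compile-time guarantee instead of an inspection result. A sketch, assuming a `struct umem_pages` layout consistent with the quoted `umem_pages_size()` (the real definition lives in umem.h and may differ):

```c
#include <stddef.h>
#include <stdint.h>

/* Guessed layout, consistent with the umem_pages_size() helper above. */
struct umem_pages {
    uint64_t nr;        /* number of entries in pgoffs[] */
    uint64_t pgoffs[];  /* flexible array of page offsets */
};

/* Fails to compile if pgoffs ever stops being 8-byte aligned. */
_Static_assert(offsetof(struct umem_pages, pgoffs) % sizeof(uint64_t) == 0,
               "pgoffs must stay 8-byte aligned");

static size_t umem_pages_size(uint64_t nr)
{
    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
}
```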
> +static void umem_write_cmd(int fd, uint8_t cmd)
> +{
> +DPRINTF("write cmd %c\n", cmd);
> +
> +for (;;) {
> +ssize_t ret = write(fd, &cmd, 1);
> +if (ret == -1) {
> +if (errno == EINTR) {
> +continue;
> +} else if (errno == EPIPE) {
> +perror("pipe");
> +DPRINTF("write cmd %c %zd %d: pipe is closed\n",
> +cmd, ret, errno);
> +break;
> +}


Grr, we don't have a function that does a "safe_write".  The most
similar thing in qemu seems to be send_all().
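[Editor's note] The "safe_write" Juan is missing is only a short loop over write(2) that retries on EINTR and continues after short writes; a generic sketch (not QEMU's send_all()):

```c
#include <errno.h>
#include <unistd.h>

/* Write all 'len' bytes, retrying on EINTR and short writes.
 * Returns 0 on success, -errno on error (e.g. -EPIPE). */
int safe_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t ret = write(fd, p, len);
        if (ret < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -errno;  /* real error */
        }
        p += ret;           /* short write: advance and keep going */
        len -= ret;
    }
    return 0;
}
```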

> +
> +perror("pipe");

Can we use a perror() message different from the previous one?

> +DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
> +abort();
> +}
> +
> +break;
> +}
> +}
> +
> +static void umem_read_cmd(int fd, uint8_t expect)
> +{
> +uint8_t cmd;
> +for (;;) {
> +ssize_t ret = read(fd, &cmd, 1);
> +if (ret == -1) {
> +if (errno == EINTR) {
> +continue;
> +}
> +perror("pipe");
> +DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
> +abort();
> +}
> +
> +if (ret == 0) {
> +DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
> +abort();
> +}
> +
> +break;
> +}
> +
> +DPRINTF("read cmd %c\n", cmd);
> +if (cmd != expect) {
> +DPRINTF("cmd %c expect %d\n", cmd, expect);
> +abort();

Ouch.  If we receive garbage, we just exit?

I really think that we should implement error handling.

> +}
> +}
> +
> +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
> +{
> +int ret;
> +uint64_t nr;
> +size_t size;
> +struct umem_pages *pages;
> +
> +ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
> +*offset += sizeof(nr);
> +DPRINTF("ret %d nr %ld\n", ret, nr);
> +if (ret != sizeof(nr) || nr == 0) {
> +return NULL;
> +}
> +
> +size = umem_pages_size(nr);
> +pages = g_malloc(size);

Just thinking about this: couldn't we decide on a "big enough"
buffer, and never send anything bigger than that?  That would remove
the need to malloc()/free() a buffer for each reception.



> +/* qemu side handler */
> +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> +int *offset)
> +{
> +uint64_t i;
> +int page_shift = ffs(getpagesize()) - 1;
> +struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
> +if (pages == NULL) {
> +return NULL;
> +}
> +
> +for (i = 0; i < pages->nr; i++) {
> +ram_addr_t addr = pages->pgoffs[i] << page_shift;
> +
> +/* make pages present by forcibly triggering page fault. */
> +volatile uint8_t *ram = qemu_get_ram_ptr(addr);
> +uint8_t dummy_read = ram[0];
> +(void)dummy_read;   /* suppress unused variable warning */
> +}
> +
> +/*
> + * Very Linux implementation specific.
> + * Make it sure that other thread doesn't fault on the above virtual
> + * address. (More exactly other thread doesn't call fault handler with
> + * the offset.)
> + * the fault handler is called with mmap_sem read locked.
> + * madvise() does down/up_write(mmap_sem)
> + */
> +qemu_madvise(NULL, 0, MADV_NORMAL);

If it is Linux specific, it should be inside a CONFIG_LINUX ifdef, or a
function hidden in some header.

Talking about locking, what guarantees that no other thread enters this
function before this one calls mad

Re: [PATCH v2 36/41] postcopy: implement incoming part of postcopy live migration

2012-06-14 Thread Juan Quintela
Isaku Yamahata  wrote:
> This patch implements postcopy live migration for incoming part
>
> Signed-off-by: Isaku Yamahata 


> +void ram_save_set_params(const MigrationParams *params, void *opaque);

> -register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
> - ram_save_live, NULL, ram_load, NULL);
> +register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
> + ram_save_set_params, ram_save_live, NULL,
> + incoming_postcopy ?
> + postcopy_incoming_ram_load : ram_load, NULL);


ram_save_set_params() used on this patch but defined on next one.

Later, Juan.


Re: [PATCH v2 36/41] postcopy: implement incoming part of postcopy live migration

2012-06-14 Thread Juan Quintela
Isaku Yamahata  wrote:
> This patch implements postcopy live migration for incoming part
>
>  vl.c   |8 +-
>  13 files changed, 1409 insertions(+), 21 deletions(-)
>  copy linux-headers/linux/umem.h => migration-postcopy-stub.c (55%)

Ouch, git got really confused.


> +void postcopy_incoming_prepare(void)
> +{
> +RAMBlock *block;
> +
> +if (!incoming_postcopy) {
> +return;
> +}

We already test the negation of this before calling the function;
isn't the check redundant on one of the two sides?

> +
> +state.state = 0;
> +state.host_page_size = getpagesize();
> +state.host_page_shift = ffs(state.host_page_size) - 1;
> +state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
> +   ram_save_live() */
> +
> +QLIST_FOREACH(block, &ram_list.blocks, next) {
> +block->umem = umem_new(block->host, block->length);
> +block->flags |= RAM_POSTCOPY_UMEM_MASK;
> +}
> +}
> +
> +static int postcopy_incoming_ram_load_get64(QEMUFile *f,
> +ram_addr_t *addr, int *flags)
> +{
> +*addr = qemu_get_be64(f);
> +*flags = *addr & ~TARGET_PAGE_MASK;
> +*addr &= TARGET_PAGE_MASK;
> +return qemu_file_get_error(f);
> +}
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +ram_addr_t addr;
> +int flags;
> +int error;
> +
> +DPRINTF("incoming ram load\n");
> +/*
> + * RAM_SAVE_FLAGS_EOS or
> + * RAM_SAVE_FLAGS_MEM_SIZE + mem size + RAM_SAVE_FLAGS_EOS
> + * see postcopy_outgoing_ram_save_live()
> + */
> +
> +if (version_id != RAM_SAVE_VERSION_ID) {
> +DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
> +version_id, RAM_SAVE_VERSION_ID);
> +return -EINVAL;
> +}
> +error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
> +if (error) {
> +DPRINTF("error %d\n", error);
> +return error;
> +}
> +if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +DPRINTF("EOS\n");
> +return 0;
> +}
> +
> +if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
> +DPRINTF("-EINVAL flags 0x%x\n", flags);
> +return -EINVAL;
> +}
> +error = ram_load_mem_size(f, addr);
> +if (error) {
> +DPRINTF("addr 0x%lx error %d\n", addr, error);
> +return error;
> +}
> +
> +error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +if (error) {
> +DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
> +return error;
> +}
> +if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +DPRINTF("done\n");
> +return 0;
> +}
> +DPRINTF("-EINVAL\n");
> +return -EINVAL;
> +}
> +
> +static void postcopy_incoming_pipe_and_fork_umemd(int mig_read_fd,
> +  QEMUFile *mig_read)
> +{
> +int fds[2];
> +RAMBlock *block;
> +
> +DPRINTF("fork\n");
> +
> +/* socketpair(AF_UNIX)? */
> +
> +if (qemu_pipe(fds) == -1) {
> +perror("qemu_pipe");
> +abort();
> +}
> +state.from_umemd_fd = fds[0];
> +umemd.to_qemu_fd = fds[1];
> +
> +if (qemu_pipe(fds) == -1) {
> +perror("qemu_pipe");
> +abort();
> +}
> +umemd.from_qemu_fd = fds[0];
> +state.to_umemd_fd = fds[1];
> +
> +pid_t child = fork();
> +if (child < 0) {
> +perror("fork");
> +abort();
> +}
> +
> +if (child == 0) {
> +int mig_write_fd;
> +
> +fd_close(&state.to_umemd_fd);
> +fd_close(&state.from_umemd_fd);
> +umemd.host_page_size = state.host_page_size;
> +umemd.host_page_shift = state.host_page_shift;
> +
> +umemd.nr_host_pages_per_target_page =
> +TARGET_PAGE_SIZE / umemd.host_page_size;
> +umemd.nr_target_pages_per_host_page =
> +umemd.host_page_size / TARGET_PAGE_SIZE;
> +
> +umemd.target_to_host_page_shift =
> +ffs(umemd.nr_host_pages_per_target_page) - 1;
> +umemd.host_to_target_page_shift =
> +ffs(umemd.nr_target_pages_per_host_page) - 1;
> +
> +umemd.state = 0;
> +umemd.version_id = state.version_id;
> +umemd.mig_read_fd = mig_read_fd;
> +umemd.mig_read = mig_read;
> +
> +mig_write_fd = dup(mig_read_fd);
> +if (mig_write_fd < 0) {
> +perror("could not dup for writable socket \n");
> +abort();
> +}
> +umemd.mig_write_fd = mig_write_fd;
> +umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
> +
> +postcopy_incoming_umemd(); /* noreturn */
> +}
> +
> +DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
> +fd_close(&umemd.to_qemu_fd);
> +fd_close(&umemd.from_qemu_fd);
> +state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAG

Re: [PATCH v2 00/41] postcopy live migration

2012-06-14 Thread Juan Quintela
Avi Kivity  wrote:
> On 06/08/2012 01:16 PM, Juan Quintela wrote:
>> Anthony Liguori  wrote:
>>
>> That said, we need to measure the time of an async page fault over
>> the network.  If it is too high, postcopy just doesn't work.
>>
>> And no, I haven't seen any measurement that told us that this is going
>> to be fast enough, but there is always hope.
>
> At 10Gb/sec, the time to transfer one page is 4 microseconds.  At
> 40Gb/sec this drops to a microsecond, plus the latency.  This is on par
> with the time to handle a write protection fault that precopy uses.  But
> this can *only* be achieved with RDMA, otherwise the overhead of
> messaging and copying will dominate.
>
> Note this does not mean we should postpone merging until RDMA support is
> ready.  However we need to make sure the kernel interface is RDMA friendly.

Fully agree here.  I always thought that postcopy would need RDMA or
something like it; anything else would just add too much latency.

Later, Juan.
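[Editor's note] Avi's figure is easy to sanity-check: a 4 KiB page is 4096 × 8 = 32768 bits, which at 10 Gb/s takes about 3.3 µs on the wire (close to the quoted 4 µs once latency and messaging overhead are added), and under 1 µs at 40 Gb/s:

```c
/* Time on the wire for one guest page, ignoring latency and framing. */
double page_transfer_us(double page_bytes, double gbit_per_s)
{
    return page_bytes * 8.0 / (gbit_per_s * 1e9) * 1e6;  /* microseconds */
}
```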


Re: KVM internal error with some amounts of guest memory

2012-06-14 Thread Michael Tokarev
On 14.06.2012 23:45, Michael Tokarev wrote:
> On 14.06.2012 23:22, Michael Tokarev wrote:
>> Now that's something else.  Reported by a debian user, but
>> trivially reproducible.
>>
>> $ kvm -m 1.4g
>> KVM internal error. Suberror: 1
>> emulation failure
>> EAX=000e3c54 EBX= ECX= EDX=0cfd
>> ESI= EDI= EBP= ESP=6fe8
>> EIP=000f309b EFL=0016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0010   00c09300 DPL=0 DS   [-WA]
>> CS =0008   00c09b00 DPL=0 CS32 [-RA]
>> SS =0010   00c09300 DPL=0 DS   [-WA]
>> DS =0010   00c09300 DPL=0 DS   [-WA]
>> FS =0010   00c09300 DPL=0 DS   [-WA]
>> GS =0010   00c09300 DPL=0 DS   [-WA]
>> LDT=   8200 DPL=0 LDT
>> TR =   8b00 DPL=0 TSS32-busy
>> GDT= 000fd3a8 0037
>> IDT= 000fd3e6 
>> CR0=0011 CR2= CR3= CR4=
>> DR0= DR1= DR2= 
>> DR3=
>> DR6=0ff0 DR7=0400
>> EFER=
>> Code=ff ff ba 59 00 00 00 a8 10 89 d8 75 09 b9 ef 2f ff ff ff d1  23 59 
>> 5b 5e e9 4a ff ff ff 31 d2 89 f0 e8 6c fa ff ff 89 c6 85 c0 79 ab c7 04 24 
>> 8c 4c

Bisected.

This is introduced by this commit:

8f6f962b994e1402935055ac7093ac977ccc9a5c is the first bad commit
commit 8f6f962b994e1402935055ac7093ac977ccc9a5c
Author: Avi Kivity 
Date:   Wed Feb 29 13:22:12 2012 +0200

kvm: fix unaligned slots

kvm_set_phys_mem() may be passed sections that are not aligned to a page
boundary.  The current code simply brute-forces the alignment which leads
to an inconsistency and an abort().

Fix by aligning the start and the end of the section correctly, discarding
any unaligned head or tail.

This was triggered by a guest sizing a 64-bit BAR that is smaller than a 
page
with PCI_COMMAND_MEMORY enabled and the upper dword clear.

Signed-off-by: Avi Kivity 

:100644 100644 c4babdac0dd3335eab1a9e45371b7df2c0dd1c9c 
4b7a4ae5dd6d9bd0b4cfa37159382654f0641e8d M  kvm-all.c

Once again, this affects both qemu-kvm and qemu (with -enable-kvm) 1.1,
at least on AMD host, and the issue gets reported immediately when
starting the virtual machine with -m 1.4g (no other arguments).

Thanks,

/mjt


Re: [PATCH] KVM: Fix PCI header check on device assignment

2012-06-14 Thread Alex Williamson
Avi, Marcelo,

Can we get this in for 3.5 please?  There's already an ack from me on
the list.  Thanks,

Alex

On Tue, 2012-06-05 at 10:37 +0200, Jan Kiszka wrote:
> The masking was wrong (must have been 0x7f), and there is no need to
> re-read the value as pci_setup_device already does this for us.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43339
> Signed-off-by: Jan Kiszka 
> ---
>  virt/kvm/assigned-dev.c |4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 01f572c..b1e091a 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -635,7 +635,6 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>   int r = 0, idx;
>   struct kvm_assigned_dev_kernel *match;
>   struct pci_dev *dev;
> - u8 header_type;
>  
>   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
>   return -EINVAL;
> @@ -668,8 +667,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>   }
>  
>   /* Don't allow bridges to be assigned */
> - pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
> - if ((header_type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL) {
> + if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL) {
>   r = -EPERM;
>   goto out_put;
>   }
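[Editor's note] For context on the bug Jan fixed: PCI_HEADER_TYPE is the config-space *offset* of the header-type byte (0x0e), not a mask, and bit 7 of that byte is the multifunction flag, so the type itself must be masked with 0x7f. A sketch of the difference (the mask constant is written out here for clarity; the patch instead reads the already-masked `dev->hdr_type` that the PCI core stores):

```c
#include <stdint.h>

#define PCI_HEADER_TYPE        0x0e  /* config-space offset of the byte */
#define PCI_HEADER_TYPE_MASK   0x7f  /* low 7 bits hold the type */
#define PCI_HEADER_TYPE_NORMAL 0x00

/* Correct check: mask off the multifunction bit before comparing. */
int is_normal_function(uint8_t header_type_byte)
{
    return (header_type_byte & PCI_HEADER_TYPE_MASK) == PCI_HEADER_TYPE_NORMAL;
}

/* The buggy form: masking with the register offset (0x0e) keeps only
 * bits 1-3, so e.g. a bridge (type 0x01) wrongly passes the test. */
int is_normal_function_buggy(uint8_t header_type_byte)
{
    return (header_type_byte & PCI_HEADER_TYPE) == PCI_HEADER_TYPE_NORMAL;
}
```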





Re: KVM internal error with some amounts of guest memory

2012-06-14 Thread Michael Tokarev
On 14.06.2012 23:22, Michael Tokarev wrote:
> Now that's something else.  Reported by a debian user, but
> trivially reproducible.
> 
> $ kvm -m 1.4g
> KVM internal error. Suberror: 1
> emulation failure
> EAX=000e3c54 EBX= ECX= EDX=0cfd
> ESI= EDI= EBP= ESP=6fe8
> EIP=000f309b EFL=0016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010   00c09300 DPL=0 DS   [-WA]
> CS =0008   00c09b00 DPL=0 CS32 [-RA]
> SS =0010   00c09300 DPL=0 DS   [-WA]
> DS =0010   00c09300 DPL=0 DS   [-WA]
> FS =0010   00c09300 DPL=0 DS   [-WA]
> GS =0010   00c09300 DPL=0 DS   [-WA]
> LDT=   8200 DPL=0 LDT
> TR =   8b00 DPL=0 TSS32-busy
> GDT= 000fd3a8 0037
> IDT= 000fd3e6 
> CR0=0011 CR2= CR3= CR4=
> DR0= DR1= DR2= 
> DR3=
> DR6=0ff0 DR7=0400
> EFER=
> Code=ff ff ba 59 00 00 00 a8 10 89 d8 75 09 b9 ef 2f ff ff ff d1  23 59 
> 5b 5e e9 4a ff ff ff 31 d2 89 f0 e8 6c fa ff ff 89 c6 85 c0 79 ab c7 04 24 8c 
> 4c
> 
> This is 1.1.

Qemu 1.1 with -enable-kvm also has this very issue.  Switching to qemu-devel@.

> -m 1.5g works.  -no-kvm works.

I can trivially reproduce this on two machines, both AMD-based.  I don't
know if it is AMD-specific or not.

/mjt


KVM internal error with some amounts of guest memory

2012-06-14 Thread Michael Tokarev
Now that's something else.  Reported by a debian user, but
trivially reproducible.

$ kvm -m 1.4g
KVM internal error. Suberror: 1
emulation failure
EAX=000e3c54 EBX= ECX= EDX=0cfd
ESI= EDI= EBP= ESP=6fe8
EIP=000f309b EFL=0016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000fd3a8 0037
IDT= 000fd3e6 
CR0=0011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=ff ff ba 59 00 00 00 a8 10 89 d8 75 09 b9 ef 2f ff ff ff d1  23 59 5b 
5e e9 4a ff ff ff 31 d2 89 f0 e8 6c fa ff ff 89 c6 85 c0 79 ab c7 04 24 8c 4c

This is 1.1.

-m 1.5g works.  -no-kvm works.

What's that? :)

Thanks,

/mjt


RE: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest

2012-06-14 Thread Liu, Jinsong
Eduardo Habkost wrote:
> On Thu, Jun 14, 2012 at 07:02:03PM +, Liu, Jinsong wrote:
>> Eduardo, Jan
>> 
>> I will update the tsc deadline timer patch (on the qemu-kvm side) soon.
>> Have you reached a final agreement on the issue
>> 'KVM_CAP_TSC_DEADLINE_TIMER' vs. 'GET_SUPPORTED_CPUID'? 
> 
> I don't think there's a final agreement, but I was convinced later
> that it's probably better to _not_ have TSC-deadline on
> GET_SUPPORTED_CPUID, at least not by default.
> 
> Even if this is changed in the future, it's a good idea to make qemu
> support the KVM_CAP_TSC_DEADLINE_TIMER method if running on older
> kernels, anyway.

OK, so I will do the coding based on the current KVM_CAP_TSC_DEADLINE_TIMER method.

Thanks for clarifying!


> 
> 
>> Eduardo Habkost wrote:
>>> (CCing Andre Przywara, in case he can help to clarify what's the
>>> expected meaning of "-cpu host")
>>> 
>>> On Tue, Apr 24, 2012 at 06:06:55PM +0200, Jan Kiszka wrote:
 On 2012-04-23 22:02, Eduardo Habkost wrote:
> On Mon, Apr 23, 2012 at 06:31:25PM +0200, Jan Kiszka wrote:
>> However, that was how I interpreted this GET_SUPPORTED_CPUID. In
>> fact, it is used as "kernel or hardware does not _prevent_"
>> already. And in that sense, it's ok to enable even features that
>> are not in kernel/hardware hands. We should point out this fact
>> in the documentation.
> 
> I see GET_SUPPORTED_CPUID as just a "what userspace can enable
> because the kernel and the hardware support it (= don't prevent
> it), as long as userspace has the required support" (meaning A+B).
> It's a bit like KVM_CHECK_EXTENSION, but with the nice feature
>>> that the capabilities map directly to CPUID bits.
> 
> So, it's not clear to me: now you are OK with adding TSC_DEADLINE
> to GET_SUPPORTED_CPUID? 
> 
> But we still have the issue of "-cpu host" not knowing what can be
> safely enabled (without userspace feature-specific setup code), or
> not. Do you have any suggestion for that? Avi, do you have any
> suggestion?
 
 First of all, I bet this was already broken with the introduction
 of x2apic. So TSC deadline won't make it worse. I guess we need to
 address this in userspace, first by masking those features out,
 later by actually emulating them.
>>> 
>>> I am not sure I understand what you are proposing. Let me explain
>>> the use case I am thinking about: 
>>> 
>>> - Feature FOO is of type (A) (e.g. just a new instruction set that
>>>   doesn't require additional userspace support)
>>> - User has a Qemu version that doesn't know anything about feature
>>> FOO 
>>> - User gets a new CPU that supports feature FOO
>>> - User gets a new kernel that supports feature FOO (i.e. has FOO in
>>> GET_SUPPORTED_CPUID) 
>>> - User does _not_ upgrade Qemu.
>>> - User expects to get feature FOO enabled if using "-cpu host",  
>>> without upgrading Qemu. 
>>> 
>>> The problem here is: to support the above use-case, userspace needs a
>>> probing mechanism that can differentiate _new_ (previously unknown)
>>> features that are in group (A) (safe to blindly enable) from
>>> features that are in group (B) (that can't be enabled without a
>>> userspace upgrade). 
>>> 
>>> In short, it becomes a problem if we consider the following case:
>>> 
>>> - Feature BAR is of type (B) (it can't be enabled without extra  
>>> userspace support) 
>>> - User has a Qemu version that doesn't know anything about feature
>>> BAR 
>>> - User gets a new CPU that supports feature BAR
>>> - User gets a new kernel that supports feature BAR (i.e. has BAR in
>>> GET_SUPPORTED_CPUID) 
>>> - User does _not_ upgrade Qemu.
>>> - User simply shouldn't get feature BAR enabled, even if using "-cpu
>>>   host", otherwise Qemu would break.
>>> 
>>> If userspace always limited itself to features it knows about, it
>>> would be really easy to implement the feature without any new
>>> probing mechanism from the kernel. But that's not how I think users
>>> expect "-cpu host" to work. Maybe I am wrong, I don't know. I am
>>> CCing Andre, who introduced the "-cpu host" feature, in case he can
>>> explain what's the expected semantics on the cases above.
>>> 
 
> 
> And I still don't know the answer to:
> 
>>> - How to precisely define the groups (A) and (B)?
>>>   - "requires additional code only if migration is required"
>>> qualifies as (B) or (A)?
> 
> 
> Re: documentation, isn't the following paragraph (already present
> on api.txt) sufficient? 
> 
> "The entries returned are the host cpuid as returned by the cpuid
> instruction, with unknown or unsupported features masked out. 
> Some features (for example, x2apic), may not be present in the
> host cpu, but are exposed by kvm if it can emulate them
> efficiently." 
 
 That suggests such features are always emulated - which is not
 true. They are either emulated, or nothing _prevents_ their
 emulat

Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest

2012-06-14 Thread Eduardo Habkost
On Thu, Jun 14, 2012 at 07:02:03PM +, Liu, Jinsong wrote:
> Eduardo, Jan
> 
> I will update the tsc deadline timer patch (on the qemu-kvm side) soon.
> Have you reached a final agreement on the issue 'KVM_CAP_TSC_DEADLINE_TIMER' vs. 
> 'GET_SUPPORTED_CPUID'?

I don't think there's a final agreement, but I was convinced later that
it's probably better to _not_ have TSC-deadline on GET_SUPPORTED_CPUID,
at least not by default.

Even if this is changed in the future, it's a good idea to make qemu
support the KVM_CAP_TSC_DEADLINE_TIMER method if running on older
kernels, anyway.


> Eduardo Habkost wrote:
> > (CCing Andre Przywara, in case he can help to clarify what's the
> > expected meaning of "-cpu host")
> > 
> > On Tue, Apr 24, 2012 at 06:06:55PM +0200, Jan Kiszka wrote:
> >> On 2012-04-23 22:02, Eduardo Habkost wrote:
> >>> On Mon, Apr 23, 2012 at 06:31:25PM +0200, Jan Kiszka wrote:
>  However, that was how I interpreted this GET_SUPPORTED_CPUID. In
>  fact, 
>  it is used as "kernel or hardware does not _prevent_" already. And
>  in that sense, it's ok to enable even features that are not in
>  kernel/hardware hands. We should point out this fact in the
>  documentation. 
> >>> 
> >>> I see GET_SUPPORTED_CPUID as just a "what userspace can enable
> >>> because the kernel and the hardware support it (= don't prevent
> >>> it), as long as userspace has the required support" (meaning A+B).
> >>> It's a bit like KVM_CHECK_EXTENSION, but with the nice feature that
> >>> the capabilities map directly to CPUID bits.
> >>> 
> >>> So, it's not clear to me: now you are OK with adding TSC_DEADLINE
> >>> to GET_SUPPORTED_CPUID? 
> >>> 
> >>> But we still have the issue of "-cpu host" not knowing what can be
> >>> safely enabled (without userspace feature-specific setup code), or
> >>> not. Do you have any suggestion for that? Avi, do you have any
> >>> suggestion? 
> >> 
> >> First of all, I bet this was already broken with the introduction of
> >> x2apic. So TSC deadline won't make it worse. I guess we need to
> >> address this in userspace, first by masking those features out,
> >> later by actually emulating them.
> > 
> > I am not sure I understand what you are proposing. Let me explain the
> > use case I am thinking about:
> > 
> > - Feature FOO is of type (A) (e.g. just a new instruction set that
> >   doesn't require additional userspace support)
> > - User has a Qemu version that doesn't know anything about feature FOO
> > - User gets a new CPU that supports feature FOO
> > - User gets a new kernel that supports feature FOO (i.e. has FOO in
> >   GET_SUPPORTED_CPUID)
> > - User does _not_ upgrade Qemu.
> > - User expects to get feature FOO enabled if using "-cpu host",
> >   without upgrading Qemu.
> > 
> > The problem here is: to support the above use-case, userspace needs a
> > probing mechanism that can differentiate _new_ (previously unknown)
> > features that are in group (A) (safe to blindly enable) from features
> > that are in group (B) (that can't be enabled without an userspace
> > upgrade).
> > 
> > In short, it becomes a problem if we consider the following case:
> > 
> > - Feature BAR is of type (B) (it can't be enabled without extra
> >   userspace support)
> > - User has a Qemu version that doesn't know anything about feature BAR
> > - User gets a new CPU that supports feature BAR
> > - User gets a new kernel that supports feature BAR (i.e. has BAR in
> >   GET_SUPPORTED_CPUID)
> > - User does _not_ upgrade Qemu.
> > - User simply shouldn't get feature BAR enabled, even if using "-cpu
> >   host", otherwise Qemu would break.
> > 
> > If userspace always limited itself to features it knows about, it
> > would be really easy to implement the feature without any new probing
> > mechanism from the kernel. But that's not how I think users expect
> > "-cpu host" to work. Maybe I am wrong, I don't know. I am CCing
> > Andre, who introduced the "-cpu host" feature, in case he can explain
> > what's the expected semantics on the cases above.
> > 
> >> 
> >>> 
> >>> And I still don't know the answer to:
> >>> 
> > - How to precisely define the groups (A) and (B)?
> >   - "requires additional code only if migration is required"
> > qualifies as (B) or (A)?
> >>> 
> >>> 
> >>> Re: documentation, isn't the following paragraph (already present
> >>> on api.txt) sufficient? 
> >>> 
> >>> "The entries returned are the host cpuid as returned by the cpuid
> >>> instruction, with unknown or unsupported features masked out.  Some
> >>> features (for example, x2apic), may not be present in the host cpu,
> >>> but are exposed by kvm if it can emulate them efficiently."
> >> 
> >> That suggests such features are always emulated - which is not true.
> >> They are either emulated, or nothing _prevents_ their emulation by
> >> user space.
> > 
> > Well... it's a bit more complicated than that: the current semantics
> > are a bit more than "doesn't prevent", as in theory every s

RE: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest

2012-06-14 Thread Liu, Jinsong
Eduardo, Jan

I will update the tsc deadline timer patch (on the qemu-kvm side) soon.
Have you reached a final agreement on the issue 'KVM_CAP_TSC_DEADLINE_TIMER' vs. 
'GET_SUPPORTED_CPUID'?

Thanks,
Jinsong


Eduardo Habkost wrote:
> (CCing Andre Przywara, in case he can help to clarify what's the
> expected meaning of "-cpu host")
> 
> On Tue, Apr 24, 2012 at 06:06:55PM +0200, Jan Kiszka wrote:
>> On 2012-04-23 22:02, Eduardo Habkost wrote:
>>> On Mon, Apr 23, 2012 at 06:31:25PM +0200, Jan Kiszka wrote:
 However, that was how I interpreted this GET_SUPPORTED_CPUID. In
 fact, 
 it is used as "kernel or hardware does not _prevent_" already. And
 in that sense, it's ok to enable even features that are not in
 kernel/hardware hands. We should point out this fact in the
 documentation. 
>>> 
>>> I see GET_SUPPORTED_CPUID as just a "what userspace can enable
>>> because the kernel and the hardware support it (= don't prevent
>>> it), as long as userspace has the required support" (meaning A+B).
>>> It's a bit like KVM_CHECK_EXTENSION, but with the nice feature that
>>> the capabilities map directly to CPUID bits.
>>> 
>>> So, it's not clear to me: now you are OK with adding TSC_DEADLINE
>>> to GET_SUPPORTED_CPUID? 
>>> 
>>> But we still have the issue of "-cpu host" not knowing what can be
>>> safely enabled (without userspace feature-specific setup code), or
>>> not. Do you have any suggestion for that? Avi, do you have any
>>> suggestion? 
>> 
>> First of all, I bet this was already broken with the introduction of
>> x2apic. So TSC deadline won't make it worse. I guess we need to
>> address this in userspace, first by masking those features out,
>> later by actually emulating them.
> 
> I am not sure I understand what you are proposing. Let me explain the
> use case I am thinking about:
> 
> - Feature FOO is of type (A) (e.g. just a new instruction set that
>   doesn't require additional userspace support)
> - User has a Qemu version that doesn't know anything about feature FOO
> - User gets a new CPU that supports feature FOO
> - User gets a new kernel that supports feature FOO (i.e. has FOO in
>   GET_SUPPORTED_CPUID)
> - User does _not_ upgrade Qemu.
> - User expects to get feature FOO enabled if using "-cpu host",
>   without upgrading Qemu.
> 
> The problem here is: to support the above use-case, userspace needs a
> probing mechanism that can differentiate _new_ (previously unknown)
> features that are in group (A) (safe to blindly enable) from features
> that are in group (B) (that can't be enabled without an userspace
> upgrade).
> 
> In short, it becomes a problem if we consider the following case:
> 
> - Feature BAR is of type (B) (it can't be enabled without extra
>   userspace support)
> - User has a Qemu version that doesn't know anything about feature BAR
> - User gets a new CPU that supports feature BAR
> - User gets a new kernel that supports feature BAR (i.e. has BAR in
>   GET_SUPPORTED_CPUID)
> - User does _not_ upgrade Qemu.
> - User simply shouldn't get feature BAR enabled, even if using "-cpu
>   host", otherwise Qemu would break.
> 
> If userspace always limited itself to features it knows about, it
> would be really easy to implement the feature without any new probing
> mechanism from the kernel. But that's not how I think users expect
> "-cpu host" to work. Maybe I am wrong, I don't know. I am CCing
> Andre, who introduced the "-cpu host" feature, in case he can explain
> what's the expected semantics on the cases above.
> 
>> 
>>> 
>>> And I still don't know the answer to:
>>> 
> - How to precisely define the groups (A) and (B)?
>   - "requires additional code only if migration is required"
> qualifies as (B) or (A)?
>>> 
>>> 
>>> Re: documentation, isn't the following paragraph (already present
>>> on api.txt) sufficient? 
>>> 
>>> "The entries returned are the host cpuid as returned by the cpuid
>>> instruction, with unknown or unsupported features masked out.  Some
>>> features (for example, x2apic), may not be present in the host cpu,
>>> but are exposed by kvm if it can emulate them efficiently."
>> 
>> That suggests such features are always emulated - which is not true.
>> They are either emulated, or nothing _prevents_ their emulation by
>> user space.
> 
> Well... it's a bit more complicated than that: the current semantics
> are a bit more than "doesn't prevent", as in theory every single
> feature can be emulated by userspace, without any help from the
> kernel. So, if "doesn't prevent" were the only criteria, the kernel
> would set every single feature bit on GET_SUPPORTED_CPUID, making it
> not very useful. 
> 
> At least in the case of x2apic, the kernel is using
> GET_SUPPORTED_CPUID to expose a _capability_ too: when x2apic is
> present on GET_SUPPORTED_CPUID, userspace knows that in addition to
> "not preventing" the feature from being enabled, the kernel is now
> able to emulate x2apic (if proper setup is made by userspace). A
> k

Re: [PATCH 2/5] drivers/net/ethernet/dec/tulip: Use standard __set_bit_le() function

2012-06-14 Thread Takuya Yoshikawa
On Thu, 14 Jun 2012 18:36:42 +0900
Akinobu Mita  wrote:

> >> 1) while I agree with Akinobu and thank him for pointing out a
> >> _potential_ alignment problem, this is a separate issue and your
> >> existing patch should go in anyway. There are probably other drivers
> >> with _potential_ alignment issues. Akinobu could get credit for
> >> finding them by submitting patches after reviewing calls to set_bit
> >> and set_bit_le() - similar to what you are doing now.
> >
> > I prefer approach 1.
> >
> > hash_table is local in build_setup_frame_hash(), so if further
> > improvement is also required, we can do that locally there later.
> 
> This potential alignment problem is introduced by this patch.  Because
> the original set_bit_le() in tulip driver can handle unaligned bitmap.
> This is why I recommended it should be fixed in this patch.

The original set_bit_le() was used only in build_setup_frame_hash().

If it's clear that the table is aligned locally in the function, I do
not think the __potential__ problem is introduced by this patch.

As you can see from my response to Arnd in v1 thread, I knew the
alignment requirement at that time and checked the definition of
hash_table before using __set_bit_le().

> But please just ignore me if I'm too much paranoid.  And I'll handle
> this issue if no one wants to do it.

I'm open to suggestions.

But now that the maintainer who can test the driver on real hardware
has suggested this patch should go in, I won't change the patch without
any real issue.

I would thank you if you improve this driver later on top of that.

Thanks,
Takuya


[PATCHv7 8/8] kvm: host side for eoi optimization

2012-06-14 Thread Michael S. Tsirkin
Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes by as much as half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that the host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.
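The bit dance described above — host sets the bit before injecting, clears it before injecting a nested interrupt, guest tests and clears it in one go — can be modeled in plain userspace C. All names below are illustrative, not the actual KVM symbols, and the real guest side uses a genuinely atomic test-and-clear:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the PV-EOI protocol; one byte stands in for the
 * per-APIC area in guest memory. Bit 0 set means "the EOI MSR write
 * can be skipped". */

static uint8_t pv_eoi_mem;              /* models the guest RAM area */

static void host_set_pending(void)      /* host: before injecting */
{
	pv_eoi_mem |= 1;
}

static void host_clr_pending(void)      /* host: before a nested injection */
{
	pv_eoi_mem &= ~1;
}

/* guest: test and clear bit 0; nonzero means the EOI MSR can be skipped */
static int guest_test_and_clear(void)
{
	int was_set = pv_eoi_mem & 1;

	pv_eoi_mem &= ~1;
	return was_set;
}
```

Because the guest clears the bit as it tests it, a second injection before the guest ran again leaves the bit clear on the host side, which is how interrupt nesting is detected.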

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/include/asm/kvm_host.h |  12 
 arch/x86/kvm/cpuid.c|   1 +
 arch/x86/kvm/lapic.c| 140 ++--
 arch/x86/kvm/lapic.h|   2 +
 arch/x86/kvm/trace.h|  34 ++
 arch/x86/kvm/x86.c  |   7 ++
 6 files changed, 192 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..dfc066c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -175,6 +175,13 @@ enum {
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC   0
+/*
+ * The following bit is set with PV-EOI, unset on EOI.
+ * We detect PV-EOI changes made by the guest by comparing
+ * this bit with the PV-EOI bit in guest memory.
+ * See the implementation in apic_update_pv_eoi.
+ */
+#define KVM_APIC_PV_EOI_PENDING 1
 
 /*
  * We don't want allocation failures within the mmu code, so we preallocate
@@ -484,6 +491,11 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   } pv_eoi;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7df1c6d..61ccbdf 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
 (1 << KVM_FEATURE_NOP_IO_DELAY) |
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
+(1 << KVM_FEATURE_PV_EOI) |
 (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 
if (sched_info_on())
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 805d887..9bf78af 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -311,6 +311,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
kvm_lapic_irq *irq)
irq->level, irq->trig_mode);
 }
 
+static int pv_eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
+{
+
+   return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, &val,
+ sizeof(val));
+}
+
+static int pv_eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
+{
+
+   return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, val,
+ sizeof(*val));
+}
+
+static inline bool pv_eoi_enabled(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
+}
+
+static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
+{
+   u8 val;
+   if (pv_eoi_get_user(vcpu, &val) < 0)
+   apic_debug("Can't read EOI MSR value: 0x%llx\n",
+  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
+   return val & 0x1;
+}
+
+static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
+{
+   if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0) {
+   apic_debug("Can't set EOI MSR value: 0x%llx\n",
+  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
+   return;
+   }
+   __set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
+static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
+{
+   if (pv_eoi_put_user(vcpu, KVM_PV_EOI_DISABLED) < 0) {
+   apic_debug("Can't clear EOI MSR value: 0x%llx\n",
+  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
+   return;
+   }
+   __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
 static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 {
int result;
@@ -527,15 +575,18 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct 
kvm_vcpu *vcpu2)
return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
-static void apic_set_eoi(struct kvm_lapic *apic)
+static int apic_set_eoi(struct kvm_lapic *apic)
 {
int vector = apic_find_highest_isr(apic);
+
+   trace_kvm_eoi(apic, vector);
+
/*
 * Not every EOI write has a corresponding ISR,
 * one example is when Ke

[PATCHv7 7/8] kvm: rearrange injection cancelling code

2012-06-14 Thread Michael S. Tsirkin
Each time we need to cancel injection we invoke the same code (the
cancel_injection callback).  Move it towards the end of the function
using the familiar goto-on-error pattern.

This will make it easier to do more cleanups for PV EOI.
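The pattern itself, reduced to a stand-alone sketch — the struct and function names here are illustrative, not the actual KVM ones. Every failure path after the event has been injected funnels through one label instead of calling the cancel hook inline:

```c
#include <assert.h>

/* Goto-on-error sketch: each error path after injection jumps to a
 * single cancel label near the end of the function. */

struct vcpu { int injected; };

static void cancel_injection(struct vcpu *v)
{
	v->injected = 0;
}

static int enter_guest(struct vcpu *v, int reload_fails, int req_pending)
{
	int r;

	v->injected = 1;    /* event injected; may need cancelling */

	if (reload_fails) {
		r = -1;
		goto cancel;    /* was: cancel_injection(v); return r; */
	}
	if (req_pending) {
		r = 1;
		goto cancel;
	}
	return 0;           /* guest actually entered; injection consumed */

cancel:
	cancel_injection(v);
	return r;
}
```

The benefit shows up when more cancel-path work is added later: it lands in exactly one place.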

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/x86.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4e582c0..cb19d3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5296,8 +5296,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
r = kvm_mmu_reload(vcpu);
if (unlikely(r)) {
-   kvm_x86_ops->cancel_injection(vcpu);
-   goto out;
+   goto cancel_injection;
}
 
preempt_disable();
@@ -5322,9 +5321,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
smp_wmb();
local_irq_enable();
preempt_enable();
-   kvm_x86_ops->cancel_injection(vcpu);
r = 1;
-   goto out;
+   goto cancel_injection;
}
 
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -5392,6 +5390,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_lapic_sync_from_vapic(vcpu);
 
r = kvm_x86_ops->handle_exit(vcpu);
+   return r;
+
+cancel_injection:
+   kvm_x86_ops->cancel_injection(vcpu);
 out:
return r;
 }
-- 
MST



[PATCHv7 5/8] kvm: eoi msi documentation

2012-06-14 Thread Michael S. Tsirkin
Document the new EOI MSR. Couldn't decide whether this change belongs
conceptually on guest or host side, so a separate patch.

Signed-off-by: Michael S. Tsirkin 
---
 Documentation/virtual/kvm/msr.txt | 32 
 1 file changed, 32 insertions(+)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index 96b41bd..6ae5a85 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -223,3 +223,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
steal: the amount of time in which this vCPU did not run, in
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
+
+MSR_KVM_EOI_EN: 0x4b564d04
+   data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+   when disabled.  When enabled, bits 63-1 hold the 2-byte aligned
+   physical address of a 2-byte memory area which must be in guest RAM
+   and must be zeroed.
+
+   The least significant bit of the 2-byte memory location will be
+   written by the hypervisor, typically at the time of interrupt
+   injection.  A value of 1 means that the guest can skip writing EOI
+   to the APIC (using an MSR or MMIO write); instead, it is sufficient
+   to signal EOI by clearing the bit in guest memory - this location
+   will later be polled by the hypervisor.
+   A value of 0 means that the EOI write is required.
+
+   It is always safe for the guest to ignore the optimization and
+   perform the APIC EOI write anyway.
+
+   The hypervisor is guaranteed to only modify this least significant
+   bit while in the current VCPU context; this means that the guest
+   does not need to use either a lock prefix or memory ordering
+   primitives to synchronise with the hypervisor.
+
+   However, the hypervisor can set and clear this memory bit at any
+   time.  To make sure the hypervisor does not clear the least
+   significant bit in the window between the guest testing it (to
+   decide whether the EOI APIC write can be skipped) and clearing it
+   (to signal EOI to the hypervisor), the guest must both read and
+   clear the bit using a single CPU instruction, such as test and
+   clear, or compare and exchange.
+
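As a sketch of the single-instruction requirement in the documentation above — illustrative only: an actual guest uses the architecture's test-and-clear bitop (on x86, a `btr`-based sequence), whereas this example leans on a compiler atomic builtin:

```c
#include <assert.h>
#include <stdint.h>

/* The guest must read and clear the least significant bit in one
 * operation, so the hypervisor cannot clear it between the test and
 * the clear. Here modeled with a GCC/Clang atomic builtin. */

static int eoi_test_and_clear(uint16_t *area)
{
	/* atomically clear bit 0 and return its previous value */
	return __atomic_fetch_and(area, (uint16_t)~1u, __ATOMIC_RELAXED) & 1;
}
```

A naive two-step `if (*area & 1) *area &= ~1;` would open exactly the race window the text warns about.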
-- 
MST



[PATCHv7 6/8] kvm: only sync when attention bits set

2012-06-14 Thread Michael S. Tsirkin
Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced the apic
attention bitmask, but kvm still syncs the lapic unconditionally.
As that commit suggested, and in anticipation of adding more attention
bits, only sync the lapic if (apic_attention) is set.

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..4e582c0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5388,7 +5388,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (unlikely(vcpu->arch.tsc_always_catchup))
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
-   kvm_lapic_sync_from_vapic(vcpu);
+   if (vcpu->arch.apic_attention)
+   kvm_lapic_sync_from_vapic(vcpu);
 
r = kvm_x86_ops->handle_exit(vcpu);
 out:
-- 
MST



[PATCHv7 3/8] kvm_para: guest side for eoi avoidance

2012-06-14 Thread Michael S. Tsirkin
The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
Guest tests it using a single test-and-clear operation - this is
necessary so that the host can detect interrupt nesting - and if set, it can
skip the EOI MSR.

I run a simple microbenchmark to show exit reduction
(note: for testing, need to apply follow-up patch
'kvm: host side for eoi optimization' + a qemu patch
 I posted separately, on host):

Before:

Performance counter stats for 'sleep 1s':

    47,357  kvm:kvm_entry                  [99.98%]
         0  kvm:kvm_hypercall              [99.98%]
         0  kvm:kvm_hv_hypercall           [99.98%]
     5,001  kvm:kvm_pio                    [99.98%]
         0  kvm:kvm_cpuid                  [99.98%]
    22,124  kvm:kvm_apic                   [99.98%]
    49,849  kvm:kvm_exit                   [99.98%]
    21,115  kvm:kvm_inj_virq               [99.98%]
         0  kvm:kvm_inj_exception          [99.98%]
         0  kvm:kvm_page_fault             [99.98%]
    22,937  kvm:kvm_msr                    [99.98%]
         0  kvm:kvm_cr                     [99.98%]
         0  kvm:kvm_pic_set_irq            [99.98%]
         0  kvm:kvm_apic_ipi               [99.98%]
    22,207  kvm:kvm_apic_accept_irq        [99.98%]
    22,421  kvm:kvm_eoi                    [99.98%]
         0  kvm:kvm_pv_eoi                 [99.99%]
         0  kvm:kvm_nested_vmrun           [99.99%]
         0  kvm:kvm_nested_intercepts      [99.99%]
         0  kvm:kvm_nested_vmexit          [99.99%]
         0  kvm:kvm_nested_vmexit_inject   [99.99%]
         0  kvm:kvm_nested_intr_vmexit     [99.99%]
         0  kvm:kvm_invlpga                [99.99%]
         0  kvm:kvm_skinit                 [99.99%]
        57  kvm:kvm_emulate_insn           [99.99%]
         0  kvm:vcpu_match_mmio            [99.99%]
         0  kvm:kvm_userspace_exit         [99.99%]
         2  kvm:kvm_set_irq                [99.99%]
         2  kvm:kvm_ioapic_set_irq         [99.99%]
    23,609  kvm:kvm_msi_set_irq            [99.99%]
         1  kvm:kvm_ack_irq                [99.99%]
       131  kvm:kvm_mmio                   [99.99%]
       226  kvm:kvm_fpu                    [100.00%]
         0  kvm:kvm_age_page               [100.00%]
         0  kvm:kvm_try_async_get_page     [100.00%]
         0  kvm:kvm_async_pf_doublefault   [100.00%]
         0  kvm:kvm_async_pf_not_present   [100.00%]
         0  kvm:kvm_async_pf_ready         [100.00%]
         0  kvm:kvm_async_pf_completed

       1.002100578 seconds time elapsed

After:

 Performance counter stats for 'sleep 1s':

    28,354  kvm:kvm_entry                  [99.98%]
         0  kvm:kvm_hypercall              [99.98%]
         0  kvm:kvm_hv_hypercall           [99.98%]
     1,347  kvm:kvm_pio                    [99.98%]
         0  kvm:kvm_cpuid                  [99.98%]
     1,931  kvm:kvm_apic                   [99.98%]
    29,595  kvm:kvm_exit                   [99.98%]
    24,884  kvm:kvm_inj_virq               [99.98%]
         0  kvm:kvm_inj_exception          [99.98%]
         0  kvm:kvm_page_fault             [99.98%]
     1,

[PATCHv7 4/8] x86/bitops: note on __test_and_clear_bit atomicity

2012-06-14 Thread Michael S. Tsirkin
__test_and_clear_bit is actually atomic with respect
to the local CPU.  Add a note saying that KVM on x86
relies on this behaviour so people don't accidentally break it.
Also warn not to rely on this in portable code.
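A sketch of why the portable form gives no such guarantee: the asm-generic style fallback is a separate load and a separate store, unlike the single x86 `btr` read-modify-write instruction that KVM relies on. The names below are illustrative, not the kernel's:

```c
#include <assert.h>

/* asm-generic style fallback for __test_and_clear_bit: load, then
 * store. An interrupt on the same CPU *can* land between the two -
 * exactly the local-CPU atomicity portable code must not assume.
 * On x86 the same operation compiles to one btr instruction. */

#define BITS_PER_LONG_SKETCH ((int)(8 * sizeof(unsigned long)))

static int generic_test_and_clear_bit(int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG_SKETCH);
	unsigned long *p = addr + nr / BITS_PER_LONG_SKETCH;
	unsigned long old = *p;   /* load  */

	*p = old & ~mask;         /* store: an IRQ could fire in between */
	return (old & mask) != 0;
}
```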

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/include/asm/bitops.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 47f9eff..75e3787 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -266,6 +266,13 @@ static inline int test_and_clear_bit(int nr, volatile 
unsigned long *addr)
  * This operation is non-atomic and can be reordered.
  * If two examples of this operation race, one can appear to succeed
  * but actually fail.  You must protect multiple accesses with a lock.
+ *
+ * Note: the operation is performed atomically with respect to
+ * the local CPU, but not other CPUs. Portable code should not
+ * rely on this behaviour.
+ * KVM relies on this behaviour on x86 for modifying memory that is also
+ * accessed from a hypervisor on the same CPU if running in a VM: don't change
+ * this without also updating arch/x86/kernel/kvm.c
  */
 static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 {
-- 
MST



[PATCHv7 2/8] kvm: optimize ISR lookups

2012-06-14 Thread Michael S. Tsirkin
We perform ISR lookups twice: during interrupt
injection and on EOI. Typical workloads only have
a single bit set there. So we can avoid ISR scans by
1. counting bits as we set/clear them in ISR
2. on set, caching the injected vector number
3. on clear, invalidating the cache

The real purpose of this is enabling PV EOI
which needs to quickly validate the vector.
But non PV guests also benefit: with this patch,
and without interrupt nesting, apic_find_highest_isr
will always return immediately without scanning ISR.

Note on likely() usage in apic_find_highest_isr:
in my testing the cache was valid in >99% cases,
so the annotation seems justified.

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/lapic.c | 53 ++--
 arch/x86/kvm/lapic.h |  4 
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..805d887 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -107,6 +107,16 @@ static inline void apic_clear_vector(int vec, void *bitmap)
clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }
 
+static inline int __apic_test_and_set_vector(int vec, void *bitmap)
+{
+   return __test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
+{
+   return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
 static inline int apic_hw_enabled(struct kvm_lapic *apic)
 {
return (apic)->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
@@ -210,6 +220,16 @@ static int find_highest_vector(void *bitmap)
return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
 }
 
+static u8 count_vectors(void *bitmap)
+{
+   u32 *word = bitmap;
+   int word_offset;
+   u8 count = 0;
+   for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+   count += hweight32(word[word_offset << 2]);
+   return count;
+}
+
 static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
 {
apic->irr_pending = true;
@@ -242,6 +262,27 @@ static inline void apic_clear_irr(int vec, struct 
kvm_lapic *apic)
apic->irr_pending = true;
 }
 
+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+   if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+   ++apic->isr_count;
+   BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
+   /*
+* ISR (in service register) bit is set when injecting an interrupt.
+* The highest vector is injected. Thus the latest bit set matches
+* the highest bit in ISR.
+*/
+   apic->highest_isr_cache = vec;
+}
+
+static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
+{
+   if (__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
+   --apic->isr_count;
+   BUG_ON(apic->isr_count < 0);
+   apic->highest_isr_cache = -1;
+}
+
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
 {
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -273,6 +314,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
kvm_lapic_irq *irq)
 static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 {
int result;
+   if (!apic->isr_count)
+   return -1;
+   if (likely(apic->highest_isr_cache != -1))
+   return apic->highest_isr_cache;
 
result = find_highest_vector(apic->regs + APIC_ISR);
ASSERT(result == -1 || result >= 16);
@@ -492,7 +537,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
if (vector == -1)
return;
 
-   apic_clear_vector(vector, apic->regs + APIC_ISR);
+   apic_clear_isr(vector, apic);
apic_update_ppr(apic);
 
if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
@@ -1081,6 +1126,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
}
apic->irr_pending = false;
+   apic->isr_count = 0;
+   apic->highest_isr_cache = -1;
update_divide_count(apic);
atomic_set(&apic->lapic_timer.pending, 0);
if (kvm_vcpu_is_bsp(vcpu))
@@ -1248,7 +1295,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
if (vector == -1)
return -1;
 
-   apic_set_vector(vector, apic->regs + APIC_ISR);
+   apic_set_isr(vector, apic);
apic_update_ppr(apic);
apic_clear_irr(vector, apic);
return vector;
@@ -1267,6 +1314,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
update_divide_count(apic);
start_apic_timer(apic);
apic->irr_pending = true;
+   apic->isr_count = count_vectors(apic->regs + APIC_ISR);
+   apic->highest_isr_cache = -1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d29da25..5ac9e5e 100644
--- a/arch/x86/kvm/lapic.h
+++ b/a

[PATCHv7 1/8] kvm: document lapic regs field

2012-06-14 Thread Michael S. Tsirkin
Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/lapic.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6f4ce25..d29da25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -13,6 +13,11 @@ struct kvm_lapic {
u32 divide_count;
struct kvm_vcpu *vcpu;
bool irr_pending;
+   /**
+* APIC register page.  The layout matches the register layout seen by
+* the guest 1:1, because it is accessed by the vmx microcode.
+* Note: Only one register, the TPR, is used by the microcode.
+*/
void *regs;
gpa_t vapic_addr;
struct page *vapic_page;
-- 
MST



[PATCHv7 0/8] kvm: eoi optimization support

2012-06-14 Thread Michael S. Tsirkin
I'm looking at reducing the interrupt overhead for virtualized guests:
some workloads spend a large part of their time processing interrupts.

On kvm, an EOI write from the guest causes an expensive exit to host; we
avoid this using shared memory.

The patches work fine on my boxes. See individual patches
for perf tests. You need to patch qemu to whitelist the kvm feature.
qemu patch was sent separately.

The patches are against Linus's master and apply to kvm.git
cleanly.  The last patch in the series, supplying the host
part, also depends on the ISR optimization patch that I
have for convenience included in the series (patch 2),
I also included a documentation patch (patch 1) - it is
here since it clarifies patch 2.  This revision does not yet address
Thomas's idea of reworking the APIC page handling.  Changes to this
optimization would require reworking this last patch in the series.

The rest of the patchset has not changed significantly since v2.

Thanks,
MST

changes from v6:
	Address Marcelo's comments:
		marcelo's comments for isr optimization: isr_cache -> highest_*
		marcelo's comments for host side eoi: rename isr_cache -> highest_*
		kvm eoi msr doc: fix typo pointed out by marcelo
		isr optimization: add comment to address marcelo's request
		kvm: don't make lapic attention check unlikely, at marcelo's request
		eoi host side: remove unlikely annotations

Changes from v5:
Clear PV EOI when we cancel interrupts.
Pointed out by Gleb.
Always set ISR cache when we inject an interrupt.
Suggested by Ronen Hod.

Changes from v4:
Turn off PV EOI on each exit. Turn it back on when safe.
Suggested by Avi.
Address bug with nested interrupts pointed out by Marcelo.

Changes from v3:
Address review comments by Marcelo:
Multiple cosmetic changes eoi -> pv_eoi
Added multiple comments
Changes from v2:
Kill guest with GP on an illegal MSR value
Add documentation

Changes from v1:
Add host side patch to series
Remove kvm-specific __test_and_clear_bit, document
that x86 one does what we want already
Clear msr on cpu unplug

Michael S. Tsirkin (8):
  kvm: document lapic regs field
  kvm: optimize ISR lookups
  kvm_para: guest side for eoi avoidance
  x86/bitops: note on __test_and_clear_bit atomicity
  kvm: eoi msi documentation
  kvm: only sync when attention bits set
  kvm: rearrange injection cancelling code
  kvm: host side for eoi optimization

 Documentation/virtual/kvm/msr.txt |  32 +++
 arch/x86/include/asm/bitops.h |  13 ++-
 arch/x86/include/asm/kvm_host.h   |  12 +++
 arch/x86/include/asm/kvm_para.h   |   7 ++
 arch/x86/kernel/kvm.c |  51 +-
 arch/x86/kvm/cpuid.c  |   1 +
 arch/x86/kvm/lapic.c  | 193 --
 arch/x86/kvm/lapic.h  |  11 +++
 arch/x86/kvm/trace.h  |  34 +++
 arch/x86/kvm/x86.c|  20 +++-
 10 files changed, 358 insertions(+), 16 deletions(-)

-- 
MST


Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO

2012-06-14 Thread Avi Kivity
On 05/16/2012 10:55 AM, zhangyanfei wrote:
> This patch implements a new module named vmcsinfo-intel. The
> module fills VMCSINFO with the VMCS revision identifier,
> and encoded offsets of VMCS fields.
> 
> Note, offsets of fields below will not be filled into VMCSINFO:
> 1. fields defined in Intel specification (Intel® 64 and
>IA-32 Architectures Software Developer’s Manual, Volume
>3C) but not defined in *vmcs_field*.
> 2. fields don't exist because their corresponding control bits
>are not set.
> 
> +
> +/*
> + * We separate these five control fields from other fields
> + * because some fields only exist on processors that support
> + * the 1-setting of control bits in the five control fields.
> + */

I thought this was checked only during VMENTRY.  So perhaps you don't
need this special casing.

In fact you might be able to

   // pre-fill vmcs with patterns

  for (i = 0; i < 64k; ++i)
  if (vmcs_read_checking(i, &pattern)) {
   // decode pattern
  } else
   // field does not exist (VM Instruction error 12), ignore

with no knowledge of the control fields, or of any field name.


> +
> +/*
> + * The format of VMCSINFO is given below:
> + *   +-+--+
> + *   | Byte offset | Contents |
> + *   +-+--+
> + *   | 0   | VMCS revision identifier |
> + *   +-+--+
> + *   | 4   | first <field, encoded offset> set|
> + *   +-+--+
> + *   | 16  | second <field, encoded offset> set   |
> + *   +-+--+
> + *   ..
> + *
> + * The first 32 bits of VMCSINFO contain the VMCS revision
> + * identifier.
> + * The remainder of VMCSINFO is used for <field, encoded offset>
> + * sets. Each set takes 12 bytes: the field occupies 4 bytes
> + * and its corresponding encoded offset occupies 8 bytes.
> + *
> + * Encoded offsets are raw values read by vmcs_read{16, 64, 32, l},
> + * and they are all zero-extended to 8 bytes so that each
> + * <field, encoded offset> set has the same size.
> + * We do not decode offsets here. The decoding work is deferred
> + * to userspace tools.

It's better to do the decoding here, or no one will know how to do it.
Also have an nfields field.
-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 1/5] x86: Add helper variables and functions to hold VMCSINFO

2012-06-14 Thread Avi Kivity
On 05/16/2012 10:52 AM, zhangyanfei wrote:
> This patch provides a set of variables to hold the VMCSINFO and also
> some helper functions to help fill the VMCSINFO.

Need to document the format.


> +void vmcsinfo_append_id(u32 id)
> +{
> + size_t r;
> +
> + r = sizeof(id);
> + if (r + vmcsinfo_size > vmcsinfo_max_size)
> + return;
> +
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &id, r);
> + vmcsinfo_size += r;
> +}
> +EXPORT_SYMBOL_GPL(vmcsinfo_append_id);
> +
> +void vmcsinfo_append_field(u32 field, u64 offset)

Why u64?  It's guaranteed to fit within a page.

> +{
> + size_t r;
> +
> + r = sizeof(field) + sizeof(offset);
> + if (r + vmcsinfo_size > vmcsinfo_max_size)
> + return;
> +
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &field, sizeof(field));
> + vmcsinfo_size += sizeof(field);
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &offset, sizeof(offset));
> + vmcsinfo_size += sizeof(offset);

Instead of this vmcsinfo_data, how about a struct with fields for the
revision ID and field count, and an array for the fields?  Should be a
lot simpler.

> +}
> +EXPORT_SYMBOL_GPL(vmcsinfo_append_field);
> +
> +unsigned long paddr_vmcsinfo_note(void)
> +{
> + return __pa((unsigned long)(char *)&vmcsinfo_note);
> +}
> 


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 5/5] Documentation: Add ABI entry for sysfs file vmcsinfo and vmcsinfo_maxsize

2012-06-14 Thread Avi Kivity
On 05/16/2012 10:57 AM, zhangyanfei wrote:
> We create two new sysfs files, vmcsinfo and vmcsinfo_maxsize. And
> here we add an Documentation/ABI entry for them.
> 
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcsinfo 
> b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
> new file mode 100644
> index 000..adbf866
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
> @@ -0,0 +1,16 @@
> +What:/sys/kernel/vmcsinfo
> +Date:April 2012
> +KernelVersion:   3.4.0
> +Contact: Zhang Yanfei 
> +Description
> + Shows physical address of VMCSINFO. VMCSINFO contains
> + the VMCS revision identifier and encoded offsets of fields
> + in VMCS data on Intel processors equipped with the VT
> + extensions.
> +
> +What:/sys/kernel/vmcsinfo_maxsize
> +Date:April 2012
> +KernelVersion:   3.4.0
> +Contact: Zhang Yanfei 
> +Description
> + Shows max size of VMCSINFO.
> 

This describes the cpu, so maybe /sys/devices/cpu is a better place for
these files.

Would it make sense to expose the actual fields?

that is, have /sys/devices/cpu/vmcs/0806 contain the offset of
GUEST_DS_SELECTOR.

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

2012-06-14 Thread Avi Kivity
On 06/11/2012 08:35 AM, Yanfei Zhang wrote:
> Hello Avi,


Sorry about the delay...

> 
> On 2012/05/29 15:06, Yanfei Zhang wrote:
>> On 2012/05/28 21:28, Avi Kivity wrote:
>>> On 05/28/2012 08:25 AM, Yanfei Zhang wrote:

 Do you have any comments about this patch set?
>>>
>>> I still have a hard time understanding why it is needed.  If the host
>>> crashes, there is no reason to look at guest state; the host should
>>> survive no matter what the guest does.
>>>
>>>
>> 
>> OK. Let me summarize it.
>> 
>> 1. Why is this patch needed? (Our requirement)
>>
>> We once hit a buggy situation: a host scheduler bug caused a guest
>> machine's vcpu to stall for a long time, which then led to a heartbeat
>> stop (while the host kept running).
>>
>> We want an efficient way to analyze such bugs whenever we hit a similar
>> situation where a guest machine misbehaves because of something on the
>> host machine's side.
>>
>> Because we have to debug both the host machine's and the guest machine's
>> sides to find the cause, we want to capture both the host machine's and
>> the guest machine's crash dumps at the same time, while the buggy
>> situation still persists.

I would argue that there are two separate bugs here: (1) a host bug
which caused the scheduling delay, and (2) putting a heartbeat service on
a virtualized guest with no real-time guarantees.

But I understand your situation.

>> 
>> 2. What will we do?
>>
>> If this bug were found in a customer's environment, we would have two ways
>> to avoid affecting other guest machines running on the same host. First, we
>> could do the bug analysis on another environment to reproduce the buggy
>> situation; second, we could migrate the other guest machines to other hosts.

You could also use tracing (there's the latency tracer and the scheduler
tracepoints) to debug this on a live system.

>> 
>> After the buggy situation is reproduced, we panic the host *manually*.
>> Then we could use userland tools to extract the guest machine's crash dump
>> from the host machine's, with the feature provided by this patch set.
>> Finally we could analyse them separately to find which side causes the
>> problem.
>> 
> 
> Could you please tell me your opinion of this patch?

I still dislike it conceptually.  But let me do a technical review of
the latest version.

> And here is a new case from the LinuxCon Japan:
> 
> Developers from Hitachi are now developing a new livedump mechanism for the
> same reason as ours. They have run into situations *many times* where guest
> machines crashed due to host failures, in particular during development.

This has happened to me as well, possibly even more times :).  I don't
use crash dumps for debugging but different people may use different
techniques.

> So they are developing this mechanism to get a crash dump while retaining
> the buggy situation between the host and guest machine. The difference
> between theirs and ours is whether or not the feature is used on a
> _customer's running machine_.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-06-14 Thread Raghavendra K T

On 05/30/2012 04:56 PM, Raghavendra K T wrote:

On 05/16/2012 08:49 AM, Raghavendra K T wrote:

On 05/14/2012 12:15 AM, Raghavendra K T wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE).


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

[...]

To summarise,
with a 32 vcpu guest with nr threads=32 we get around 27% improvement. In
very low/undercommitted systems we may see a very small improvement or a
small, acceptable degradation.



For large guests, the current SPIN_THRESHOLD value, along with ple_window,
needed some research/experimentation.

[Thanks to Jeremy/Nikunj for inputs and help in result analysis ]

I started with debugfs spinlock/histograms, and ran experiments with 32,
64 vcpu guests for spin threshold of 2k, 4k, 8k, 16k, and 32k with
1vm/2vm/4vm for kernbench, sysbench, ebizzy, hackbench.
[ spinlock/histogram gives logarithmic view of lockwait times ]

machine: PLE machine with 32 cores.

Here is the result summary.
The summary includes 2 part,
(1) %improvement w.r.t 2K spin threshold,
(2) improvement w.r.t sum of histogram numbers in debugfs (that gives
rough indication of contention/cpu time wasted)

E.g., 98% for the 4k threshold with kbench on 1 VM would imply there is a
98% reduction in sigma(histogram values) compared to the 2k case.

Result for 32 vcpu guest
========================
+---------------+------+------+------+------+
|    Base-2k    |  4k  |  8k  | 16k  | 32k  |
+---------------+------+------+------+------+
| kbench-1vm    |  44  |  50  |  46  |  41  |
| SPINHisto-1vm |  98  |  99  |  99  |  99  |
| kbench-2vm    |  25  |  45  |  49  |  45  |
| SPINHisto-2vm |  31  |  91  |  99  |  99  |
| kbench-4vm    | -13  | -27  |  -2  |  -4  |
| SPINHisto-4vm |  29  |  66  |  95  |  99  |
+---------------+------+------+------+------+
| ebizzy-1vm    | 954  | 942  | 913  | 915  |
| SPINHisto-1vm |  96  |  99  |  99  |  99  |
| ebizzy-2vm    | 158  | 135  | 123  | 106  |
| SPINHisto-2vm |  90  |  98  |  99  |  99  |
| ebizzy-4vm    | -13  | -28  | -33  | -37  |
| SPINHisto-4vm |  83  |  98  |  99  |  99  |
+---------------+------+------+------+------+
| hbench-1vm    |  48  |  56  |  52  |  64  |
| SPINHisto-1vm |  92  |  95  |  99  |  99  |
| hbench-2vm    |  32  |  40  |  39  |  21  |
| SPINHisto-2vm |  74  |  96  |  99  |  99  |
| hbench-4vm    |  27  |  15  |   3  | -57  |
| SPINHisto-4vm |  68  |  88  |  94  |  97  |
+---------------+------+------+------+------+
| sysbnch-1vm   |   0  |   0  |   1  |   0  |
| SPINHisto-1vm |  76  |  98  |  99  |  99  |
| sysbnch-2vm   |  -1  |   3  |  -1  |  -4  |
| SPINHisto-2vm |  82  |  94  |  96  |  99  |
| sysbnch-4vm   |   0  |  -2  |  -8  | -14  |
| SPINHisto-4vm |  57  |  79  |  88  |  95  |
+---------------+------+------+------+------+

result for 64 vcpu guest
========================
+---------------+------+------+------+------+
|    Base-2k    |  4k  |  8k  | 16k  | 32k  |
+---------------+------+------+------+------+
| kbench-1vm    |   1  | -11  | -25  |  31  |
| SPINHisto-1vm |   3  |  10  |  47  |  99  |
| kbench-2vm    |  15  |  -9  | -66  | -15  |
| SPINHisto-2vm |   2  |  11  |  19  |  90  |
+---------------+------+------+------+------+
| ebizzy-1vm    | 784  | 1097 |  978 |  930 |
| SPINHisto-1vm |  74  |  97  |  98  |  99  |
| ebizzy-2vm    |  43  |  48  |  56  |  32  |
| SPINHisto-2vm |  58  |  93  |  97  |  98  |
+---------------+------+------+------+------+
| hbench-1vm    |   8  |  55  |  56  |  62  |
| SPINHisto-1vm |  18  |  69  |  96  |  99  |
| hbench-2vm    |  13  | -14  | -75  | -29  |
| SPINHisto-2vm |  57  |  74  |  80  |  97  |
+---------------+------+------+------+------+
| sysbnch-1vm   |   9  |  11  |  15  |  10  |
| SPINHisto-1vm |  80  |  93  |  98  |  99  |
| sysbnch-2vm   |   3  |   3  |   4  |   2  |
| SPINHisto-2vm |  72  |  89  |  94  |  97  |
+---------------+------+------+------+------+

 From this, a value around the 4k-8k threshold seems to be the optimal one.
[ This is almost in line with the ple_window default ]
(The lower the spin threshold, the smaller the percentage of spinlocks we
cover, which results in more halt exits/wakeups.)

[ www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical
detail on covering spinlock waits ]

Beyond the 8k threshold we see no more contention, but that would mean we
have wasted a lot of cpu time in busy waits.

Will get a PLE machine again, and 'll continue experimenting with
further tuning of SPIN_THRESHOLD.


Sorry for the delayed response. I was doing a lot of analysis and
experiments.

I continued my experiments with the spin threshold. Unfortunately I could
not settle on which of the 4k/8k thresholds is better, since it depends on
the load and type of workload.

Here is the result for a 32 vcpu guest for sysbench and kernbench, for
four 8GB-RAM vms on the same PLE machine, with:


1x: benchmark running on 1 guest
2x: same benchmark running on 2 guest and so on

1x run is taken over 8*3 run averages
2x run was taken with 4*3 runs
3x run was with 6*3
4x run was with 4*3


kernbench
=
total j

Re: [PATCH 2/5] drivers/net/ethernet/dec/tulip: Use standard __set_bit_le() function

2012-06-14 Thread Akinobu Mita
2012/6/14 Takuya Yoshikawa :
> On Wed, 13 Jun 2012 08:21:18 -0700
> Grant Grundler  wrote:
>
>> >> >> Should this hash_table be converted from u16 hash_table[32] to
>> >> >> DECLARE_BITMAP(hash_table, 16 * 32) to ensure that it is aligned
>> >> >> on long-word boundary?
>> >> >
>> >> > I think hash_table is already long-word aligned because it is placed
>> >> > right after a pointer.
>> >>
>> >> I recommend converting to proper bitmap.  Because such an implicit
>> >> assumption is easily broken by someone touching this function.
>> >
>> > Do you mean something like:
>> >        DECLARE_BITMAP(__hash_table, 16 * 32);
>> >        u16 *hash_table = (u16 *)__hash_table;
>> > ?
>> >
>> > Grant, what do you think about this?
>>
>> Hi Takuya,
>> two thoughts:
>> 1) while I agree with Akinobu and thank him for pointing out a
>> _potential_ alignment problem, this is a separate issue and your
>> existing patch should go in anyway. There are probably other drivers
>> with _potential_ alignment issues. Akinobu could get credit for
>> finding them by submitting patches after reviewing calls to set_bit
>> and set_bit_le() - similar to what you are doing now.
>
> I prefer approach 1.
>
> hash_table is local in build_setup_frame_hash(), so if further
> improvement is also required, we can do that locally there later.

This potential alignment problem is introduced by this patch, because the
original set_bit_le() in the tulip driver can handle an unaligned bitmap.
This is why I recommended it be fixed in this patch.

But please just ignore me if I'm too much paranoid.  And I'll handle
this issue if no one wants to do it.


RE: Bug? 100% load on core after physically removing USB storage from host

2012-06-14 Thread Veruca Salt




> >>> qemu-kvm-0.12.1.2-2.209.el6_2.4.x86_64

We had the same problem with 0.13.
We were using it on Sandy Bridge motherboards when it happened. It was an
issue then, but we changed to 1.0 a long time ago.
Why are you using 0.12 years after it was replaced?


Re: [PATCH RFC 1/2] block: Add blk_bio_map_sg() helper

2012-06-14 Thread Asias He

On 06/14/2012 03:29 PM, Jens Axboe wrote:

On 2012-06-14 04:31, Tejun Heo wrote:

On Wed, Jun 13, 2012 at 03:41:46PM +0800, Asias He wrote:

Add a helper to map a bio to a scatterlist, modelled after
blk_rq_map_sg.

This helper is useful for any driver that wants to create
a scatterlist from its ->make_request_fn method.


This may not be possible but I really wanna avoid having two copies of
that complex logic.  Any chance blk_rq_map_bio() can be implemented in
a way that allows blk_rq_map_sg() to be built on top of it?  Also,


Was thinking the same thing, definitely code we don't want to have
duplicated. We've had mapping bugs in the past.

Asias, this should be trivial to do, except that blk_rq_map_sg()
potentially maps across bio's as well. The tracking of the prev bio_vec
does not care about cross bio boundaries.


Sure. I will try this and send v2.


--
Asias




Re: [PATCHv6 6/8] kvm: only sync when attention bits set

2012-06-14 Thread Michael S. Tsirkin
On Wed, Jun 13, 2012 at 08:38:46PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 14, 2012 at 12:04:23AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 13, 2012 at 05:53:36PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jun 13, 2012 at 11:35:07AM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 13, 2012 at 11:19:24AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jun 12, 2012 at 07:27:48PM -0300, Marcelo Tosatti wrote:
> > > > > > On Sun, Jun 03, 2012 at 10:28:29AM +0300, Michael S. Tsirkin wrote:
> > > > > > > Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
> > > > > > > attention bitmask but kvm still syncs lapic unconditionally.
> > > > > > > As that commit suggested and in anticipation of adding more 
> > > > > > > attention
> > > > > > > bits, only sync lapic if(apic_attention).
> > > > > > > 
> > > > > > > Signed-off-by: Michael S. Tsirkin 
> > > > > > > ---
> > > > > > >  arch/x86/kvm/x86.c |3 ++-
> > > > > > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index be6d549..2f70861 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -5388,7 +5388,8 @@ static int vcpu_enter_guest(struct kvm_vcpu 
> > > > > > > *vcpu)
> > > > > > >   if (unlikely(vcpu->arch.tsc_always_catchup))
> > > > > > >   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> > > > > > >  
> > > > > > > - kvm_lapic_sync_from_vapic(vcpu);
> > > > > > > + if (unlikely(vcpu->arch.apic_attention))
> > > > > > > + kvm_lapic_sync_from_vapic(vcpu);
> > > > > > 
> > > > > > void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
> > > > > > {
> > > > > > u32 data;
> > > > > > void *vapic;
> > > > > > 
> > > > > > if (!test_bit(KVM_APIC_CHECK_VAPIC, 
> > > > > > &vcpu->arch.apic_attention))
> > > > > > return;
> > > > > > 
> > > > > > Please use unlikely more carefully, when a gain is measureable:
> > > > > > http://lwn.net/Articles/420019/
> > > > > 
> > > > > Do we have to measure every single thing?
> > > > > Sometimes it's obvious: vapic is slow path, isn't it?
> > > > 
> > > > Just to clarify the question: I think it's obvious this condition is
> > > > false more often than true. By how much, depends on the workload.
> > > > Do you think this is enough to tag this unlikely?
> > > 
> > > Depends whether your processor supports flexpriority or not. I don't
> > > want to argue in favour/against the particular instance
> > > 
> > > GCC docs:
> > > 
> > > "
> > > — Built-in Function: long __builtin_expect (long exp, long c)
> > > 
> > > You may use __builtin_expect to provide the compiler with branch
> > > prediction information. In general, you should prefer to use actual
> > > profile feedback for this (-fprofile-arcs), as programmers are
> > > notoriously bad at predicting how their programs actually perform.
> > > However, there are applications in which this data is hard to collect.
> > > "
> > > 
> > > Lately half of branches in your patches are annotated.
> > 
> > So if I instrument and show that branch is almost never taken that is 
> > enough?
> > This citation does not require measuring the perf impact.
> 
> Without flexpriority its always taken.

OK I will remove this one and keep around the unlikely
for the cases that my measurements show occur in
less than 1% of cases.



Re: [PATCHv6 8/8] kvm: host side for eoi optimization

2012-06-14 Thread Michael S. Tsirkin
On Wed, Jun 13, 2012 at 06:10:15PM -0300, Marcelo Tosatti wrote:
> On Sun, Jun 03, 2012 at 10:28:43AM +0300, Michael S. Tsirkin wrote:
> > Implementation of PV EOI using shared memory.
> > This reduces the number of exits an interrupt
> > causes as much as by half.
> > 
> > The idea is simple: there's a bit, per APIC, in guest memory,
> > that tells the guest that it does not need EOI.
> > We set it before injecting an interrupt and clear
> > before injecting a nested one. Guest tests it using
> > a test and clear operation - this is necessary
> > so that host can detect interrupt nesting -
> > and if set, it can skip the EOI MSR.
> > 
> > There's a new MSR to set the address of said register
> > in guest memory. Otherwise not much changed:
> > - Guest EOI is not required
> > - Register is tested & ISR is automatically cleared on exit
> > 
> > For testing results see description of previous patch
> > 'kvm_para: guest side for eoi avoidance'.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  arch/x86/include/asm/kvm_host.h |   12 
> >  arch/x86/kvm/cpuid.c|1 +
> >  arch/x86/kvm/lapic.c|  140 
> > +-
> >  arch/x86/kvm/lapic.h|2 +
> >  arch/x86/kvm/trace.h|   34 ++
> >  arch/x86/kvm/x86.c  |7 ++
> >  6 files changed, 192 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index db7c1f2..dfc066c 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -175,6 +175,13 @@ enum {
> >  
> >  /* apic attention bits */
> >  #define KVM_APIC_CHECK_VAPIC   0
> > +/*
> > + * The following bit is set with PV-EOI, unset on EOI.
> > + * We detect PV-EOI changes by guest by comparing
> > + * this bit with PV-EOI in guest memory.
> > + * See the implementation in apic_update_pv_eoi.
> > + */ 
> > +#define KVM_APIC_PV_EOI_PENDING1
> >  
> >  /*
> >   * We don't want allocation failures within the mmu code, so we preallocate
> > @@ -484,6 +491,11 @@ struct kvm_vcpu_arch {
> > u64 length;
> > u64 status;
> > } osvw;
> > +
> > +   struct {
> > +   u64 msr_val;
> > +   struct gfn_to_hva_cache data;
> > +   } pv_eoi;
> >  };
> >  
> >  struct kvm_lpage_info {
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 7df1c6d..61ccbdf 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >  (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >  (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >  (1 << KVM_FEATURE_ASYNC_PF) |
> > +(1 << KVM_FEATURE_PV_EOI) |
> >  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >  
> > if (sched_info_on())
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index db54e63..d16cfc5 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -306,6 +306,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> > irq->level, irq->trig_mode);
> >  }
> >  
> > +static int pv_eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
> > +{
> > +
> > +   return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, &val,
> > + sizeof(val));
> > +}
> > +
> > +static int pv_eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
> > +{
> > +
> > +   return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, val,
> > + sizeof(*val));
> > +}
> > +
> > +static inline bool pv_eoi_enabled(struct kvm_vcpu *vcpu)
> > +{
> > +   return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
> > +}
> > +
> > +static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
> > +{
> > +   u8 val;
> > +   if (pv_eoi_get_user(vcpu, &val) < 0)
> > +   apic_debug("Can't read EOI MSR value: 0x%llx\n",
> > +  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
> > +   return val & 0x1;
> > +}
> > +
> > +static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
> > +{
> > +   if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0) {
> > +   apic_debug("Can't set EOI MSR value: 0x%llx\n",
> > +  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
> > +   return;
> > +   }
> > +   __set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> > +static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
> > +{
> > +   if (pv_eoi_put_user(vcpu, KVM_PV_EOI_DISABLED) < 0) {
> > +   apic_debug("Can't clear EOI MSR value: 0x%llx\n",
> > +  (unsigned long long)vcpu->arch.pv_eoi.msr_val);
> > +   return;
> > +   }
> > +   __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> >  static inline int apic_fin
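The host side of the attention-bit bookkeeping in the hunk above can be sketched as follows. This is a simplified model, not the kernel code: plain globals stand in for the vcpu state and the cached guest-memory word, and `apic_sync_pv_eoi_from_guest` is modelled after the comparison the patch's comment describes (host mirror bit vs. the guest's copy).

```c
#include <assert.h>

/* guest_mem_flag models the PV-EOI byte in guest memory (cleared by a
 * guest fast-path EOI); attention_pending models the host's
 * KVM_APIC_PV_EOI_PENDING mirror in apic_attention. */
static int guest_mem_flag;
static int attention_pending;
static int isr_cleared;

static void pv_eoi_set_pending(void)
{
    guest_mem_flag = 1;     /* pv_eoi_put_user() in the real code */
    attention_pending = 1;
}

static void apic_sync_pv_eoi_from_guest(void)
{
    /* Host set the bit but the guest copy is now clear: the guest
     * "EOI'd" without an exit, so complete the EOI (clear ISR) here. */
    if (attention_pending && !guest_mem_flag) {
        isr_cleared = 1;
        attention_pending = 0;
    }
}
```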

Re: [PATCH RFC 1/2] block: Add blk_bio_map_sg() helper

2012-06-14 Thread Jens Axboe
On 2012-06-14 04:31, Tejun Heo wrote:
> On Wed, Jun 13, 2012 at 03:41:46PM +0800, Asias He wrote:
>> Add a helper to map a bio to a scatterlist, modelled after
>> blk_rq_map_sg.
>>
>> This helper is useful for any driver that wants to create
>> a scatterlist from its ->make_request_fn method.
> 
> This may not be possible but I really wanna avoid having two copies of
> that complex logic.  Any chance blk_bio_map_sg() can be implemented in
> a way that allows blk_rq_map_sg() to be built on top of it?  Also,

Was thinking the same thing, definitely code we don't want to have
duplicated. We've had mapping bugs in the past.

Asias, this should be trivial to do, except that blk_rq_map_sg()
potentially maps across bios as well: the tracking of the previous
bio_vec does not care about crossing bio boundaries.
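The shared logic both helpers would need is the segment-merging loop. Here is a toy model of just that loop, with `struct chunk` standing in for a bio_vec's (address, length) pair; the real kernel code must additionally honor queue segment-size/boundary limits and, for blk_rq_map_sg(), carry the "previous bio_vec" state across bios as noted above.

```c
#include <assert.h>
#include <stddef.h>

struct chunk { unsigned long addr; unsigned int len; };

/* Fold physically contiguous chunks into segments, echoing how the
 * request-to-scatterlist mapping merges adjacent bio_vecs. */
static size_t map_chunks_to_segs(const struct chunk *in, size_t n,
                                 struct chunk *out)
{
    size_t nsegs = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (nsegs && out[nsegs - 1].addr + out[nsegs - 1].len == in[i].addr)
            out[nsegs - 1].len += in[i].len;   /* contiguous: extend */
        else
            out[nsegs++] = in[i];              /* start a new segment */
    }
    return nsegs;
}
```

Factoring this loop so it operates on one bio while threading the previous-chunk state through an out-parameter is what would let blk_rq_map_sg() be rebuilt on top of a per-bio helper.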

-- 
Jens Axboe



Re: [PATCHv3 RFC 1/2] kvm: implement kvm_set_msi_inatomic

2012-06-14 Thread Gleb Natapov
On Wed, Jun 13, 2012 at 07:12:26PM +0300, Michael S. Tsirkin wrote:
> > > > > @@ -134,10 +141,30 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> > > > >   irq.level = 1;
> > > > >   irq.shorthand = 0;
> > > > >  
> > > > > + /* Multicast MSI doesn't really block but might take a long time. */
> > > > > + if (unlikely(noblock && kvm_msi_is_multicast(irq.dest_id,
> > > > > +  irq.delivery_mode)))
> > > > delivery_mode? Should be dest_mode.
> 
> Yes. Good catch, thanks.
> 
> > But you probably need to check that
> > > > delivery_mode is not ExtINT either.
> 
> It does not look like anything happens with ExtInt
> if you try to trigger it from MSI.
> 
Currently no, but it should appear as if the interrupt comes from the PIC.
I wouldn't allow anything but fixed mode here, just to be on the safe side.
Lowest-priority delivery will have to loop even after introducing the irq cache.

--
Gleb.
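The gating Gleb suggests amounts to a small predicate in the atomic path: deliver only fixed-mode, non-multicast MSIs in atomic context and punt everything else (multicast, lowest-priority, ExtINT) to the non-atomic path. A minimal sketch, with illustrative constant names rather than the kernel's actual delivery-mode encodings:

```c
#include <assert.h>

enum { DM_FIXED = 0, DM_LOWEST = 1, DM_EXTINT = 7 };  /* illustrative */

/* Return nonzero iff this MSI is safe to deliver without blocking:
 * fixed delivery mode and a unicast destination. */
static int can_set_msi_inatomic(int delivery_mode, int multicast)
{
    return delivery_mode == DM_FIXED && !multicast;
}
```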